Next: Glass box approach Up: Synthesis assessment Previous: Methodology

Black box approach

Black box assessment tests a system's performance as a whole, without considering the performance of modules internal to the system. Ideally, within black box testing, one would want to have at one's disposal a functional test to assess the adequacy of the complete speech output in all respects: does the output function as it should? Such a test does not exist, and is difficult to conceive. In practice, the functional quality of overall speech output has often been equated with comprehensibility: to what extent can synthesised continuous speech be understood by listeners?

Speech comprehension is a complex process involving the interpretation and integration of many sources of information. Important sources of information in complete communication situations, where both auditory and visual information are available to interactants, are:

Speech signal information at different levels (segments, prosody, voice characteristics),
Segment combinatory probabilities (e.g. /str../ is a permissible consonant sequence at the onset of words in many EU languages, but all other permutations of this sequence are illegal (e.g. /tsr.../,
Knowledge of which segment strings are existing words in the language (e.g. the permissible string /strIk/ is not a word in English),
Word combinatory probabilities (e.g. the article ``the'' will tend to be followed by nouns rather than verbs),
Semantic coherence (e.g. in the context of ``arrive'' a word like ``train'' is more probable than a word like ``pain''),
Meaning extracted from the preceding linguistic context; due to the repetition of words and the progressive building up of meaning, the last sentence of a text will generally be easier to understand than the first,
World knowledge and expectations of the listener based on previous experience,
Cues provided by the extra-linguistic context in which the message is spoken (e.g. facial expressions and gestures of the speaker, relevant things happening in the immediate environment).

In normal daily life all these different sources, and others, may be combined by listeners to construct the meaning of a spoken message. As a result, in applied contexts the contributions of separate sources are difficult to assess. Laboratory tests typically try to minimise or control for the effects of at least some of the sources in order to focus on the auditory input. Some segmental intelligibility tests at the word level (such as the SAM Standard Segmental Test, see Appendix I A) try to minimise the effects of all sources except (1) and (2): only meaningless but permissible consonant-vowel-consonant combinations (e.g. /hos/) or even shorter items (/ze, ok/) are presented to the listener. In comprehensibility tests, factor (8) is excluded completely and (7) as far as possible. The latter is done by selecting texts with supposedly novel information for all subjects.

No completely developed standardised test, with fixed test material and fixed response categories, for evaluating comprehension is available, but one wonders whether this would be very useful in the first place, since it is not clear what the ``average'' text to be used should look like in terms, for example, of the complexity and type of vocabulary, grammatical structures, sentence length, and style. At this level of evaluation it seems a good idea to take the characteristics of the intended application into account . Testing the comprehensibility of speech output destined to provide traffic information asks for a more specific type of test materials (e.g. short sentences, only statements, restricted range of lexical items, formal style) than speech output to be used for reading a digital daily newspaper for the blind, where the test materials should be more varied in all respects. The greatest variation should probably be present in speech material testing text-to-speech systems developed to read novels to the visually handicapped.

As to the type of comprehension test, several general approaches can be outlined. The most obvious one involves the presentation of synthesised texts at the paragraph level, preferably with human produced versions as a topline control, with a series of open or closed (multiple choice) questions. Results are expressed in terms of percent correct responses. An example of a closed response approach is =1 (

; Pisoni Greene Nusbaum 1985) , =1 (

; Pisoni Nusbaum Greene 1985) , who used 15 narrative passages selected from standardised adult reading comprehension tests. Performance was compared between listening to synthetic speech, listening to human speech, and silent reading. Each condition was tested with 20 subjects. One of the most important findings was a strong learning effect for synthetic speech within a very short time and the absence of clear differences among the test conditions.

At first sight, the results of closed response comprehension tests seem counterintuitive: although the human produced texts sound better than the synthetic version, often no difference in comprehension is revealed ( =1 (

; Nye Ingemann Donald 1975) ;

=1 (

; Delogu et al. 1992) ) or, after a short period of familiarisation, even superior performance for synthetic speech ( =1 (

; Pisoni Greene Nusbaum 1985) , =1 (

; Pisoni Nusbaum Greene 1985) ) is observed. These results have been tentatively explained by hypothesising that subjects may make more of an effort to understand synthetic speech. This could be expected to lead to:

Slower reaction times in a sentence verification test.
A decrease in performance as a function of fatigue.
Poorer performance for secondary tasks.

Confirmation of the first prediction was found by =1 (

; Manous et al. 1985) . The second and third predictions were tested by =1 (

; Luce Feustel Pisoni 1983) , using a word recall test, and by

=1 (

; Boogaart Silverman 1992) , using a tracking task. The first study revealed a significant effect, whereas the second did not.

However, the lack of differentiation in comprehensibility between human and synthetic speech in the above studies may also be due to the use of the closed response approach, where subjects have a fair chance of guessing the correct answer. Open response tests are known to be more sensitive, i.e. more apt to bring to light differences among test conditions. An example of an open response study is =1 (

; Van Bezooijen 1989) . She presented five types of texts typically found in daily Dutch newspapers, pertaining to the weather, nature, disasters, small events, and sports, to 16 visually handicapped subjects. An example of a question testing the comprehensibility of the weather forecasts is: What will the temperature be tomorrow? The questions were sensitive enough to yield significant differences in comprehensibility among two text-to-speech converted (one automatic and one manually corrected) and a human produced version of the texts. Crucially, the results also suggest that the effect of the supposedly greater effort expended in understanding synthetic speech has its limits. If the synthetic speech is bad enough, increased effort cannot compensate for loss of quality.

The tests described ask subjects to answer questions after that the texts have been presented, thus measuring the final product of text interpretation. In addition to these off-line tests, more psycholinguistically oriented on-line approaches have been developed which request instantaneous reactions to the auditory material being presented. These tests primarily aim at gaining insight into the cognitive processes underlying comprehension: to what extent is synthetic speech processed differently from human speech? To name but a few of these psycholinguistic tests:

The word monitoring task, requiring subjects to press a button as soon as they hear a prespecified word,
The sentence-by-sentence listening task, in which subjects push a button whenever they are ready for hearing the next sentence (comprehension is checked afterwards but is not part of the test proper),
The sentence verification test, where subjects have to decide whether short sentences are true statements or not (e.g. Mud is dirty and Rockets move slowly).

All three are on-line measures, the first indexing cognitive workload, the second and third assessing speed of comprehension. On-line tests of this type, which invariably reveal differences between human and synthetic speech, have been hypothesised to be more sensitive than off-line measures ( =1 (

; Ralston et al. 1991) ). However, the results of such psycholinguistic tests (``subjects responded significantly faster to system A (740 ms) than to system B (930 ms)'') are less interpretable for non-scientists than those of comprehension tests (``subjects answered 74% of the system A questions correctly versus 93% of the system B questions). On the other hand, insight into cognitive load may ultimately prove important in double task applications.

Recommendation 21
Try to avoid the use of functional tests to assess overall output quality: on-line reaction time tests are difficult to interpret and off-line comprehension tests are difficult to develop.

Recommendation 22
If determined to develop a comprehension test, beware of the fact that reading tests may be too compact to be used as listening tests; adapt the materials or use materials that are meant to be listened to.

Recommendation 23
Use open comprehension questions rather than closed ones, the former being more sensitive than the latter.

Recommendation 24
When administering a comprehension test, include a top-line reference with a dedicated speaker realising exactly the same texts presented in a synthetic version; use different groups of subjects for the various speech conditions (or better still block conditions over listeners such that no listener hears more than one version of the same text while at the same time each listener gets an equal number of different text versions).

Recommendation 25
When interpreting comprehension results, look at difference scores (synthetic compared to human) rather than at absolute scores to abstract from the intrinsic difficulty of questions.

Judgment laboratory tests

The black box tests described so far are functional in nature. However, instead of evaluating overall quality functionally, subjects can also indicate their subjective impression of global quality aspects of synthetic output by means of rating scales. Taking comprehensibility as an example, a functional task would be one where subjects answer a number of questions related to the content of a text passage as described above. Alternatives from a judgment point of view include:

Paired comparison, where subjects indicate which of two synthesisers sounds more comprehensible,
Magnitude estimation, where subjects assign a value expressing, or draw a line of a length which is equal to the magnitude of, their impression of comprehensibility,
Categorical estimation, where subjects rate synthesisers, for instance, along a 10-point scale which runs from 1: extremely incomprehensible to 10: extremely comprehensible.

Some methodological aspects of the second and third method are described in detail in section . There it is also indicated that magnitude estimation is relatively laborious and more fit for test external comparison, whereas categorical estimation is relatively fast and easy, and more fit for test internal comparison.

Both the magnitude (continuous scale) and categorical estimation (20-point scale) methods have been included in SOAP in the form of the SAM Overall Quality Test (see K in Appendix 1). Three scales are recommended, related to:

Intelligibility (How identifiable does the message sound?),
Naturalness (To what extent does the message sound like being produced by a human speaker?),
Acceptability (The overall user's satisfaction with the communicative situation).

The intelligibility and naturalness ratings are based on pairs of (unrelated) sentences. Fixed lists of 160 sentences of varying content and length are available for Dutch, English, French, German, Italian, and Swedish. Examples for English are: I realise you're having supply problems but this is rather excessive and I need to arrive by 10.30 a.m. on Saturday. For the acceptability ratings, application specific test materials are recommended. The magnitude and categorical estimation procedures have been applied to speech output in a number of studies (e.g. =1 (

; Pavlovic Rossi Espesser 1990) ; =1 (

; Delogu et al. 1991) ;

=1 (

; Goldstein Lindström Till 1992) ). Methodological aspects, such as the effects of stimulus range and the number of categories, relationships among methods, reliability, and validity, are emphasised.

The importance of application-specific test materials is also stressed by the International Telecommunication Union Telecommunication Standardisation (ITU-T) sector (see L in Appendix 1). They developed a test specifically aimed at evaluating the quality of telephone speech (where synthesis could be the input). It is a categorical estimation judgment test comprising ratings on (a subset of) eight scales:

Acceptance
Overall impression
Listening effort
Comprehension problems
Articulation
Pronunciation
Speaking rate
Voice pleasantness

The first scale is a 2-point scale, the other ones are 5-point scales. Strictly speaking, only the first four scales can be captured under the heading overall quality; the other four scales are directed at more specific aspects of the output and require analytic listening. The content of the speech samples presented should be in accordance with the application. Examples of application-specific test items are: Miss Robert, the running shoes Adidas Edberg Pro Club, colour: white, size: 11, reference: 501-97-52, price 319 francs, will be delivered to you in 3 weeks (mail order shopping) and The train number 9783 from Poitiers will arrive at 9:24, platform number 3, track G (railway traffic information). In addition to rating the eight scales, subjects are required to reproduce information contained in the message. A pilot study has been run by =1 (

; Cartier et al. 1992) .

=1 (

; Fellbaum Klaus Sotscheck 1994) tested 13 synthesis systems for German using the ITU-T Overall Quality Test as well as open response functional intelligibility tests. Waveform concatenative synthesis systems proved measurably better than formant synthesis systems.

=1 (

; Van Bezooijen Jongenburger 1993) employed a similar series of judgment scales as proposed by the ITU-T in a mixed laboratory/field study which addressed the suitability of synthetic speech within the context of a digital daily newspaper for the blind (see section ). Their battery comprised ten 10-point scales:

Intelligibility
General quality
Naturalness
Precision of articulation
Accuracy of pronunciation
Pleasantness of voice
Adequacy of word stress
Appropriateness of tempo
Liveliness
Fluency

Again a distinction can be made between scales relating to overall quality (the first three scales), and the other scales, relating to specific aspects of the speech output. A factor analysis yielded two factors, the first with high loadings of intelligibility, general quality, and precision of articulation, the second with high loadings of naturalness, pleasantness of voice, and adequacy of word stress. Intelligibility and naturalness were taken by the authors to be the two central dimensions underlying the evaluative judgments.

Recommendation 26
Since there is no consensus on the most appropriate judgment scales to evaluate overall quality, choose between:

Intelligibility, naturalness, and acceptability (SAM Overall Quality Test),
Acceptance, overall impression, listening effort, and comprehension problems (ITU-T), or only listening effort (practice in telephony),
Intelligibility, general quality, and naturalness ( =1 (
; Van Bezooijen Jongenburger 1993) )

Recommendation 27
It is important that the scale positions have a clear meaning for the subjects and that the scale is wide enough to allow differentiation among systems compared. Use at least a 10-point scale.

Field testing

Preliminary remarks

In the previous section, the black box approach to speech output evaluation was operationalised within a laboratory context. From an experimental point of view, the main advantage of a laboratory study is the control over possibly interfering factors. However, ultimately it is the functioning of a speech output system in real life, with all its variability, that counts. If overall quality is extended to include all aspects of the synthesis in the context of an application, testing may be necessary in the field. Due to the variety of applications, it is difficult to summarise the aspects which field tests have in common. To illustrate the diversity, below some examples will be given.

Field tests

A combined laboratory/field functional/judgment test, with equal attention to the speech output itself and the context within it is used, was done by

=1 (

; Van Bezooijen Jongenburger 1993) . They used the following suite of four tests to evaluate the functioning of an electronic newspaper for the visually handicapped:

An interview enquiring after the subjects' attitudes towards the technology,
A functional open response identification test with CVC-words,
Judgments on 10 evaluative scales for continuous text passages (1: extremely bad, 10: extremely good) related to global quality and more specific aspects of the speech output, such as pleasantness of voice (see ), adequacy of word stress, appropriateness of tempo, liveliness, and fluency (see for further details),
A functional test assessing the subjects' proficiency in finding their way through the newspaper. This test involved a number of searches, such as Is there an article on Japan in the economy section? Proficiency was assessed both in terms of percent correct answers and task completion time.

Each of 24 visually handicapped subjects was visited at home, at three points in time. Since the subjects lived scattered all over the Netherlands, administration of the suite of tests was very time consuming.

Comparable studies have been conducted to evaluate a digital daily newspaper in Sweden ( =1 (

; Hjelmquist Jansson Torell 1987) ). However, the experimental set-up to assess the quality of various aspects of the Swedish speech output was less strict: most information was obtained through interviews. On the other hand, much emphasis was placed upon the reading habits of the users: all keystrokes were registered during long periods of time, so that the frequency of use of all reading commands (e.g. next sentence, previous sentence, talk faster, talk in letters) could be determined.

A semi-field study combining function and judgment testing within the context of telephone information services was done by =1 (

; Roelofs 1987) . In this test resynthesised human speech was used, but the set-up and results can be generalised to synthetic speech output. Two applications were considered, namely directory assistance (the subject puts his request to an operator and then the number is spoken by the computer twice, thus freeing the operator for the next subscriber) and a service for train departure times (in a single pre-stored message the departure times of a number of trains with different destinations are given). In the former application a human operator served as a reference, in the second high-quality PCM speech was presented. Subjects were sent the instructions in advance and dialled the two services from their homes. The availability of interrupt facilities and speaking rate were examined. Both actual performance (success in writing down the requested data) and subjective reactions were registered (14 5-point scales such as bad--good, impersonal--personal, inefficient--efficient) and two questions, namely: Do you find this way of information presentation acceptable? and Do you think this service could replace the current service? Due to several factors, the results are of limited value. However, the method is a nice example of how different approaches to testing can be combined.

With a view of exploring the possibilities of synthetic speech for a name and address telephone service, =1 (

; Delogu et al. 1993) tested six Italian TTS systems by presenting lexically unpredictable VCV and CV sequences in an open response format. Intelligibility scores dropped from 31 to 21 percent when the same materials were listened to through a telephone line rather than good quality headphones. Curiously enough, the best TTS systems suffered most from telephone bandwidth limitation.

Finally, an important area of speech output evaluation are applications where people are required to process auditory information while simultaneously performing some other task, involving hands and eyes, for instance to write down a telephone number or land an aircraft. The requirements imposed by double tasks like these have been simulated for instance by having subjects answer simple questions related to the content of short synthesised messages while at the same time tracking a randomly moving square on a video monitor by moving a mouse ( =1 (

; Boogaart Silverman 1992) ). This type of laboratory study could and should be extended to more real-life situations. Other important areas are field tests where the functioning of speech output is tested under various noise conditions, and combinations of noise and secondary tasks.

Since field tests will often have to meet specific requirements, it is not realistic to think in terms of standard tests and standard recommendations. Each case will have to be examined in its own right. In order to get an overview of complex test situations that may arise,

=1 (

; Jekosch Pols 1994) recommend a ``feature'' analysis to define a test set-up, where features are all aspects relevant to the choice of the test. Their analysis comprises three steps, naturally leading to a fourth step:

Determine the application conditions (What is to be tested? What are the properties of the material generated?), resulting in a feature profile of the application scenario.
Define the best possible test matching this feature profile.
Make a comparison of what is desired and what is available in terms of tests.
Adapt tests or develop your own test.

Because of the specific nature of some applications, often there will be no ready-made test available, so that it is perhaps better to talk of (suggestions for) test approaches than tests. Moreover, a single test will generally not suffice, but a suite of tests will be needed instead. In this suite both functional and judgment tests can be included. Interviews can be part of the evaluation as well. Moreover, it is possible to administer laboratory type experiments in a field situation. This can be done, for example, by preparing stimulus tapes beforehand and playing them to subjects in the environment where the synthesis system will be used.

Next: Glass box approach Up: Synthesis assessment Previous: Methodology

WWW Administrator
Fri May 19 11:53:36 MET DST 1995

Black box approach

Laboratory testing

Functional laboratory tests

Judgment laboratory tests

Field testing

Preliminary remarks

Field tests