
Methodology

The great majority of speech output assessment techniques use listening experiments involving human subjects, i.e. functional and/or judgment tests of speech output at the acoustic level. In the following subsections we will discuss a number of methodological issues that are relevant especially to this type of testing. The issues concern the choice of subjects, test procedures, benchmarks and reference conditions, and precautions to ensure cross-language comparability. Although there is no a priori reason why this should be so, no accepted methodology seems to exist for other types of speech output evaluation techniques. As will be obvious from later sections, for example, no accepted methodology can be identified in the field of output evaluation at the symbolic linguistic level. It is unclear in this area what kinds of textual materials should be used in tests, what error categories should be distinguished, and what scoring procedures should be used. We will therefore limit the methodological discussion to acoustic output testing techniques involving human listeners.

Subjects

One of the most important aspects of a measuring instrument is its reliability. How reliable, for example, is subjects' performance in functional intelligibility tests when tested several times? Test/retest intrasubjective reliability of intelligibility was assessed by Logan, Greene & Pisoni (1989) and Van Bezooijen (1988); in both cases it was found to be good. More attention has been paid to subject dimensions systematically affecting intersubjective reliability. This research was motivated by the finding of large variance in the test scores, possibly obscuring effects of the synthesis systems compared. Most studies in this area examined variability in intelligibility scores. Subject dimensions considered relevant include: age, non-expert experience with synthetic speech, expert experience with synthetic speech, and analytic listening.
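By way of illustration, the following sketch (Python, with invented scores) shows one common way of quantifying test/retest reliability, namely the correlation between two test sessions for the same listeners; the studies cited above do not necessarily report this exact statistic.

    # Hypothetical percent-correct scores for ten listeners, each tested twice
    # with the same functional intelligibility test (invented numbers).
    session_1 = [62, 71, 58, 80, 66, 74, 69, 55, 77, 63]
    session_2 = [65, 69, 60, 78, 68, 72, 71, 57, 75, 66]

    def pearson(x, y):
        """Pearson correlation: one common index of test/retest reliability."""
        n = len(x)
        mx, my = sum(x) / n, sum(y) / n
        cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
        sx = sum((a - mx) ** 2 for a in x) ** 0.5
        sy = sum((b - my) ** 2 for b in y) ** 0.5
        return cov / (sx * sy)

    print(f"test/retest reliability: r = {pearson(session_1, session_2):.2f}")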

Within the ESPRIT-SAM project (Howard-Jones 1992a, 1992b), the effect of age was examined with Italian VCV-items. Five age categories were distinguished (10--19, 20--29, 30--44, 45--59, over 60), with between 5 and 8 subjects per group. The group scores of percentage correct consonant identification ranged from 58% for the oldest group to 64% for the youngest group. So, little evidence was found for an effect of the subject dimension age.

Non-expert experience with synthetic speech was investigated in several studies. Howard-Jones (1992a, 1992b) compared the performance of 8 subjects experienced with synthetic speech and 24 inexperienced subjects. German VCV-items were presented. The mean score for the experienced subjects was 79%, that for the inexperienced subjects 62%. There is further evidence that the intelligibility of synthetic speech increases as a result of non-expert experience with synthetic speech, both when acquired in the form of training with feedback (e.g. Greenspan, Nusbaum & Pisoni 1985; Schwab, Nusbaum & Greene 1985) and when acquired in a more natural way without feedback (Pisoni, Greene & Nusbaum 1985; Pisoni, Nusbaum & Greene 1985; Boogaart & Silverman 1992). The learning effect has been found to manifest itself after only a few minutes of exposure. However, there are indications that the effect of learning depends on the type of synthesis used. Jongenburger & Van Bezooijen (1992) assessed the intelligibility of two synthesis systems used by the visually handicapped for reading a digital daily newspaper, both in a first confrontation and after one month of experience. An open response CVC identification test was used. For one system, which was allophone based, consonant intelligibility increased from 58% to 79%; for the other system, which was diphone based, intelligibility increased from 63% to 68%. It was hypothesised that the characteristics of allophone-based synthesis are easier to learn because they are rule-governed and therefore more invariant than those of diphone-based synthesis. Moreover, no transfer was found from experience with one type of synthesis to the understanding of the other type of synthesis. This suggests that there is no such thing as general experience in listening to synthetic speech.

The subject dimension expert experience with synthetic speech was examined by Howard-Jones (1992a) with English VCV-items. A percentage correct consonant identification of 30% was obtained for the inexperienced subjects versus 49% for the experts. So, again, improved performance was found as a function of increased exposure.

The last subject dimension we want to mention is experience in listening analytically to speech. On the basis of a reanalysis of the results from a number of their evaluation studies, Van Bezooijen & Pols (1993) conclude that the more ear-training subjects have, the higher the percentages correct they attain. Furthermore, ear-training was found to result in a reduction of intersubjective differences.

Apart from variance which can be attributed to particular subject dimensions, much apparently individual variability is found in test scores. Hazan & Shi (1993) examined the variance in subject scores in various tests, including intelligibility of meaningless VCV-items, intelligibility of Semantically Unpredictable Sentences (SUS, see Appendix 1 G), and speech pattern identification for plosive place and voicing contrasts. A homogeneous group of subjects was used.

Despite the homogeneity of the subject group, a sizeable degree of variability was found in all tests. For the SUS the range (i.e. the difference between the best and worst performing subject) was 28%; for the CVC-test the range was 47%. At the level of speech pattern processing, considerable differences were found in the perceptual weighting given to individual cues to plosive place and voicing contrasts. Hazan & Shi attribute the variability not to audiological differences among listeners, but to the development of different perceptual strategies during language acquisition. They distinguish two types of listeners: ``auditors'' (i.e. users of acoustic information) and ``comprehenders'' (i.e. users of global contextual information).

Having established that there is much variability in the scores obtained in speech output evaluation tests, part of which can be attributed to clearly identifiable subject dimensions such as previous experience with synthetic speech, one may wonder what implications this has for the selection of subjects in specific tests. We think that the implications for subject selection depend in part on the type of test administered: black box versus glass box, comparative versus diagnostic, application for the general public versus application for a specific group of users, etc. However, some general recommendations can be formulated as well.

Recommendation 4
Always exclude hearing-impaired subjects from speech output assessment. Within the SAM project (Howard-Jones 1992a, 1992b) it is specified that subjects should pass a hearing screening test at 20 dB HL at all octave frequencies from 500 to 4000 Hz.

Recommendation 5
Never use the same subject more than once, unless, of course, one is interested in the effect of repeated exposure.

Recommendation 6
In diagnostic testing only include subjects speaking the same language (variety) as the language (variety) tested.

Recommendation 7
In diagnostic testing, hire a trained phonetician (with a basic understanding of the relationships between articulation and acoustics) in the initial stages of development of a system in order to obtain subtle information (e.g. degree of voicing in plosives), or information that is usually not used for functional purposes in real-life communication (e.g. formal aspects of temporal organisation and intonation, cf. Terken 1993).

The following recommendation is motivated not only by (possible) differences in the perception of the speech output, but also by the fact that motivation is known to play an important role in the effort people are willing to spend in order to understand suboptimal speech. If people have a choice between human and synthetic speech, the synthetic speech will have to be good if it is to stand a chance of being accepted. However, if people do not have a choice, e.g. the visually handicapped, who without synthesis (or braille) would not have access to a daily newspaper, synthesis will be accepted more easily.

Recommendation 8
In specialised applications, select subjects who are representative of the (prospective) users. For example, synthesis integrated in a reading machine for the blind should be tested with visually handicapped listeners. And synthesis intended for long-term use should be tested with subjects with different degrees of experience and familiarisation with the type of synthetic speech of interest.

Recommendation 9
Synthesis to be used by the general public for incidental purposes, i.e. which should be functionally adequate in a first confrontation, should be tested with a wide variety of subjects, including people with a limited command of the language, dialect speakers, and people of different ages. However, none of them should have any experience in listening to synthetic speech. In telecommunications research, groups of between 12 and 16 subjects (all with English as their primary language) have been found sufficient to obtain stable mean values in judgment tests.

Test procedures

As indicated in section gif, speech output assessment techniques can be differentiated along a number of parameters. No parameters related to the actual test procedure were included there. Test procedures can vary with respect to subjects (see section gif), stimuli, and response modality.

Stimuli can vary along a large number of parameters, the most important of which are listed below.

In Appendix 1, summary descriptions of tests are given where the stimuli have been categorised along these stimulus parameters.

Response modality can vary along a number of parameters as well. The choice seems to be mainly determined by three factors: comparative versus diagnostic, functional versus judgment, and TTS-development versus psycholinguistic interest. In the five types of response modalities listed below, 1 and 2 are mainly used within the glass box approach (1 in TTS-development, 2 in psycholinguistically oriented research), whereas 3, 4 and 5 are more common in the black box approach. The latter three response modalities can be further differentiated in that 3 and 4 are functional in nature (3 in TTS-development, 4 in psycholinguistically oriented research), whereas 5 represents judgment testing. In the list of response modalities a distinction is made between off-line tests, where subjects are given some time to reflect before responding, and on-line tests, where an immediate response is expected from the subjects, tapping the perception process before it is finished.

  1. Off-line identification tests, where subjects are asked to transcribe the separate elements (sounds, words) making up the test items. This response modality can be further differentiated. With respect to the nature of the set of response categories there is a choice between an open response set and a closed (forced choice) response set; transcription can be orthographic or phonetic.

  2. On-line identification tests, requiring the subject to decide whether the stimulus does or does not exist as a word in the language (so-called lexical decision task, e.g. Pisoni, Greene & Nusbaum 1985; Pisoni, Nusbaum & Greene 1985).

  3. Off-line comprehension tests, in which content questions have to be answered in an open or closed response mode (e.g. Pisoni, Greene & Nusbaum 1985; Pisoni, Nusbaum & Greene 1985).

  4. On-line comprehension tests, requiring the subject to indicate whether a statement is true or not (so-called sentence verification task, e.g. Manous et al. 1985).

  5. Judgment tests (also called opinion tests), involving the rating of scales (e.g. Pavlovic, Rossi & Espesser 1990; Delogu et al. 1991; ITU-T 1993).

The last response modality will be discussed in some more detail. Pavlovic and co-workers have conducted an extensive series of studies (cf. Pavlovic, Rossi & Espesser 1990) comparing different types of scaling methods that can be used in judgment tests to evaluate speech output. Much attention was paid to the comparison of categorical estimation and magnitude estimation.

Pavlovic et al. stress that there are important differences between the two types of scaling methods, for example the fact that categorical estimation results in an interval scale, whereas magnitude estimation results in a ratio scale. The former leads to the use of raw ratings, the calculation of the arithmetic mean, and the comparison of conditions in terms of differences; the latter leads to the use of the logarithm of the ratings, the geometric mean, and comparison in terms of ratios. The differences also have implications for the type of conclusions to be drawn from the test results. Both the categorical estimation method (with a 20-point scale) and the magnitude estimation method have been included in SOAP as standard SAM Overall Quality test procedures (see K in Appendix 1).
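To make the computational difference concrete, the sketch below (Python, with invented ratings for two systems A and B) contrasts the two analyses: raw ratings, arithmetic means and a difference score for categorical estimation, versus log-transformed ratings, geometric means and a ratio for magnitude estimation.

    import math

    # Hypothetical ratings of two synthesis systems by five listeners.
    cat_a, cat_b = [12, 14, 11, 13, 12], [15, 16, 14, 17, 15]   # 20-point categorical scale
    mag_a, mag_b = [40, 55, 35, 50, 45], [80, 95, 70, 100, 85]  # free number estimates

    # Categorical estimation -> interval scale: arithmetic means, compared as a difference.
    mean_a, mean_b = sum(cat_a) / len(cat_a), sum(cat_b) / len(cat_b)
    print(f"categorical: A={mean_a:.1f}  B={mean_b:.1f}  difference={mean_b - mean_a:.1f} scale points")

    # Magnitude estimation -> ratio scale: average the log ratings (geometric mean),
    # and compare the two systems as a ratio.
    geo_a = math.exp(sum(math.log(r) for r in mag_a) / len(mag_a))
    geo_b = math.exp(sum(math.log(r) for r in mag_b) / len(mag_b))
    print(f"magnitude:   A={geo_a:.1f}  B={geo_b:.1f}  ratio={geo_b / geo_a:.2f}")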

Recommendation 10
Use categorical estimation in rapid judgment tests with a view to test-internal comparison, and when you do, use at least a 10-point scale.

Recommendation 11
Use magnitude estimation for test-external comparison, and when you do, use line length as the response modality, with imaginary ideal speech as the reference.

Benchmarks

By a benchmark test we mean an efficient, easily administered test, or set of tests, that can be used to express the performance of a speech output system (or some module thereof) in numerical terms. The benchmark itself is the value that characterises some reference system, against which a newly developed system is (implicitly) set off. The benchmark is preferably chosen such that it represents a performance level that is known to guarantee user satisfaction. Consequently, if the performance of a new product exceeds the benchmark, its designer or prospective buyer is assured of at least a satisfactory product, and probably an even better one. Obviously, testing against a benchmark is more efficient than pairwise or multiple testing of competing products.

At this time it is too early to talk about either existing benchmarks or benchmark tests. It is clear, however, that the development of benchmarking deserves high priority in the speech output assessment field. As a first step, existing tests should be scrutinised for their potential use as benchmark tests. Choices should be made as to what aspects to include in benchmark tests (overall performance, composite performance by a number of crucial modules), and what system to adopt as the reference on which the benchmark value should be based. In this respect, it seems to us that one should not adopt the performance of human speech as the benchmark. Human speech, at least when produced by professional talkers, will simply be too good for the purpose of benchmarking. Since human speech will always be superior to synthetic speech, the quality of the latter will have to be expressed as a fraction, which makes it hard to compare the relative differences between different types of synthetic speech. What we need is a speech output system of proven, but still imperfect, quality. This is, quite probably, the reason why the quality of many speech output systems for English is often expressed relative to the `Paul' voice of MITalk/DECTalk, which has long served as the de facto standard in TTS.

Reference conditions

Next to a widely accepted benchmark, it would appear to us that designers of speech output systems should want to know how well their systems perform relative to some optimum, and what performance could be expected of a system that contains no intelligence at all. In other words, the designer is looking for topline and baseline reference conditions. Reference conditions such as these do not yield diagnostic information in the strict sense of the word. However, they do provide the systems developer with an estimate of how much improvement can still be made to a system as a whole (in a black box approach) or to specific modules (in a glass box approach).

Segmental reference conditions

It has not been general practice to include topline and baseline reference conditions in segmental quality testing (section gif). Still, it seems to us that it is important to reach consensus on a number of measures. If the output system uses waveform concatenation techniques, the designer will want to know how well the synthesis performs relative to the live human speaker or, to facilitate procedures, to some electronic operationalisation of live speech (e.g. CD quality speech recorded at a short distance from the speaker's mouth in a quiet environment). However, if the system's waveforms have been coded with a lower bitrate than CD quality, the designer should determine to what extent degradation of system performance is due to the synthesis itself as opposed to the non-optimal bitrate. An easy way to determine this is to adopt a second reference condition using the same (lower) bitrate as the synthesis. This precaution is even more necessary for parametric synthesis. Obviously, no type of parametric synthesis can be better than the maximum quality that is afforded by the analysis-resynthesis coding scheme adopted for the synthesiser. This requirement can generally be fulfilled when LPC synthesis schemes are used. However, for a range of synthesisers (e.g. the Klatt and the JSRU synthesisers) no automatic parameter estimation for straightforward analysis-resynthesis is possible at this time. The optimal parametric representation of human reference materials will then have to be found by trial and error, or else the attempt should be abandoned.

The designer of an output system claims that the intelligence incorporated into the synthesis system (e.g. through rules) makes the system perform better than with no intelligence built in at all. In order to establish the extent to which this claim is true, a baseline condition is needed which consists of a type of synthetic speech that has no knowledge of speech processes at all.

Recommendation 12
Absolute segmental topline: Use human speech produced by a designated talker, i.e. the same individual on whose speech the table values and synthesis rules were based (in the case of allophone synthesis), or who provided the basic synthesis building blocks (in the case of concatenative synthesis). The absolute topline reference will then be based on CD-quality digital speech.

Recommendation 13
Relative segmental topline for parametric synthesis: A second useful topline reference is the human reference speech (see Recommendation 12) but analysed and (re-)synthesised using exactly the same coding scheme that is employed in the speech output system to be tested.

Recommendation 14
Relative segmental topline for waveform concatenation: Use the same (lower) bitrate in the reference condition as in the speech output system.

Recommendation 15
Segmental baseline for allophone synthesis: Use speech in which all segments retain their table values and are strung together merely by smoothing spectral discontinuities at segment boundaries.

Recommendation 16
Segmental baseline for concatenative synthesis: Use speech made by stringing together coarticulatorily neutral phones (i.e. stressed vowels spoken between two /s/-es, or stressed consonants preceded by schwa and followed by an unrounded central vowel, cf. the `neutrone' condition in Van Bezooijen & Pols 1993). Minimal smoothing should be applied to avoid spectral jumps.

Prosodic reference conditions

The need for suitable topline and baseline reference conditions has clearly been recognised in the field of prosody (i.e. temporal and melodic structure, cf. section gif) testing. The following are recommendations for prosodic topline and baseline conditions. Note that, in contrast to segmental evaluation, listeners often find it very difficult to differentiate between different prosodic versions of an utterance. Therefore testers often need examples of `very bad' systems to check whether the listeners are indeed sensitive to prosodic differences.

Recommendation 17
Temporal and melodic topline: Copy, as accurately as possible within the limitations of the synthesiser, the temporal structures and speech melodies of a single designated professional human speaker onto the synthetic speech output.

Recommendation 18
Temporal baseline: Use a condition in which the smallest synthesis building blocks (phoneme, diphone, demisyllable) retain their original, unmanipulated durations as they were copied from the human original from which they were extracted (or, in the case of allophone synthesis, the phoneme duration table values, cf. Carlson, Granström & Klatt 1979).

This baseline condition, then, contains no intelligence, so that any improvement in the target conditions with duration rules must be due to the added explicit knowledge of duration structure. A reference in which segment durations vary at random (within realistic bounds) can be included for validation purposes, as an example of a `very bad' system. Listeners should rate this condition as poorer than any other condition.

Recommendation 19
Melodic baselines: Synthesise utterances on a monotone, at a pitch level that coincides with the average pitch of the test items. Also, include a random melodic reference for the sake of validation, by introducing random pitch variations (in terms of excursion size, rate of change, and segmental alignment), within physiologically and linguistically reasonable limits and with a mean pitch equal to the average of the test items.

A practical problem is that not every synthesiser allows the generation of a monotone, so that some sort of waveform manipulation (e.g. pitch-synchronous overlap-and-add, PSOLA) may have to be used in order to monotonise the synthetic melody.
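As a minimal sketch of how the two melodic reference conditions of Recommendation 19 might be generated, assume an utterance's melody is represented as a list of (time, F0) targets; the bounds and values below (Python) are invented for illustration and are not prescribed settings.

    import random

    def monotone_baseline(f0_targets):
        """Replace every F0 target by the utterance's mean pitch (monotone baseline)."""
        mean_f0 = sum(f0 for _, f0 in f0_targets) / len(f0_targets)
        return [(t, mean_f0) for t, _ in f0_targets]

    def random_melodic_reference(f0_targets, max_excursion=0.3, seed=None):
        """Random pitch variation around the utterance mean, within +/- max_excursion
        (a fraction of the mean). A simplification: a full implementation would also
        constrain excursion size, rate of change, and segmental alignment."""
        rng = random.Random(seed)
        mean_f0 = sum(f0 for _, f0 in f0_targets) / len(f0_targets)
        return [(t, mean_f0 * (1.0 + rng.uniform(-max_excursion, max_excursion)))
                for t, _ in f0_targets]

    # Example: (time in s, F0 in Hz) targets for a short utterance.
    targets = [(0.0, 120), (0.3, 160), (0.6, 140), (0.9, 110), (1.2, 95)]
    print(monotone_baseline(targets))
    print(random_melodic_reference(targets, seed=1))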

Voice characteristics reference conditions

In the area of voice characteristics (voice quality, section gif), the problem of reference conditions has not been recognised. Generally, there seems to be little point in laying down a baseline reference for voice quality. The choice of a suitable topline would depend on the application of the speech output system. If the goal is personalised speech output (for the vocally handicapped) or automatic speaker conversion (as in interpreting telephony), the obvious topline is the speaker who is being modelled by the system, using the same coding scheme when applicable. When a general purpose (i.e. non-personalised) speech output system is the goal, one would first need to know the desired voice quality, i.e. ideal voices should be defined for specific applications, and speakers should be located who adequately represent the ideal voices. At this time we will refrain from making any further suggestions on this matter. The definition of `ideal' voices and voice qualities, and the implementation of topline references should be a matter of priority in the near future.

Overall quality reference conditions

Given the existence of an overall quality topline reference condition, it would be advantageous to have a set of reference conditions that are poorer than the optimum by a number of calibrated steps, until a quality equal to or less than the baseline reference is reached (see also section gif). Such a set of reference conditions would yield a grid within which each type of speech, whether produced by humans or by machines, can be located and compared with other types of speech. Recently, attempts have been made at creating such a continuum of reference conditions by taking high-quality human speech and applying some calibrated distortion to it, such as multiplicative white noise at various signal-to-noise ratios (the `Modulated Noise Reference Unit' or MNRU, cf. ITU-T Recommendation P.81), or time-frequency warping (TFW, ITU-T Recommendation P.85, cf. Burrell 1991; or T-reference, cf. Cartier et al. 1992).

TFW introduces greater or lesser (random) deviations from the mean rate of a recorded utterance (ranging from 2.5% to 20%) over successive stretches of 150 ms, so that the speech contains potentially disturbing accelerations and decelerations and associated frequency shifts. Fellbaum, Klaus & Sotscheck (1994) showed that the MNRU is not suitable for the evaluation of synthetic speech. TFW of natural speech, however, provided a highly sensitive reference grid within which TTS systems could be clearly differentiated from each other in terms of judged listening effort (Johnston 1993). Moreover, Johnston showed that the perceived quality ordering among a range of TTS systems interacts with the sound pressure level at which the speech output is presented.
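The sketch below (Python/NumPy, invented parameters) illustrates the basic idea of a TFW-like degradation: successive 150 ms stretches of a recording are played back at randomly perturbed rates, producing the accelerations, decelerations and frequency shifts described above. It is a rough approximation for illustration only, not an implementation of ITU-T Recommendation P.85.

    import numpy as np

    def tfw_like(signal, sample_rate, deviation=0.10, stretch_ms=150, seed=0):
        """Resample each 150 ms stretch by a random factor of up to +/- `deviation`,
        so that rate and frequency vary from stretch to stretch (illustrative only)."""
        rng = np.random.default_rng(seed)
        stretch_len = int(sample_rate * stretch_ms / 1000)
        warped = []
        for start in range(0, len(signal), stretch_len):
            chunk = signal[start:start + stretch_len]
            factor = 1.0 + rng.uniform(-deviation, deviation)
            new_len = max(1, int(round(len(chunk) / factor)))
            # Linear resampling: fewer samples = faster playback and higher pitch.
            positions = np.linspace(0, len(chunk) - 1, num=new_len)
            warped.append(np.interp(positions, np.arange(len(chunk)), chunk))
        return np.concatenate(warped)

    # Example: warp one second of a 200 Hz tone sampled at 16 kHz by up to 10%.
    fs = 16000
    tone = np.sin(2 * np.pi * 200 * np.arange(fs) / fs)
    distorted = tfw_like(tone, fs, deviation=0.10)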

Recommendation 20
Use time-frequency warping of optimal human speech to create a grid of overall quality reference conditions.

Comparability across languages

Although it is generally agreed that in the final analysis all languages are equally complex, it cannot be denied that phonetic and phonological complexity differs widely from one language to the next. Languages differ in the size of their vowel and consonant inventories, in the complexity of syllable structures, stress rules, reduction processes, and so forth.

A number of systems are (commercially) available that provide multi-lingual speech output (e.g. DECTalk, Infovox, Multivox, Apollo). Generally, such systems were primarily developed for one language (American English, Swedish, Hungarian, and British English, respectively), and additional language modules were derived from the original language by making minimal changes to the basic units and rules. As a result it is commonly observed that the derived languages of multi-lingual systems sound poorer than the original. Yet, it is very difficult to establish this convincingly, since the poorer performance may be due (completely or in part) to the greater intrinsic difficulty of the sound system of the new language. Ultimately one would like to develop speech output assessment techniques that allow us to determine the quality of a system speaking language A and to compare its quality to that of another system speaking language B. In order to reach this objective, we would have to know how to weight the scores obtained for a language for the intrinsic difficulty or complexity of the relevant aspects of that language.

Such goals will not easily be accomplished. However, steps have been taken in the SAM project to ensure optimal cross-language comparability in the construction of the test materials and administration procedures. For example, in the Semantically Unpredictable Sentence Test (SUS Test, see Appendix 1 G), the same five syntactic structures (defined as linear sequences of functional parts of speech, e.g. Subject--Verb--Direct Object) are used in all languages tested, and words selected from the same lexical categories, and with the shortest word length allowed by the language, are substituted in each of the designated syntactic slots (see G in Appendix 1). It should be obvious, however, that complete congruence cannot be obtained in this fashion: the shortest content words in Italian and Spanish are typically disyllabic, whilst they are monosyllabic in the Germanic languages. Similarly, although all five syntactic structures occur in each of the languages tested, certain structures will be more common in one language than in another. Given the existence of such intrinsic and unavoidable structural differences between languages, we recommend further research into the development of valid cross-language normalisation measures.
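To make the construction procedure concrete, here is a small sketch that fills one SUS-style syntactic frame with randomly chosen words from lexical category lists; the frame and the mini-lexicon are invented for illustration and are not the official SAM materials.

    import random

    # One SUS-style frame: a linear sequence of functional parts of speech
    # (roughly Subject -- Verb -- Direct Object), filled with short content words.
    frame = ["DET", "NOUN", "VERB", "DET", "ADJ", "NOUN"]

    # Invented mini-lexicon of short words per lexical category.
    lexicon = {
        "DET":  ["the"],
        "NOUN": ["boat", "chair", "sky", "law", "spoon"],
        "VERB": ["eats", "paints", "hears", "breaks"],
        "ADJ":  ["green", "loud", "flat", "sweet"],
    }

    def make_sus(frame, lexicon, rng):
        # Random substitution per slot keeps the sentence syntactically well formed
        # but semantically unpredictable, so listeners cannot guess words from context.
        return " ".join(rng.choice(lexicon[slot]) for slot in frame)

    rng = random.Random(42)
    print(make_sus(frame, lexicon, rng))   # prints one semantically unpredictable sentence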

Especially when working within the European Union, with its increasing number of partner countries and languages, speech output products are likely to be produced on a multilingual basis. The further development of efficient testing procedures that can be validly used for all relevant languages is a clear priority. Yet, as we explained in section gif, we should not raise our hopes too high in this matter, given the existence of intrinsic and unavoidable structural differences between languages. For this reason we recommend parallel research into the development of valid cross-language normalisation measures that will allow us to realistically compare speech output test results across languages, if the choice of test materials cannot be balanced in all relevant linguistic aspects.

In this effort, ITU recommendation P.85 has potential. Following this procedure (see section gif) a reference grid can be constructed for each (EU) language. One possible outcome could be that some languages prove more resistant to time-frequency warping than others (although we hesitate to make any predictions). Be this as it may, differences in intelligibility between languages would be effectively normalised out when we determine the quality of an output system relative to the reference grid that is applicable to the language being tested.
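A sketch of how such normalisation might work in practice (Python/NumPy, invented scores): a system's judged quality is mapped onto the language-specific TFW reference grid and expressed as the warp percentage of natural speech in that language that received the same mean judgment.

    import numpy as np

    def equivalent_warp(system_score, grid_warps, grid_scores):
        """Return the TFW warp percentage whose natural-speech reference condition
        received the same mean judgment as the system (linear interpolation).
        All grid values here are invented for illustration."""
        # np.interp needs increasing x-values, so interpolate on the reversed grid
        # (judged quality decreases as warp increases).
        return float(np.interp(system_score, grid_scores[::-1], grid_warps[::-1]))

    warps = np.array([0.0, 2.5, 5.0, 10.0, 20.0])          # warp steps in per cent
    grid_lang_a = np.array([4.6, 4.3, 3.9, 3.1, 2.0])      # mean judgments, language A
    grid_lang_b = np.array([4.5, 4.4, 4.1, 3.4, 2.3])      # mean judgments, language B

    # The same raw judgment (3.5) corresponds to different warp equivalents in the
    # two languages, so language-intrinsic differences are normalised out.
    print(equivalent_warp(3.5, warps, grid_lang_a))
    print(equivalent_warp(3.5, warps, grid_lang_b))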


