Next: Further developments in Up: Synthesis assessment Previous: Black box approach

Glass box approach

Linguistic aspects

In this section we shall deal with evaluation procedures that have been, or can be, followed when modules in a text-to-speech system yield some intermediary symbolic output. As was stated above, there are no established methods for evaluating the quality of linguistic modules in speech output testing. As a result there is no agreed-upon methodology in this area nor are there evaluation experts; what little evaluation work is done, is done by the same researchers who developed the modules. In view of the lack of an established methodology we will refrain from making recommendations on the use of specific linguistic tests and test procedures. The need for a more general research effort towards a general methodology in the field of linguistic testing will be discussed in section .

Preprocessing

The first stage of a linguistic interface makes decisions on what to do with punctuation marks and other non-alphabetic textual symbols (e.g. parentheses), and expands abbreviations, acronyms, numbers, special symbols, etc. to full-blown orthographic strings, as follows:

There are no standardised tests for determining the adequacy of text preprocessors. Yet is seems that all preprocessors meet with the same classes of transduction problems, so that it would make sense to set up a multi-lingual benchmark for preprocessing. =1 (

; Laver et al. 1988) , =1 (

; Laver McAllister McAllister 1989) ), describing the internal structure of the CSTR text preprocessor, mention a number of transduction problems and present some quantification of their errors in the various categories, which we recapitulate in Figure . The test was run on a set of anomaly-rich texts taken from newspapers and technical journals.

Figure: Percent correct treatment of textual anomalies by CSTR text preprocessor (after =1 (

; Laver et al. 1988 : 12--15 ) )

The results in Figure are not so much revealing in terms of the numerical information they offer, but by the taxonomy of errors opted for. The only other formal evaluation of a text preprocessor that we have managed to locate uses a completely different set of error categories. =1 (

; Van Holsteijn 1993) presents an account of a text preprocessor for Dutch, and gives the results of a comprehensive evaluation of her module. She observes that the use of abbreviations, acronyms and symbols differs strongly from text to text. She broadly distinguishes three types of newspaper text: editorial text on home and foreign news, editorial text on sports and business, and telegraphic-style text (i.e. classified adds, film & theatre listings, radio & television guide). Text segmentation errors were separately counted for:

Sentence demarcation.
Expression demarcation (within sentence subunits that need some specific type of preprocessing).

Correctly demarcated expressions could then be characterised further in terms of:

Labelling errors,
Expansion errors.

Finally, a distinction is made between unavoidable and avoidable errors. The former type would be the result of incorrect or unavailable syntactic/semantic information that would be needed in order to choose between alternative solutions. The latter type is the kind of error that needs correction, either by the addition of new rules or by inclusion in the exceptions lexicon. Figure presents some results.

Figure: Evaluation results for text preprocessor TextScan (after =1 (

; Van Holsteijn 1993) ). Percent of avoidable errors in four categories; percent unavoidable errors in parentheses; N specifies the 100% base per cell.

Both =1 (

; Laver et al. 1988) 's and =1 (

; Van Holsteijn 1993) 's proposals represent rather crude, and disparate, approaches towards a taxonomy of errors of a text preprocessor. What is clearly needed for the evaluation of text preprocessors, is a more principled analysis of the various tasks a text preprocessor has to perform, focussing on those classes of difficulties that crop up in any (European) language. Procedures should be devised that automatically extract representative items from large collections of recent text (newspapers) in each of the relevant error categories, so that multi-lingual tests can be set up efficiently. Once the test materials have been selected, the correct solutions to, for instance, expansion problems can be extracted from existing databases, or when missing there, will have to be entered manually.

Grapheme-phoneme conversion

By grapheme-phoneme conversion we mean a process that accepts a full-blown orthographic input (i.e. the output of a preprocessor), and outputs a string of phonemes. The output string does not yet contain (word) stress marks, (sentence) accent positions, and boundaries. The correct phonemic representation of a normally spelled word depends on its linear context and hierarchical position (e.g. assimilation to adjacent words: I have to go /aI h{ f t@ g@U/ but I have two goals /aI h{ v tu: g@Ulz/; or the choice between homographs: I lead /li:d/ but made of lead /led/. Therefore the adequacy of grapheme-phoneme conversion modules should not, in principle, be tested on the basis of isolated word pronunciation (citation forms). In practice, however, this is precisely what is done. The reasons for this are threefold:

For many languages pronunciation databases (or machine readable pronouncing dictionaries) are available, which are exclusively based on isolated words.
The adaptation rules for word pronunciation in context are (believed to be) straightforward, exceptionless, and easy to implement, and
Machine readable phonemic transcriptions of continuous prose are scarce.

Figure presents results of a multi-lingual evaluation of grapheme-phoneme converters for seven EU languages, performed within ESPRIT 291/860 ``Linguistic analyses of European languages,'' based on isolated word pronunciation. Since it has often been reported that many more conversion errors occur in proper names than in ordinary words, the evaluation distinguished between four types of materials:

Newspaper texts of at least 100,000 words,
The names of the 150 largest towns per country,
Names of 31 European capitals in the national spelling, and
The 150 most frequent Christian names for each language.

Figure: Percent correct grapheme-phoneme conversion in seven EU languages in four types of materials (after =1 (

; Pols 1991 : 394 ) )

Note: Newspaper scores are weighed for token frequency. Higher first score for French excludes all preprocessing errors; higher first German score is based on the use of an exceptions list.

Incidentally, the results should not be taken to indicate that spelling is harder to convert to phonemes in Italian than in any other language, since different conversion methods were used in each language; however, Italian proper names are no more a problem than ordinary text words. In English and French spelling the proper names do present a serious problem, so that exceptions lists will be a priority there.

In a complementary test =1 (

; Nunn Van Heuven 1993) compared the performance of three grapheme-phoneme converters for Dutch, i.e. two systems with no or only implicit morphological decomposition ( =1 (

; Kerkhoff Wester Boves 1984) ; =1 (

; Berendsen Langeweg Van Leeuwen 1986) and one that included the MORPA morphological decomposition module. About 2,000 simplex and complex (see section ) test words were selected from newspaper texts that did not belong to the 10,000 most frequent Dutch words, so that dictionary look-up would fail. Phoneme, syllabification, and stress placement errors were found by automatic comparison with a hand-made master transcription file. The earlier converters performed at a success rate of 60% and 64%, which is considerably poorer than the newspaper text score in Figure . The newer system with explicit morphological decomposition was correct in 78 percent.

Word stress

Stressed syllables are generally pronounced with greater duration, greater loudness (in terms of acoustical intensity as well as pre-emphasis on higher frequencies), and greater articulatory precision (no consonant deletions, more peripheral vowel formant values). Moreover, when a word is presented in focus, a prominence-lending fast pitch movement is executed on the stressed syllable of that word. With the exception of French, where the stress is always on the last full syllable of the word, the position of the stress varies from word to word in all other EU languages. However, stress position in these languages is predictable to a large extent by rules that look at:

The internal make-up of words (in terms of their constituent morphemes, the hierarchical relationships between adjacent morphemes, and the lexical categories of these morphemes), and
The segment structure of the syllables making up the morphemes.

All the EU languages have a proportion of idiosyncratic words that do not comply with the proposed stress rules for diverse reasons. Therefore the coverage of stress rule systems has to be evaluated, and errors have to be corrected by including the problematic words in an exceptions dictionary.

Tests of stress rule modules have been performed only on an ad hoc basis, either checking the output of the rules by hand ( =1 (

; Barber et al. 1989) for Italian), or automatically (using the phonemic transcription field in lexical databases containing stress marks ( =1 (

; Langeweg 1988) for Dutch), which in turn had been checked by hand in some earlier stage of the database development) .

Morphological decomposition

In morphological decomposition orthographic words are analysed into morphemes, i.e. elements belonging to the finite set of smallest sub-word parts with an identifiable meaning. Morphological decomposition is necessary when the language/spelling allows words to be strung together without intervening spaces or hyphens so as to form an indefinitely large number of complex, longer words. For many EU languages word-internal morpheme boundaries are referred to by the grapheme-phoneme conversion rules. For instance, if the English letter sequence sh is pronounced as /S/ when it occurs morpheme internally as in bishop, but is pronounced as /s/ followed by /h/ when a morpheme boundary intervenes, as in mishap.

Obviously, long and complex words will have to be broken up into smaller basic words and affixes (i.e. morphemes) before the parts can be looked up in an exceptions dictionary. If all complex words were to be integrally stored in the lexicon, it would soon grow to unmanageable proportions. For stress placement rules it is sometimes necessary to refer to the hierarchical relationships between the constituent morphemes (e.g. 'lighthouse keeper, light 'housekeeper, where ' denotes main stress) and to the lexical category of the word-final morpheme (which generally determines the lexical category of the complex word as a whole, e.g. black+bird is a noun, pitch+black is an adjective). Morphological decomposition is a notoriously difficult task, as one input string can often be analysed in a large number of different ways. The hard problem is choosing the correct solution out of the many possible solutions.

As far as we have been able to ascertain, there are no established test procedures for evaluating the performance of morphological decomposition modules. =1 (

; Laver et al. 1988 :12--16 ) ) tested the morphological decomposition module of the CSTR TTS on 500 words randomly sampled from a 85,000 word type list, which was compiled from a large text corpus as well as from two machine-readable dictionaries. The output of the module was examined by hand, and proved accurate at 70% (which seems rather low considering the fact that the elements of English compounds are generally separated by spaces or hyphens).

The Dutch morphological decomposition module MORPA (MORphological PArser, cf.

=1 (

; Heemskerk Van Heuven 1993) compared the module's output with pre-stored morphological decompositions in a lexical database. In this comparison only segmentation errors were counted, in a sample of 3,077 (simplex and complex) words taken from weekly newspapers. The results showed that in 3 percent of the input the whole word, or part of it, could not be matched with any entry in the MORPA morpheme lexicon. The frequency of this type of error depends on the coverage of the lexicon. Erroneous analyses were generated in another 1 percent of the input words. In all other cases the correct morphological segmentation was generated, either as the single correct solution (44%), or as the most likely solution in an ordered list of candidate segmentations (48%), or as one of the less probable candidate solutions (3%) Although both the accuracy and the coverage of the MORPA module seems excellent by today's standards, the module proved too slow for realistic text-to-speech applications. Processing speed is therefore an important criterion in the evaluation of morphological parsers. There will be a speed/ accuracy/coverage trade-off in the evaluation of morphological parsers.

Syntactic parsing

Syntactic analysis lays the groundwork for the derivation of the prosodic structure needed to demarcate the phonological phrases (whose boundaries block assimilation and stress clash avoidance rules) and intonation domains (whose boundaries are marked by deceleration, pause insertion and boundary marking pitch movements). Syntactic structure also determines (in part) which words have to be accented. Finally, lexical category disambiguation is often a by-product of a syntactic parser.

Although the syntactic parser is an important module in any advanced TTS, we take the view that, in principle, its development and evaluation does not belong to the domain of speech output systems. Syntactic parsing is much more a language engineering challenge, needed in automatic translation systems, grammar checking, and the like. For this reason, we refer to the chapters produced by the EAGLES Working Groups on the evaluation of Automatic Translation and Translation tools.

Sentence accent

Appropriate accentuation is necessary to direct the listener's attention to the important words in the sentence. Inappropriate accentuation may lead to misunderstandings and delays in processing time (cf. =1 (

; Terken 1985) ). For this reason most TTS-systems provide for accent placement rules. Accentuation rules can be evaluated at the symbolic and the acoustic level.

=1 (

; Monaghan Ladd 1989) , =1 (

; Monaghan Ladd 1990) tested the symbolic output of a sentence accent assignment algorithm applied to four English 250 word texts (transcripts of radio broadcasts). The algorithm generated primary and secondary accents, which were rated on a 4-point appropriateness scale by three expert judges. =1 (

; Van Bezooijen Pols 1989) tested a Dutch accent assignment algorithm at the symbolic as well as the acoustic level (only one type of accent is postulated for Dutch) using 8 isolated sentences and 8 short newspaper texts. Two important points emerged from this study:

Correlations between the symbolic and the acoustic evaluations were significant but rather low, which means that tests at the symbolic level are no adequate substitute for acoustic tests, and
Ratings for isolated sentences were more favourable than for sentences in paragraphs, which means that paragraph testing is necessary if the speech output system has to produce connected text.

Again, these are scattered tests, addressing only a handful of the problems that a linguistic module has to take care of. We would recommend the development of a comprehensive test procedure that identifies categories of accent placement error at the sentence and the paragraph level. The principles that underlie accent placement are largely the same across EU languages, so that it makes sense to develop the test procedure on a multi-lingual basis.

Acoustic aspects

In section , speech output assessment was approached within a black box context, i.e. with an emphasis on the speech output as a whole. Black box tests are by force acoustic in nature. In the present section acoustic tests are discussed as well, however this time within a glass box context, which means that attention is focussed on the quality of separate modules, mainly with a view of diagnostic testing. The structure of this section is based on traditional views in phonetics (e.g. =1 (

; Abercrombie 1967) ) according to which three layers are present in speech: a segmental layer (related to short-term fluctuations in the speech signal), a voice dynamics or prosodic layer (medium-term fluctuations), and a voice characteristics (or voice quality) layer (long-term fluctuations). We will make the same distinction here, being concerned with testing segments, with prosody, and with voice characteristics.

Segments

Functions of segments

The primary function of segments, i.e. the consonants and vowels in the language, is simply to enable listeners to recognise words. Generally, when the segments are sufficiently identifiable, words can be recognised regardless of the durations of the segments and the melodic pattern. In the experience of most researchers good quality (readily identifiable) vowels are afforded by even the simplest speech synthesis systems. One reason is that most coding schemes allow adequate parametrisation of vocalic sounds (narrow band formants slowly varying with time). The synthesis of good quality consonants is an altogether different matter (due to multiple excitation signals, notion of formant not always applicable, abrupt spectral changes), and this is where most (parametric) synthesisers show defects.

Moreover, since speech extends along the time dimension, segments early in the word in practice contribute more to auditory word recognition than later segments. Trailing segments, especially in long (i.e. polysyllabic) words are often not needed to distinguish the word from its competitors. Also, stressed syllables tend to contribute more to a word's identity than segments in unstressed syllables. For these reasons it makes sense to break down the segmental quality of speech output systems for vowels and consonants in various positions (initial, medial, final), within monosyllabic and polysyllabic words, and in stressed versus unstressed syllables.

Segmental tests

Of all aspects of synthetic speech output, the evaluation of the segmental aspect has received most attention till now, because:

Good segmental quality is considered to be the main prerequisite for good overall quality.
There is general agreement about the relevant categories in terms of which quality can be assessed, namely phonemes.
It is easy to establish accuracy: a phoneme intended to represent /f/ is identified accurately if listeners respond with /f/.

Near perfect segmental quality is essential for applications with a strong emphasis on the transmission of low-predictability information to untrained listeners, for example traffic information services and reverse telephone directory assistance ( What name and address belong to this telephone number?). Unlike in the case of ``normal'' words, the pronunciation of names cannot be deduced from the context. Moreover, for names it is particularly important that each consonant and vowel be clearly enunciated because there are many near-homophones, i.e. names that differ in just one sound, and strange names which listeners may never have heard before. In applications like these, where prosody is of minor importance, the required intelligibility level can be attained for instance by making use of canned speech or waveform concatenation. In other applications, where text-to-speech is preferred, it may perhaps not be necessary for each sound to be identified correctly. However, since very little is known as yet on the specific contributions of single sounds to overall intelligibility, synthesis designers have usually taken the pragmatic position that in principle all sounds should be identifiable. In that case detailed diagnostic testing of segmental quality using a glass box approach remains necessary.

As stated above, many tests have been developed to evaluate the quality of synthetic segments. There is a basic distinction between segmental tests at the word level, where single words (meaningful, meaningless or lexically unpredictable) are presented to listeners, and segmental tests at the sentence level, where complete sentences (meaningful, meaningless, or semantically unpredictable) are presented to listeners. Within either category tests can be categorised in functional and judgment studies.

Segmental tests at the word level

Functional segmental tests at the word level

The test approach used to evaluate segments at the word level has been mostly functional, quality being expressed in terms of percent correct phoneme identification. In this section we will discuss the Diagnostic Rhyme Test (DRT), the Modified Rhyme Test (MRT), the SAM Standard Segmental Test, an anonymous type of test which we shall henceforth call the Diphone Test, the Cluster IDentification (CLID) Test, the Bellcore Test, and the (Modified) Minimal Pairs Intelligibility Test. The reasons why these tests were selected to be included are varied: because they are well-known, well-designed, easy and fast to administer and/or promising. Summary information on most tests is provided in Appendix 1.

DRT (Diagnostic Rhyme Test) and MRT (Modified Rhyme Test)

The DRT (see D in Appendix 1) is a closed response test with two response alternatives. Items presented to the subjects are of the form CVC, i.e. an initial Consonant followed by a medial Vowel followed by a final Consonant. The identifiability of the medial vowel and final consonant is not examined, only the identifiability of the initial consonant is tested. All items are meaningful, which means that only factors (1) through (3) as listed in section can influence the results. In order to obtain insight into the precise nature of possibly poor identifiability of initial consonants, the two categories from among which the subjects are forced to select the correct response contain minimal phonemic contrasts. The subject would be asked for instance to indicate whether a synthetic item was intended as dune or tune.

The MRT (see E in Appendix 1) is an (originally) closed response test with six response alternatives. All items are of the form CVC and (in its original form) meaningful. Both the identifiability of initial and of final consonants are tested, but never simultaneously. An example of response alternatives testing the identifiability of a final consonant would be peas, peak, peal, peace, peach, and peat.

The use of meaningful test items has some positive effects:

The DRT and MRT are reliable, fast, and easy to administer and score.
No training is required of subjects because they can respond in normal spelling.

However, the DRT and MRT have some serious drawbacks and restrictions as well:

Intelligibility may be overestimated since subjects adjust their perception to the response categories presented to them.
There is a risk of ceiling effect.
Due to their restricted coverage and their limitation to meaningful words, they have little diagnostic value.

Both the DRT and MRT have a long tradition in speech output assessment and have been used in many studies, mainly for comparative purposes. The DRT has been employed among others by =1 (

; Pratt 1987) , who compared a wide range of synthetic voices/systems and a human reference, both clear and with noise added to give a speech-to-noise ratio of 0 dB(A). Eight subjects participated. The percentages correct for the human voice and five synthesisers are given in Figure .

Figure: Some results obtained with the DRT by =1 (

; Pratt 1987)

All factors (speech system, speech-to-noise ratio, and type of phonemic contrast in the two response categories) had a significant effect on the percents correct identification. More interestingly, all interactions appeared to be significant as well. For example, as can be seen above, the intelligibility of synthetic speech was affected by adding noise to a much higher degree than that of human speech. Moreover, adding noise extended the range of the percentages correct, thus making the test more sensitive. So, if synthesis systems are compared which are rather similar, it might be a good idea to add noise.

The MRT has been employed, among others, by =1 (

; Logan Pisoni Greene 1985) to evaluate eight synthesisers and a human reference. On the basis of the results, the systems were grouped into four categories, namely (1) human voice, (2) high-quality DECtalk 1.8 Paul, DECtalk 1.8 Betty, MITalk-79, Prose 3.0, (3) moderate-quality Infovox SA 101, Berkeley, TSI-proto 1, and (4) low-quality Votrax Type'n'Talk, and Echo. Percentages correct are given in Figure for the closed response variant.

Figure: Some results for the MRT ( =1 (

; Logan Pisoni Greene 1985) )

The categories distinguished could be used as benchmarks (although the data are somewhat dated, the set of synthesisers tested is probably representative of the quality range of more recent synthesisers). Methodological matters were considered as well. A test/retest design proved the MRT to be reliable. Moreover, the closed and open response variants (compared for the five best systems) yielded the same rank order.

SAM Standard Segmental Test

For diagnostic purposes the SAM Standard Segmental Test developed within the Speech Assessment Methods (SAM) project of ESPRIT (see A in Appendix 1) is to be preferred above the DRT and MRT tests. The test items in this test consist of meaningless and (sometimes by chance) meaningful, i.e. lexically unpredictable stimuli, which means that factors (1) and (2) as listed in section have an effect on the responses. Items are CV, VC, and VCV stimuli, where C stands for all consonants allowed in the given position in a given language and V for one of the three point vowels of the given language, typically open /a/, close front /i/, and close back /u/. So, all permissible consonants are tested in word initial, word medial, and word final position. Vowels are not tested; they provide varying phonetic contexts within which the consonants to be tested are placed (the identifiability of sounds can vary depending upon neighbouring sounds). Examples of test items are pa, ap, apa, ki, ik, and iki. An open response format is used, i.e. listeners choose a response from among all consonants.

The SAM Standard Segmental Test has many positive points:

It is relatively fast.
It provides precise information on the characteristics of consonants as perceived by listeners, these can be compared to the intended characteristics, thus indicating in which direction improvement should be sought.
It allows comparison of segmental intelligibility over languages.
Software is available, allowing automatic stimulus generation and scoring (in fact, the stimulus set is really a subset of the CLID test (see B in Appendix 1), which allows for more vowel contexts and syllable structures).

The main disadvantage of the SAM Standard Segmental Test is that:

It is restricted in that only single consonants are tested, whilst vowels are left out of consideration completely.

Part of the SAM Standard Segmental Test has been applied to English, German, Swedish, and Dutch synthesisers. Comparative results are available for Swedish medial C produced by a human and two synthesisers as perceived by listeners with perfect and imperfect hearing ( =1 (

; Goldstein Till 1992) ). The percentages correct medial C identification are given in Figure . Of the 54 test items, 3 were found to differ significantly (p=2%) between human and KTH, 9 between human and Infovox, and 3 between KTH and Infovox.

Figure: Some results for the SAM Standard Segmental Test ( =1 (

; Goldstein Till 1992) )

Diphone Test

A more complete overview of the performance of segments in a wider variety of contexts is provided by a test which assesses the intelligibility of all permissible (pronounceable) CVC, CVVC, VCV, and VCCV sequences of a given language. Such a test will be referred to as a Diphone Test, because the test items can be constructed by combining all the diphones in a diphone inventory. Just like in the SAM Standard Segmental Test, the test items are lexically unpredictable and the response categories open, so that it is useful for diagnostic purposes. Extra advantages of the Diphone Test over the SAM Standard Segmental Test are the following:

The Diphone Test is less restricted than the SAM Standard Segmental Test because in addition to consonants vowels are tested,
Both monosyllabic (CVC) and disyllabic (CV--VC, V--CV, VC--CV) structures are represented,
Due to the VCCV items, heterosyllabic consonant clusters are included, i.e. sequences of consonants belonging to different syllables.

The main disadvantages of the Diphone Test are:

It is not incorporated in a software package,
It does not comprise tautosyllabic consonants, i.e. sequences of consonants within the same syllable.

The Diphone Test has been used to evaluate diphone synthesis in French ( =1 (

; Pols et al. 1987) ), Italian ( =1 (

; Van Son et al. 1988) ), and Dutch ( =1 (

; Van Bezooijen Pols 1987) ). The Dutch Diphone Test combined all Dutch diphones into a set of 768 test items: 307 CVC, 173 VCV, 267 VCCV, and 21 CVVC. The only thing needed to construct the test material for a particular language is a matrix with the phonotactic constraints operating in that language, i.e. restrictions on the occurrence of all consonants and vowels in various word positions and phonetic contexts. Such matrices have been constructed for a number of European languages within the ESPRIT-SAM project.

Bellcore Test and CLID (CLuster IDentification) Test

As mentioned above, even the Diphone Test is not complete, since no tautosyllabic consonant clusters are included. The importance of this structure should not be underestimated. According to =1 (

; Spiegel et al. 1990) , about 40% of all one-syllable words in English begin and 60% end with consonant clusters. The Bellcore Test (see C in Appendix 1) and the CLID Test (see B in Appendix 1), have been developed to fill this gap.

The Bellcore Test has a fixed set of CVC-stimuli, comprising both meaningless and meaningful words, e.g. frimp and friend or glurch and parch. Tautosyllabic consonant clusters and single clusters are tested separately in initial and final position. Vowels are not tested. Open response categories are used. Compared to the Diphone Test, the Bellcore Test has some advantages, the main being that:

It includes tautosyllabic consonant clusters.

However, the Bellcore test has disadvantages as well:

It is confined to monosyllabic structures.
Vowels are not tested.
No test material is available for other languages than English.

The test has been applied to assess the intelligibility of two synthesisers compared with human speech, presented over the telephone ( =1 (

; Spiegel et al. 1990) ). The syllable score was 88% for human telephone speech and around 70% for the synthetic telephone speech. Consonant clusters had lower intelligibility than single consonants. Intelligibility for meaningful words was higher than for meaningless words, a finding which could not be explained.

The CLID Test is a very flexible architecture which can be used for generating a wide variety of monosyllabic test items in an in principle unlimited number of languages. Both meaningful and meaningless items can be generated as long as matrices are available with the phonotactic constraints to be taken into account. Open response categories are used. Intelligibility can be assessed in whatever way one wants. The CLID test has been applied to test the intelligibility of one German synthesiser ( =1 (

; Jekosch 1992) ) and the intelligibility of five German synthesisers ( =1 (

; Kraft Portele 1995) ).

The CLID test had all the advantages of the SAM Segmental Test listed above, whereas it does not have the disadvantage mentioned. Thus the positive points of the CLID test are the following:

It is relatively fast.
Clusters of consonants (tautosyllabic and heterosyllabic) and single consonants are tested as well as clusters of vowels and single vowels.
It provides detailed diagnostic information.
It allows comparison of segmental intelligibility over languages.
Software is available, allowing automatic stimulus generation and scoring.

Although the CLID test allows testing of a wide variety of segments, the range covered is not complete. For example:

Segments in unstressed syllables are missing.

(Modified) Minimal Pairs Intelligibility Test

The last tests we want to mention in this context, are the so-called Minimal Pairs Intelligibility Test (MPI Test), proposed by =1 (

; Van Santen 1993) as an alternative to the DRT, and a modification to it introduced by =1 (

; Syrdal Sciacca 1994) , the Diagnostic Pairs Sentence Intelligibility Evaluation Test (DPSIE Test). These tests were designed to reduce ceiling effects and expand the coverage of the DRT to include:

Vowels,
Tautosyllabic as well as heterosyllabic consonant clusters,
Unstressed syllables,
De-accented or cliticised words,
Words in sentences,
Polysyllabic words,
Insertions and deletions.

The MPI Test consists of a fixed set of 256 sentence pairs containing one contrast, e.g. The horrid courts scorch a revolution versus The horrid courts score a revolution. The minimal pair appears on the screen and the correct sentence has to be identified. Differences between the MPI Test and the DSPIE Test include:

A change of binary =1 (
; Jakobson Fant Halle 1951) distinctive features to multi-valued
=1 (
; Ladefoged 1975) distinctive features, so that more realistic perceptual confusions can be tested.
DPSIE generates its word pairs matched for familiarity (i.e. token frequency) and verb transitivity.

The main advantage of the MPI and DPSIE Tests is that:

The intelligibility of segments is evaluated in a virtually complete variety of conditions and phonetic contexts, potentially providing a host of diagnostic information.

The main disadvantages of the tests are that:

They are inefficient, since each response gives information on the identifiability of only one phoneme.
Inclusion of segments in stimuli is restricted by their occurrence in the language.
Creating test materials presupposes the availability of large databases.

Recommendation 28
Use the CLID Test for the evaluation of the segmental intelligibility at the word level, both for diagnostic and comparative purposes (in the latter case the stimulus set can be smaller).

Judgment tests at the word level

In principle, in addition to functional intelligibility tests, judgment tests, where subjects rate their subjective impression of the stimuli on scales, are possible for evaluating the segmental quality at the word level as well. For example, =1 (

; Van Bezooijen 1988) , in addition to running a consonant cluster identification test, presented 26 Dutch consonant clusters (both initial and final) to be rated on naturalness, intelligibility, and pleasantness. The clusters were embedded in meaningful words. In order to obtain ``pure'' judgments, unaffected by the quality of the rest of the word, subjects were explicitly asked to pay attention to the clusters only. So, the test required analytic listening. However, one can never be sure to what extent listeners in fact stick to the instructions. Perhaps this is one of the reasons why judgment tests of this type have been rare.

Segmental tests at the sentence level

In addition to the word level, tests for the assessment of segmental quality have been developed at the sentence level as well. Here the effect of prosody could be minimised by presenting the material on a monotone, but in practice, if only for naturalness' sake, prosody is usually included. Compared with the segmental tests at the word level, tests at the sentence level are more similar to speech perception in normal communication but at the same time, as a consequence, less suitable for diagnostic purposes, because of the following reasons:

With sentences, the intelligibility scores will not only be based on segmental quality but also to some extent on prosodic quality, so that poor intelligibility is more difficult to trace back to specific sources.
The composition of the test material is somewhat unsystematic, so that no complete confusion matrices can be obtained.
Especially with semantically normal sentences listeners will not only rely on segmental information but use other information sources as well, related to word combinatory probabilities, and, in the case of meaningful sentences, to semantic coherence (factors (4) and (5) in section ).

Of course, if the test is not intended as a diagnostic tool but has a purely comparative aim, these consequences of using sentences do not necessarily detract from its value. However, it is important to remember that as soon as complete sentences are presented to listeners, the test is no longer limited to evaluating segmental quality alone. This means that the title of this section ``segmental tests at the sentence level'' is not completely adequate. In fact, depending on the extent to which restrictions are imposed on the construction of the test materials, tests at the sentence level are in between a glass box approach and a black box approach. So, the main differences among the segmental sentence tests described is their position on the glass box black box continuum.

In this section only functional tests will be discussed. In addition, judgment tests at the sentence level have frequently been carried out. These are described under the heading `black box approach' in , where judgment tests to evaluate overall output quality are discussed. These tests entail the rating of scales such as acceptability, intelligibility and naturalness.

Harvard Psychoacoustic Sentences

One of the most well-known segmental intelligibility tests at the sentence level is the fixed set of 100 semantically and syntactically ``normal'' Harvard Psychoacoustic Sentences ( Add salt before you fry the egg) (see H in Appendix 1). Intelligibility is expressed by means of the percent correctly identified keywords (nouns and verbs). In this test no restrictions are placed upon the composition of the test materials, which means that the percent correct responses is determined only to a limited extent by the acoustic characteristics of the individual segments. This test would therefore have to be placed towards the black box end of the continuum. In terms of the factors listed in section , only (6), (7) and (8) are excluded.

The main advantages of the Harvard Psychoacoustic Sentences Test are:

It is easy to administer (no training required of the subjects).
Responses are easy to score (be it manually).

The main disadvantages of the test are:

Limited to comparative purposes.
A strong learning effect (the same sentences can only be used once).
Restricted number of test items.
Restricted generalisability, because there is only one syntactic structure.
Danger of a ceiling effect (so only fit to be used for low-quality systems).

The Harvard Psychoacoustic Sentences were compared with the Haskins sentences by

=1 (

; Pisoni Greene Nusbaum 1985) , =1 (

; Pisoni Nusbaum Greene 1985) ) for four synthesisers and human speech (see below).

Haskins Syntactic Sentences

Another famous test at the sentence level is the fixed set of 100 Haskins Syntactic Sentences (see F in Appendix 1). These sentences are semantically unpredictable, which means that they do not occur in daily life. An example is The old farm cost the blood. In terms of advantages and disadvantages, the Harvard Sentences and Haskins Sentences have much in common. The only difference is that Haskins listeners can rely less on semantic coherence (factor (5) in the list of factors in section ) so that the role of the acoustic characteristics of the segments is more important. Therefore, the Haskins sentences find themselves somewhat closer to the glass box end of the continuum than the Harvard sentences. The Haskins sentences were applied to four synthesisers and human speech by =1 (

; Pisoni Greene Nusbaum 1985) , =1 (

; Pisoni Nusbaum Greene 1985) , and compared with the Harvard sentences. Percent correct keyword identification are given in Figure .

Figure: Some results for the Harvard Psychoacoustic Sentences and Haskins Syntactic Sentences

It can be seen that the two tests yield the same rankorder. However, as expected, due to the reduced semantic coherence, the Haskins sentences are more sensitive.

Semantically Unpredictable Sentences (SUS)

Both the Harvard and Haskins sentences had a fixed set of sentences, characterised by a single syntactic structure, which could be used as test materials. More recently, a more flexible approach was opted for in the Semantically Unpredictable Sentences (SUS), developed by SAM (see G in Appendix 1). The test materials in the SUS consist of a fixed set of five syntactic structures which are common in most Western European languages, such as `Subject--Verb--Adverbial' ( The table walked through the blue truth). The lexical slots in these structures are filled with high-frequency words from language specific lexicons. The resulting stimulus sentences are semantically unpredictable, just like the Haskins Syntactic Sentences.

The advantages of the SUS Test are the following:

Easy to administer (no training required of the subjects).
Responses can be scored automatically.
Reasonable generalisability due to the variation in syntactic structures and lexical items.
Comparability across languages.

The main disadvantages of the test are:

Limited to comparative purposes.
Learning effects (despite the large number of test items).
Risk of ceiling effects (so only fit to be used for low-quality systems).

Pilot studies with the SUS test have been run in French, German, and English ( =1 (

; Benoît 1989) ; =1 (

; Benoît et al. 1989) ; =1 (

; Hazan Grice 1989) ). Results showed, among other things, that keywords presented in isolation were identified significantly less well than the same words in a sentence context. This is attributed in part to the fact that the syntactic category of the isolated words is not known. Furthermore, the SUS were found to be sensitive enough to discriminate between two synthesisers differing in prosody.

Recommendation 29
Use the SUS Test to evaluate intelligibility for comparative purposes at the sentence level.

Prosody

Functions of prosody

By prosody we mean the ensemble of properties of speech utterances that cannot be derived in a straightforward fashion from the identity of the vowel and consonant phonemes that are strung together in the linguistic representation underlying the speech utterance. Prosody would then comprise the melody of the speech, word and phrase boundaries, (word) stress, (sentence) accent, tempo, and changes in speaking rate. We exclude from the realm of prosody the class of voice characteristics (see section ).

Prosodic features may be used to differentiate between otherwise identical words in a language (e.g. trusty -- trustee, with initial stress versus final stress, respectively). Yet, word stress is not so much concerned with making lexical distinctions (this is what vowels and consonants are for) as with providing checks and bounds to the word recognition process. Hearing a stressed syllable in languages with more or less fixed stress informs the listener where a new word may begin; error responses in word recognition strongly tend to agree with the stimulus in terms of stress position. In a minority of the EU languages (Swedish, Norwegian) lexical tone (rather than stress) is exploited for the purpose of differentiating between segmentally identical words.

The more important functions of prosody, however, are located at the linguistic levels above the word:

Prosody offers explicit segmentation cues in the form of phrase boundaries, i.e. it tells the listener which words go together and should be interpreted as making up a coherent chunk of information; also, these cues allow the listener to determine the ``depth'' of the break between chunks, i.e. whether the end of a word group, clause, sentence, or even a whole paragraph has been reached.
Prosody provides an indication for the listener which words are presented by the speaker as expressing important information (highlighting or focussing through accentuation).
Prosody, especially melody, carries some meaning of its own (intonational meaning) which, for example, allows the speaker to present a sentence as a statement or a question, or to express emotions and/or attitudes towards the verbal contents of the message or towards the hearer.

These functions suggest that prosody affects comprehension (establishing the semantic relationships between words) rather than intelligibility (determining the identity of words), and, indeed, this is what most functional tests of prosody aim to evaluate.

Judgment tests of prosody

Evaluation of the prosody of speech output systems is alternately focussed on the formal or the functional aspects. Only a handful of tests are directed at the formal quality of temporal organisation. An exemplary evaluation study on the duration rules of MITalk ( =1 (

; Allen Hunnicutt Klatt 1987) ) was done by =1 (

; Carlson Granström Klatt 1979) . They generated six different versions of a set of sentences by including or excluding effects of consonant duration rules, vowel duration rules, and stressed syllable and preboundary lengthening rules in the synthesis. These versions were compared with a topline condition where (normalised) segment durations copied from human versions of the test sentences spoken by the MITalk designated talker were imposed on the synthesis. There were two baseline conditions, one with the neutral (inherent) table values substituted for all segments, and one with random segment duration variation (within realistic bounds). The results showed that the temporal organisation afforded by the complete rule set was judged as natural as the human topline control. Moreover, sentences generated with boundary markers at minor and major breaks were judged more natural than speech without boundary markers.

More work has been done in the field of melodic structure. Let us first consider judgments of formal aspects of speech melody. The formal properties of, for example, pitch movements or complete speech melodies can be tested by asking groups of listeners (either naive or expert) to state their preference in pairwise comparisons or to rate a melody in a more absolute way along some goodness or naturalness scale. At the level of elementary pitch movements (such as accent-lending or boundary marking rises, falls, or rise-fall combinations) the SAM Prosodic Form Test (see I in Appendix 1) is a useful tool. The test was applied to two English and two Italian synthesisers, with 3 contours, 4 levels of segmental complexity, 5 items at each level, 4 repetitions of each token ( =1 (

; Grice Vagges Hirst 1991) ). Significant effects were found of synthesiser and contour, as well as of the interactions between synthesiser and contour and between synthesiser, complexity, and contour. By relating the scores for the contours to those for the monotone reference, the effect of differences in segmental quality on the ratings could be cancelled out.

Using the same methodology, i.e. rating and pairwise comparisons, the quality of synthetic speech melody can be evaluated at the higher linguistic levels. At the level of isolated sentences pairwise comparisons of competing intonation-by-rule modules is feasible when the number of systems (or versions) is limited (e.g. =1 (

; Akers Lennig 1985) ). When multiple modules are tested using a larger variety of sentences and melodies, scale rating is to be preferred over pairwise comparisons for reasons of efficiency ( =1 (

; De Pijper 1983) ; =1 (

; Willems Collier Hart 1988) ).

Evaluation of speech melody generators should not stop at the level of isolated sentences. Ratings by expert listeners in Dutch could not reveal any quality differences between synthetic melodies and a human reference when the sentences were listened to in isolation; however, the same synthetic melodies proved inferior to the human reference when they were presented in the context of their full paragraph ( =1 (

; Terken Collier 1989) ). Along the same lines, =1 (

; Salza et al. 1993) evaluated the prosody of the Eloquens TTS for Italian in a train schedule consultation application. Three hundred sentences were tested, realistically distributed over seven melodic modalities: Command sentences, Simple declaratives, List sentences, Wh-questions, Yes/no-questions, Yes/no-echo questions, and Yes-no modal questions. Expert listeners' scores did not differ from those of naive subjects, and scores were better for utterances presented as part of a dialogue than for sentences presented in isolation. Clearly, both studies demonstrate that paragraph position or function within a dialogue induces certain perceptually and communicatively relevant adaptations to sentence prosody.

The form tests discussed so far address prosody globally. An analytic approach to prosodic evaluation using judgment testing was proposed by =1 (

; Bladon 1990) and co-workers. They developed an elaborate check list of formal properties that should be satisfied by any speech output system that claims to generate English melodies. Trained (but phonetically naive) judges listen to synthetic utterances, while looking at orthographic transcripts of the utterance with a crucial word or syllable underlined. Their task is to check whether the target syllable does in fact contain the melodic property prescribed by the check list. Although this idea is attractive from a diagnostic point of view (melodic flaws are immediately identified) the system has some drawbacks that should be considered before extending its use to other materials and other languages. First, drawing up a valid checklist presupposes a theory of intonation, or at least a detailed and valid description of the test sentences. Workable theories and descriptions may be provided for English, but will not be available for all (EU) languages. Second, even for English, the criteria for each melodic check point were formulated in rather crude terms, which makes it difficult for the judges to determine whether the utterance does or does not satisfy the criterion. Third, it is impossible to determine the overall quality of the melodies tested, since there is no way of combining the pass/fail scores for the various check points into a weighted overall score. A preliminary experiment revealed that three output systems could be meaningfully rank-ordered along a quality scale, but not at the interval measurement level. Systems that were clearly different as judged by experts were very close to each other in terms of their unweighed overall score, whereas systems that were rated as equally good by experts differed by many points. For these reasons, we do not recommend analytic judgments by naive listeners using a check list as an approach to evaluating prosody.

There is (at least) one judgment test that assesses how well certain communicative functions are signalled by prosody at a higher level. The SAM Prosodic Function Test (see I in Appendix 1) asks for ratings of the communicative appropriateness of melodies in the context of plausible human-machine dialogue situations. The test was applied to human-machine dialogues designed to simulate a telephone enquiry service giving flight information ( =1 (

; Grice Vagges Hirst 1992a) ). A restricted set of contexts and illocutionary acts were included: asking (seeking information, seeking confirmation), assertive (conclude, put forward, state), expressive (greet), and commissive (offer, propose to). Two intonation versions were compared, one based on an orthographic input with punctuation (target intonation algorithm) and the other based on a text input edited to conform to the type of text generated by an automatic language generator (reference intonation algorithm). The test should be seen as a first attempt to evaluate paralinguistic appropriateness of intonation in dialogue situations. For general comparative purposes, it would be useful to have an agreed-upon, systematic inventory of situations or speech acts one would want to include, taking as a point of departure, for example, the classification of speech acts proposed by =1 (

; Searle 1979) .

Finally, we are not aware of tests asking subjects to judge the quality of the expression of emotions and attitudes in synthetic speech. It would appear that functional testing of these qualities is preferred in all cases.

Functional tests of prosody

Evaluating speech output prosody using functional tests is even more in its infancy. Since prosody is highly redundant given the segmental information (with the exception of the signalling of sentence type and emotion/attitude), it can be functionally tested only if measures are taken to reduce its redundancy. The first course of action, then, has been to concentrate on atypical, rather contrived materials in which prosody is non-redundant. That is, the materials consist of segmental structures that would be ambiguous without the prosody, and listeners are asked to solve the ambiguity. To the extent that the disambiguation is successful, the speech output system can be said to possess the appropriate prosodic functions. We find examples of such functional tests for the disambiguation of minimal stress pairs (for a survey, see =1 (

; Beckman 1986) , word boundaries (for a survey, see =1 (

; Quené 1993) , constituent structure (syntactic/prosodic bracketing; e.g. =1 (

; Lehiste Olive Streeter 1976) , sentence type (e.g. =1 (

; Thorsen 1980) ), and focus distribution (e.g. =1 (

; Nooteboom Kruijt 1987) ). However, in these kinds of study speech output assessment typically was not the primary research goal. Rather, speech synthesis was used here by psycholinguists or experimental phoneticians to manipulate the speech parameters in a controlled fashion.

The second route is to make prosody less redundant by degrading the segmental quality, such that without prosody (i.e. in the baseline conditions identified above) the intelligibility of the speech output would be extremely poor. The quality of the prosody would then be measured in terms of the gain in intelligibility, i.e. increase in percent correctly reported linguistic units (phonemes, morphemes, words) due to the addition of prosody.

=1 (

; Carlson Granström Klatt 1979) measured intelligibility of utterances synthesised by MITalk with and without application of vowel duration, consonant duration and boundary marking rules (see above). They found that adding duration rules improved word intelligibility; adding within-sentence boundaries, however, did not boost intelligibility (even though the result was judged to be more natural, see above). =1 (

; Scharpff Van Heuven 1988) demonstrate that adding within-sentence boundaries (i.e. changing the temporal organisation) does improve word intelligibility (especially for monosyllabic words) in Dutch diphone synthesis, and that utterances with pauses were judged as more pleasant to listen to (but only when listeners were unfamiliar with the contents of the sentence, ( =1 (

; Van Heuven Scharpff 1991) ). Reasoning along the same lines, one would predict that quality differences in speech melody would have an effect on word recognition in segmentally degraded speech. Such effects were, in fact, reported by

=1 (

; Maassen Povel 1985) , who used (highly abnormal) speech utterances produced by deaf speakers resynthesised with corrected temporal and/or melodic organisation.

There is a substantial literature on the perception of emotion and attitude in human speech (for a survey, see =1 (

; Murray Arnott 1993) ). Typically, listeners are asked to indicate which emotion they perceive in the stimulus utterance, in open or closed response format. Predictably, the larger the set of response alternatives, the poorer the identification of each emotion. It is not clear, in this context, how many different emotions should be distinguished, and to what extent these can be signalled by phonetic means. Still, results tend to show that the most basic emotions can be identified, in lexically neutral utterances, at better than 50% correct, in a 10 alternative closed response test. Synthesis of emotion in speech output is being attempted by several research groups. Preliminary evaluation of emotion-by-rule in Dutch diphone synthesis was presented by

=1 (

; Vroomen Collier Mozziconacci 1993) , as summarised in Figure :

Figure: Percent correctly recognised emotions-by-rule in Dutch diphone synthesis (two diphone sets, obtained from different speakers) and in human speech (after =1 (

; Vroomen Collier Mozziconacci 1993) )

Voice characteristics

Functions of voice characteristics

Whereas the segmental and prosodic features of speech are continuously varying, voice characteristics are taken to refer to aspects of speech which generally remain relatively constant over longer stretches of speech. Voice characteristics, also referred to as voice quality (cf. =1 (

; Laver 1991) ), can most easily be viewed as the background against which segmental and prosodic variation is produced and perceived. In our definition, it includes such varied aspects of speech as mean pitch level, mean loudness, mean tempo, harshness, creak, whisper, tongue body orientation, dialect, accent, etc. Voice quality is mainly used by the listener to form a (sometimes incorrect) idea of the speaker's

Mood and personality (cheerful, reliable, dominant),
Physical size (tall, large, strong),
Sex (male, female),
Age (child, young adult, aged),
Regional background (globally ``from the North'' or more precisely ``from London, Paris, or New York''),
Socio-economic status (high/low education),
Health (cold),
Identity.

In principle voice quality is not communicative, i.e. not consciously used by the speaker to make the listener aware of something which he was not previously aware of, but informative, which means that, regardless of the intention of the speaker, it is used by the listener to infer information. This information may have practical consequences for the continuation of the communicative interaction, since it may influence the listener's attitudes towards the speaker in a positive or negative sense and may affect his interpretation of the message (cf. =1 (

; Laver 1994) ).

Since recently, increased attention is being paid to voice quality aspects of synthetic speech. In fact, =1 (

; Sorin 1994) regards the successful creation of personalised synthetic voices (``personalised TTS'') as one of the most ambitious challenges of the near future. This aspect of synthesis is, for example, relevant in such applications as Translating (Interpreting) Telephony services, where along with translating the content of the message the original voice of the speaker has to be reconstructed (automatic voice conversion). Moreover, the correct encoding of speaker characteristics such as sex, age, and regional background is also relevant for the reading of novels for the blind. Finally, a third application is to be found in non-speaking disabled individuals, who have to use a synthetic speech to replace their own.

With a view of the latter application, =1 (

; Murray Arnott 1993) ) describe a system allowing rapid development of new voice ``personalities'' for the DECtalk synthesiser with immediate feedback to the user. Voice alteration is done by interpolating between the existing DECtalk voices (five male voices, five females voices, and a unisex child). Thus a voice may be created that sounds ``a bit like Paul with a bit of Harry''. A somewhat different approach aimed at a somewhat different type of application is described by

=1 (

; Yarrington Foulds 1993) , who use original recordings of speakers who know they are going to lose their voice to construct speaker specific diphone sets.

Voice characteristics tests

Apart from specific requirements imposed by concrete applications, a general requirement of the voice quality of synthetic output is that it should not sound unacceptably unpleasant. Voice pleasantness is one of the scales included in the overall quality test proposed by the ITU-T to evaluate synthetic speech transmitted over the telephone (see L in Appendix 1). It has also been used by =1 (

; Van Bezooijen Jongenburger 1993) in a field test to evaluate the functioning of an electronic newspaper for the blind. In this test, 24 visually handicapped rated the pleasantness of voice of two synthesisers on a 10-point scale (1: extremely bad, 10: extremely good). Ratings were collected at three points in time: (1) in a first confrontation with the synthesis output, (2) after one month, and (3) after two months of ``reading'' the newspaper. Interestingly, the pleasantness of voice ratings were found not to change over time, in contrast to the intelligibility ratings, which reflected a strong learning effect. From this it was concluded that voice quality has to be good right from the start; one cannot count on the beneficial effect of habituation. Both synthesis systems were generally considered good enough for the reading of popular scientific books and newspapers. However, partly due to the unpleasant voice quality, they were found unfit for the reading of novels or poetry ( =1 (

; Jongenburger Van Bezooijen 1992) ). So, voice quality mainly seems to play a role when attention is directed to the form of the message, for recreational purposes. Finally, we hypothesise that perhaps more than for aspects of speech affecting comprehension, motivation and positive attitude might compensate for poor voice quality.

Of course, judgment studies such as these can only provide global information; if results are negative, no diagnostic information is available as to what voice quality component should be improved. There are no standard tests to diagnostically evaluate the voice quality characteristics of speech output. This type of information could in principle be obtained by means of a modular test, where various acoustic parameters affecting voice quality are systematically varied so that their effect on the evaluation of voice quality can be assessed. This would be the most direct approach.

A more indirect approach would involve asking subjects to listen analytically to and rate various aspects of voice quality on separate scales. A potentially useful instrument for obtaining a very detailed description is the Vocal Profile Analysis Protocol developed by

=1 (

; Laver 1991) . This protocol, which comprises more than 30 voice quality features, requires extensive training. If data are available for several synthesis outputs the descriptive voice quality ratings could be used to predict the overall pleasantness of voice ratings.

It may also be possible to use untrained listeners, although the number of aspects described will necessarily be more limited and less ``phonetic''. Experience with human speech samples representing various voice quality settings =1 (

; Van Bezooijen 1986) has shown that naive subjects can reliably describe 1-minute speech samples with respect to the following 14 voice quality scales: warm--sharp, smooth--rough, low--high, soft--loud, nasal--free of nasality, clear--dull, trembling--free of trembles, hoarse--free of hoarseness, full--thin, precise--slurred, fast--slow, accentuated--unaccentuated, expressive--flat, and fluent--halting. Again, if descriptive ratings of this type were available for synthetic speech they could be correlated with global ratings of synthesised voice quality. Alternatively, this type of scale could also be used more directly for diagnostic purposes, i.e. subjects could be asked to rate each of these voice quality aspects on a 10-point scale, with 1: extremely bad and 10: extremely good.

However, as mentioned above, experience with detailed perceptual descriptions of voice quality is as yet limited to non-distorted human speech. It remains to be assessed whether such descriptions can also be reliably made for synthetic speech. And even if this proved to be the case, the translation of the results obtained to actual system improvement is not unproblematic, since not much is known about the acoustic basis of perceptual ratings. Attempts in this direction have been rather disappointing (e.g. =1 (

; Boves 1984) ).

In addition to judgment tests to evaluate the formal aspects of voice quality, functional tests may be used to assess the adequacy of voice quality. Although here also no standard tests are available, the procedures are rather straightforward and dictated directly by application requirements. One can think, for example, of tests in which subjects are asked, in an open or closed response format, to identify the speaker. This would be useful in an application where one tries to construct a synthetic voice for a given speaker or reconstruct the natural voice of a given speaker. Or one can ask people to identify the speaker's sex, or estimate his age or other characteristics.

In this context, accent and dialect features are relevant as well. For example, for Dutch a new set of diphones was derived from a western speaker, because some non-speaking users complained that the old diphone set had too much of a southern accent to be acceptable for communication in their living environment. To test whether naive listeners were in fact able to discriminate between the two diphone sets, listeners from different parts of the Netherlands rated CVC, VCV, and VCCV stimuli produced with the two systems on a 10-point bipolar regional accent --- standard Dutch scale. The diphone sets were indeed clearly discriminable ( =1 (

; Van Bezooijen 1988) ).

Summarising it can be stated that very little experience has as yet been gained with the diagnostic and comparative evaluation of voice quality of speech output systems, either by means of judgment or functional tests. Moreover, except for specific applications where synthesis is closely connected with the identity of a speaker (in a clinical or automatic voice conversion setting), it is not even clear how much importance is attached to voice quality by naive listeners. How much does it people really bother when voice quality is unpleasant? For example, does an unpleasant voice quality prevent them from using a synthetic information service? We think it is too early to give concrete recommendations how to approach the evaluation of voice quality aspects of speech output. It is one of the topics for the near future.

Relationships among tests

Knowledge about the relationships among tests is important for at least two reasons:

It allows a better interpretation of the meaning and validity of the test results obtained.
It can be used to decide upon the test suite which gives a complete picture of all relevant aspects of speech output without being redundant; it is no use employing two tests which (to a large extent) yield the same information.

What would be needed to assess the relationships among tests is a large scale study which compares the performance of all ``serious'' tests testing the same aspect (e.g. intelligibility or comprehension) for a wide range of synthesisers. One would then like to know the stable differences among the tests in quality measured (e.g. percentage correct), as well as the correlations among the rank orderings of the synthesisers. In addition, it would be useful to have information on the reliability of ``identical'' tests developed and applied to a wide variety of different languages.

Some differences between the results obtained with different tests can be predicted to some extent. For example, when considering intelligibility, we think at least four factors will affect the outcomes: Intelligibility can be expected to increase

as the unit of measurement is smaller (it is easier to identify one phoneme correctly than a sequence of phonemes),
as the structure of the test items is more predictable (fixed versus open structure),
as the combination of phonemes is more predictable (meaningful versus meaningless),
as the number of response categories is smaller (closed versus open).

These predictions can be tested by looking at actual intelligibility results. =1 (

; Jekosch Pols 1994) , for example, assessed the intelligibility of one German synthesiser by means of four different tests (all described in Appendix 1):

The SAM Standard Segmental Test (open response).
A German variant of the MRT (closed response).
The CLID test (open response).
The SUS test (open response).

Percent correct elements (phonemes in the SAM Standard Segmental Test, clusters in the CLID test, words in the MRT and the SUS test) differed widely, from 19% to 85%. The lowest percentage was obtained for the SUS test, followed by the SAM Standard Segmental Test, the CLID test, and the MRT. The fact that the highest score was obtained with the MRT agrees with our predictions, since this test possesses not a single aspect with a negative effect on intelligibility: The unit of measurement is small (phoneme), the structure is fixed (CVC), the items are meaningful, and the response set is closed (six categories). The results for the other three tests point to complex interactions among the four factors.

=1 (

; Delogu Paoloni Sementina 1992) compared four different test methods for evaluating the overall quality of Semantically Unpredictable Sentences produced by a male speaker (once with and once without noise added), three synthesisers, and three vocoders:

Categorical estimation (5-point scale from ``excellent'' to ``bad'').
Magnitude estimation (quality expressed by any number between 1 and 100).
Paired comparison (``which of the two realisations do you prefer?'').
Word monitoring (``push a button as soon as a specified target word is recognised'').

Very high correlations were obtained among categorical estimation, magnitude estimation, and paired comparison (r>.90); somewhat lower but still high correlations were found between these three test methods and reaction time (r around .80). Reaction time showed the smallest variation in the responses, but the least discriminatory power. The best discrimination among the systems was obtained with paired comparisons.

=1 (

; Silverman Basson Levas 1990) compared the results of the Bellcore intelligibility test (see C in Appendix 1) with a comprehension test in which subjects had to answer questions related to the content of synthesised utterances with yes, no, or can't tell from the information provided. The faster subjects answered questions, the more items they heard. Two synthesisers, A and B, were tested. The intelligibility test yielded higher percentages correct for A than for B (77% versus 70%), whereas the comprehension test yielded higher percentages correct for B than for A (69% versus 63%). A few remarks are in order when attempting to interpret these seemingly contradictory results:

The presence of prosody in the comprehension test may have played a role.
Silverman et al. rightly state that the intelligibility scores relate to utterance-initial and utterance-final phonemes only, which hardly occur in sentence-length material (and running speech in general, for that matter).
It appeared that subjects generally heard fewer questions from B than from A in the test time allotted, a finding the meaning of which is complicated by the fact that B spoke at a slower rate than A.

Whatever the exact basis of the opposite rank orders yielded by the two tests, it is clear that caution should be exercised when generalising from a laboratory-type intelligibility test to a field-type, application-oriented comprehension test. Low correlations between intelligibility (MRT) and comprehension are also reported by

=1 (

; Ralston et al. 1991) .

In general, studies comparing different tests comprehend only a limited number of systems, which makes it difficult to learn to what extent the different tests rank the systems in the same way. Moreover, the relationship between the results yielded by glass-box and black-box tests deserves more systematic attention. We think that the importance of further studies of the relationships among tests cannot be stressed enough, if one wants to have a good idea of the meaning and generality of results obtained.

Next: Further developments in Up: Synthesis assessment Previous: Black box approach

WWW Administrator
Fri May 19 11:53:36 MET DST 1995