Next: Appendix 1: Summary Up: Synthesis assessment Previous: Further developments in

List of recommendations

Recommendation 1
Use a glass box approach if you want diagnostics in order to improve your speech output system.

Recommendation 2
Use a black box approach if you want to assess the overall performance of speech output systems.

Recommendation 3
Do not rely on laboratory tests. As soon as there is a discrepancy between the laboratory setting and the true field situation (in terms of environment, tasks, type of listener) field testing is necessary.

Recommendation 4
Always exclude hearing-impaired subjects from speech output assessment. Within the SAM project (Howard-Jones 1992a, 1992b) it is specified that subjects should pass a hearing screening test at 20 dB HL at all octave frequencies from 500 to 4000 Hz.
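
The screening criterion can be sketched in a few lines of Python. This is an illustrative sketch only: the function name, the dictionary layout of the audiogram, the reading of "pass" as a hearing threshold at or below 20 dB HL, and the treatment of a missing frequency as a failure are all assumptions, not part of the SAM specification.

```python
# Octave frequencies between 500 and 4000 Hz, as named in the recommendation.
SCREENING_FREQUENCIES_HZ = (500, 1000, 2000, 4000)
SCREENING_LEVEL_DB_HL = 20


def passes_hearing_screening(thresholds_db_hl):
    """Return True if every screening frequency has a threshold <= 20 dB HL.

    thresholds_db_hl: dict mapping frequency in Hz to the measured hearing
    threshold in dB HL (hypothetical data layout). A frequency that was not
    measured counts as a failure (assumption).
    """
    for freq in SCREENING_FREQUENCIES_HZ:
        threshold = thresholds_db_hl.get(freq)
        if threshold is None or threshold > SCREENING_LEVEL_DB_HL:
            return False
    return True
```

A subject with thresholds of 10, 15, 20 and 5 dB HL at the four frequencies would be admitted; a single threshold of 25 dB HL would exclude the subject.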

Recommendation 5
Never use the same subject more than once, unless, of course, one is interested in the effect of repeated exposure.

Recommendation 6
In diagnostic testing only include subjects speaking the same language (variety) as the language (variety) tested.

Recommendation 7
In diagnostic testing, hire a trained phonetician (with a basic understanding of the relationships between articulation and acoustics) in the initial stages of development of a system in order to obtain subtle information (e.g. degree of voicing in plosives), or information that is usually not used for functional purposes in real-life communication (e.g. formal aspects of temporal organisation and intonation, cf. Terken 1993).

Recommendation 8
In specialised applications, select subjects who are representative of the (prospective) users. For example, synthesis integrated in a reading machine for the blind should be tested with visually handicapped subjects. Similarly, synthesis for long-term use should be tested with subjects who have different degrees of experience and familiarisation with the type of synthetic speech of interest.

Recommendation 9
Synthesis to be used by the general public for incidental purposes, i.e. which should be functionally adequate in a first confrontation, should be tested with a wide variety of subjects, including people with a limited command of the language, dialect speakers, and people of different ages. However, none of them should have any experience in listening to synthetic speech. In telecommunications research, groups of between 12 and 16 subjects (all with English as their primary language) have been found sufficient to obtain stable mean values in judgment tests.

Recommendation 10
Use categorical estimation in rapid judgment tests with a view to test-internal comparison, and when you do, use at least a 10-point scale.

Recommendation 11
Use magnitude estimation for test-external comparison, and when you do, use line length with imaginary ideal speech as a reference.

Recommendation 12
Absolute segmental topline: In the case of allophone synthesis, use human speech produced by a designated talker, i.e. the same individual on whose speech the table values and synthesis rules were based, or who, in the case of concatenative synthesis, provided the basic synthesis building blocks. The absolute topline reference will then be based on CD-quality digital speech.

Recommendation 13
Relative segmental topline for parametric synthesis: A second useful topline reference is the human reference speech (see Recommendation 12) but analysed and (re-)synthesised using exactly the same coding scheme that is employed in the speech output system to be tested.

Recommendation 14
Relative segmental topline for waveform concatenation: Use the same (lower) bitrate in the reference condition as in the speech output system.

Recommendation 15
Segmental baseline for allophone synthesis: Use speech in which all segments retain their table values and are strung together merely by smoothing spectral discontinuities at segment boundaries.

Recommendation 16
Segmental baseline for concatenative synthesis: Use speech made by stringing together coarticulatorily neutral phones (i.e. stressed vowels spoken between two /s/-es, or stressed consonants preceded by schwa and followed by an unrounded central vowel, cf. the `neutrone' condition in Van Bezooijen & Pols 1993). Minimal smoothing should be applied to avoid spectral jumps.

Recommendation 17
Temporal and melodic topline: Copy, as accurately as possible within the limitations of the synthesiser, the temporal structures and speech melodies of a single designated professional human speaker onto the synthetic speech output.

Recommendation 18
Temporal baseline: Use a condition in which the smallest synthesis building blocks (phoneme, diphone, demisyllable) retain their original, unmanipulated durations as they were copied from the human original from which they were extracted (or, in the case of allophone synthesis, the phoneme duration table values, cf. Carlson, Granström & Klatt 1979).

Recommendation 19
Melodic baselines: Synthesise utterances on a monotone, at a pitch level that coincides with the average pitch of the test items. Also, include a random melodic reference for the sake of validation, by introducing random pitch variations (in terms of excursion size, rate of change, and segmental alignment), within physiologically and linguistically reasonable limits and with a mean pitch equal to the average of the test items.
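
The two melodic baselines can be illustrated with a small Python sketch that produces pitch contours as lists of F0 values in Hz. Everything specific here is an assumption for illustration: the function names, the frame-based representation, the default excursion limit of 6 semitones, and the use of equally spaced random pitch targets joined by linear interpolation. The sketch randomises only excursion size (and thereby rate of change); random segmental alignment is not modelled.

```python
import random


def monotone_contour(test_item_pitches_hz, n_frames):
    """Flat contour at the average pitch of the test items."""
    mean_pitch = sum(test_item_pitches_hz) / len(test_item_pitches_hz)
    return [mean_pitch] * n_frames


def random_contour(test_item_pitches_hz, n_frames, n_targets=5,
                   max_excursion_semitones=6.0, seed=None):
    """Random contour whose mean equals the average pitch of the test items.

    n_targets (>= 2) and max_excursion_semitones are hypothetical parameters
    standing in for 'physiologically and linguistically reasonable limits'.
    """
    rng = random.Random(seed)
    mean_pitch = sum(test_item_pitches_hz) / len(test_item_pitches_hz)
    half = max_excursion_semitones / 2
    # Random pitch targets, in semitones relative to the mean pitch.
    targets = [rng.uniform(-half, half) for _ in range(n_targets)]
    contour = []
    for i in range(n_frames):
        # Position of this frame along the sequence of targets.
        pos = i * (n_targets - 1) / max(n_frames - 1, 1)
        left = min(int(pos), n_targets - 2)
        frac = pos - left
        semitones = targets[left] * (1 - frac) + targets[left + 1] * frac
        contour.append(mean_pitch * 2 ** (semitones / 12))
    # Rescale so that the contour mean equals the mean pitch of the test items.
    scale = mean_pitch / (sum(contour) / len(contour))
    return [f0 * scale for f0 in contour]
```

Rescaling by a constant factor shifts the whole contour in semitone terms, so the overall pitch range stays within the chosen excursion limit.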

Recommendation 20
Use time-frequency warping of optimal human speech to create a grid of overall quality reference conditions.

Recommendation 21
Try to avoid the use of functional tests to assess overall output quality: on-line reaction time tests are difficult to interpret and off-line comprehension tests are difficult to develop.

Recommendation 22
If you are determined to develop a comprehension test, be aware that reading tests may be too compact to be used as listening tests; adapt the materials or use materials that are meant to be listened to.

Recommendation 23
Use open comprehension questions rather than closed ones, the former being more sensitive than the latter.

Recommendation 24
When administering a comprehension test, include a topline reference with a dedicated speaker realising exactly the same texts that are presented in a synthetic version; use different groups of subjects for the various speech conditions (or, better still, block conditions over listeners such that no listener hears more than one version of the same text while at the same time each listener gets an equal number of different text versions).
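
Blocking conditions over listeners amounts to a Latin-square style rotation. The Python sketch below is one possible way to build such a plan; the function name and the data layout are assumptions for illustration, and it requires the number of texts to be a multiple of the number of versions so that every listener gets an equal number of each version.

```python
def assign_versions(n_listeners, texts, versions):
    """Rotate speech versions over texts so that each listener hears every
    text exactly once and an equal number of texts in each version.

    Returns a dict: listener index -> {text: version}.
    """
    if len(texts) % len(versions) != 0:
        raise ValueError("number of texts must be a multiple of the number of versions")
    plan = {}
    for listener in range(n_listeners):
        # Shift the version by one position for each successive listener.
        plan[listener] = {
            text: versions[(t + listener) % len(versions)]
            for t, text in enumerate(texts)
        }
    return plan
```

With four texts and the versions human and synthetic, listener 0 hears texts 1 and 3 in the human version and texts 2 and 4 in the synthetic version, while listener 1 gets the complementary assignment. Using a number of listeners that is a multiple of the number of versions keeps each text version equally represented across the panel.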

Recommendation 25
When interpreting comprehension results, look at difference scores (synthetic compared to human) rather than at absolute scores to abstract from the intrinsic difficulty of questions.
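
As a small illustration of why difference scores abstract from question difficulty, the following Python sketch computes per-question differences. The function name and the use of proportion-correct scores keyed by question are assumptions; the example figures are invented for illustration and are not results.

```python
def difference_scores(human_scores, synthetic_scores):
    """Per-question difference: synthetic minus human proportion correct.

    Both arguments map a question identifier to a proportion correct
    (hypothetical layout); every question must occur in both.
    """
    return {q: synthetic_scores[q] - human_scores[q] for q in human_scores}


# Invented example: q1 is easier than q2 in absolute terms, yet the loss
# relative to human speech is the same for both questions.
human = {"q1": 0.75, "q2": 0.5}
synthetic = {"q1": 0.5, "q2": 0.25}
print(difference_scores(human, synthetic))
```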

Recommendation 26
Since there is no consensus on the most appropriate judgment scales to evaluate overall quality, choose between:

Intelligibility, naturalness, and acceptability (SAM Overall Quality test),
Acceptance, overall impression, listening effort, and comprehension problems (ITU-T), or only listening effort (practice in telephony),
Intelligibility, general quality, and naturalness (Van Bezooijen & Jongenburger 1993).

Recommendation 27
It is important that the scale positions have a clear meaning for the subjects and that the scale is wide enough to allow differentiation among systems compared. Use at least a 10-point scale.

Recommendation 28
Use the CLID Test for the evaluation of the segmental intelligibility at the word level, both for diagnostic and comparative purposes (in the latter case the stimulus set can be smaller).

Recommendation 29
Use the SUS Test to evaluate intelligibility for comparative purposes at the sentence level.


WWW Administrator
Fri May 19 11:53:36 MET DST 1995