Recommendation 1
Use a glass box approach if you want
diagnostics in order
to improve your speech output system.
Recommendation 2
Use a black box approach if you want to assess the
overall performance of speech output systems.
Recommendation 3
Do not rely on laboratory
tests. As soon as there is a
discrepancy between the
laboratory setting and the true field situation (in terms of
environment, tasks, type of
listener), field testing is necessary.
Recommendation 4
Always exclude hearing-impaired subjects from speech output assessment. Within the
SAM project (Howard-Jones 1992a, 1992b) it is specified that subjects should pass the
hearing screening test at 20 dB HL at all octave frequencies from 500 to 4000 Hz.
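The screening criterion can be written down as a simple check. The sketch below is illustrative only: the function name and the data layout (a mapping from frequency in Hz to measured threshold in dB HL) are assumptions, as is the reading that a subject passes when every threshold is at or below 20 dB HL.

```python
# Octave frequencies from 500 to 4000 Hz, as named in the screening criterion.
SCREENING_FREQS_HZ = (500, 1000, 2000, 4000)
SCREENING_LEVEL_DB_HL = 20


def passes_hearing_screening(thresholds_db_hl):
    """Return True if every screening frequency has a threshold <= 20 dB HL.

    `thresholds_db_hl` is a hypothetical mapping {frequency_hz: threshold_db_hl}.
    A frequency that was not measured counts as a failure.
    """
    return all(
        freq in thresholds_db_hl
        and thresholds_db_hl[freq] <= SCREENING_LEVEL_DB_HL
        for freq in SCREENING_FREQS_HZ
    )


# Example: one subject within the limit, one exceeding it at 1000 Hz.
subject_a = {500: 10, 1000: 15, 2000: 20, 4000: 5}
subject_b = {500: 10, 1000: 25, 2000: 20, 4000: 5}
eligible = [s for s in (subject_a, subject_b) if passes_hearing_screening(s)]
```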
Recommendation 5
Never use the same subject more than once, unless, of
course, one is interested in the
effect of repeated exposure.
Recommendation 6
In diagnostic testing only include
subjects speaking the
same language (variety) as
the language (variety) tested.
Recommendation 7
In diagnostic testing, hire a trained phonetician (with a
basic understanding of the
relationships between articulation and acoustics) in
the
initial stages of development of a
system in order to obtain subtle information (e.g. degree of
voicing in plosives), or
information that is usually not used for functional purposes
in real-life communication
(e.g. formal aspects of temporal organisation and intonation,
cf. Terken 1993).
Recommendation 8
In specialised applications, select subjects who are
representative of the (prospective)
users. For example, synthesis integrated in a reading machine for the blind
should be tested with visually handicapped subjects, and synthesis for long-term
use should be tested with subjects with different degrees of experience and
familiarisation with the type of synthetic speech of interest.
Recommendation 9
Synthesis to be used by the general public for incidental
purposes, i.e. which should
be functionally adequate in a first confrontation, should be
tested with a wide variety of
subjects, including people with a limited command
of the
language, dialect speakers, and
people of different ages. However, none of them should have
any experience in listening to
synthetic speech. In telecommunications research, groups of
between 12 and 16 subjects
(all with English as their primary
language) have been found
sufficient to obtain stable
mean values in judgment tests.
Recommendation 10
Use categorical estimation in rapid judgment tests with a view to internal
comparison, and when you do, use at least a 10-point scale.
Recommendation 11
Use magnitude estimation for test-external comparison, and when you do, use line
length with imaginary ideal speech as a reference.
Recommendation 12
Absolute segmental topline: In the case of
allophone
synthesis, use human speech
produced by a designated talker, i.e. the same individual on
whose speech the table values
and synthesis rules were based, or who, in the case of
concatenative synthesis, provided
the basic synthesis building
blocks. The absolute topline
reference will then be based on
CD-quality digital speech.
Recommendation 13
Relative segmental topline for parametric synthesis: A
second useful topline reference
is the human reference speech (see
Recommendation 12) but
analysed and (re-)synthesised
using exactly the same coding scheme that is employed in the
speech output system to be tested.
Recommendation 14
Relative segmental topline for waveform concatenation:
Use the same
(lower) bitrate
in the reference condition as in the speech output system.
Recommendation 15
Segmental baseline for allophone synthesis: Use speech in
which all segments retain
their table values and are strung together merely by
smoothing
spectral discontinuities at segment boundaries.
Recommendation 16
Segmental baseline for concatenative synthesis: Use speech made by stringing together
coarticulatorily neutral phones (i.e. stressed vowels spoken between two /s/-es, or
stressed consonants preceded by schwa and followed by an unrounded central vowel, cf. the
`neutrone' condition in Van Bezooijen & Pols 1993). Minimal smoothing should be applied
to avoid spectral jumps.
Recommendation 17
Temporal and melodic topline: Copy, as accurately as
possible within the limitations
of the synthesiser, the temporal structures and speech
melodies of a single designated
professional human speaker onto the synthetic speech
output.
Recommendation 18
Temporal baseline: Use a condition in which the smallest
synthesis building blocks
(phoneme, diphone, demisyllable) retain their original,
unmanipulated durations as they
were copied from the human original
from which they were
extracted (or, in the case of allophone synthesis, the phoneme duration table values,
cf. Carlson, Granström & Klatt 1979).
Recommendation 19
Melodic baselines: Synthesise utterances on a
monotone,
at a pitch level that coincides with the average pitch of the test
items. Also, include a random melodic reference for
the sake of validation, by introducing random pitch variations
(in terms of excursion size,
rate of change, and segmental
alignment), within
physiologically and linguistically
reasonable limits and with a mean pitch equal to the average
of the test items.
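The two melodic baselines can be sketched as pitch contours sampled frame by frame. This is a minimal illustration, not a prescribed procedure: the function names, the frame-based representation, and the default limits of 40 Hz (excursion) and 5 Hz per frame (rate of change) are all assumptions, and segmental alignment is not modelled here.

```python
import random


def monotone_contour(test_item_pitches_hz, n_frames):
    """Flat contour at the average pitch of the test items."""
    mean_pitch = sum(test_item_pitches_hz) / len(test_item_pitches_hz)
    return [mean_pitch] * n_frames


def random_contour(mean_pitch_hz, n_frames,
                   max_excursion_hz=40.0, max_step_hz=5.0, seed=None):
    """Bounded random walk whose mean equals `mean_pitch_hz`.

    `max_step_hz` limits the change between consecutive frames and
    `max_excursion_hz` clamps the walk before re-centring. Re-centring
    preserves the step limit, but the peak deviation from the mean can
    then exceed `max_excursion_hz` (it stays within twice that value).
    """
    rng = random.Random(seed)
    offsets = [0.0]
    for _ in range(n_frames - 1):
        step = rng.uniform(-max_step_hz, max_step_hz)
        clamped = max(-max_excursion_hz,
                      min(max_excursion_hz, offsets[-1] + step))
        offsets.append(clamped)
    shift = sum(offsets) / len(offsets)  # remove any drift in the mean
    return [mean_pitch_hz + o - shift for o in offsets]


# Example: both baselines centred on the average pitch of two test items.
flat = monotone_contour([100.0, 140.0], n_frames=50)
rand = random_contour(flat[0], n_frames=50, seed=1)
```

Fixing the seed makes the random reference reproducible across test sessions.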
Recommendation 20
Use time-frequency warping of optimal human speech to
create a grid of overall
quality reference conditions.
Recommendation 21
Try to avoid the use of functional tests to assess
overall output quality: on-line
reaction time tests are difficult to interpret and off-line
comprehension tests are difficult to
develop.
Recommendation 22
If determined to develop a comprehension test, beware of
the fact that reading tests
may be too compact to be used as listening tests; adapt the
materials or use materials that are meant to be listened
to.
Recommendation 23
Use open comprehension questions rather than closed ones,
the former being more sensitive than the latter.
Recommendation 24
When administering a comprehension test, include a topline reference with a
dedicated speaker realising exactly the same texts presented in a synthetic version; use
different groups of subjects for the various speech conditions (or, better still, block
conditions over listeners such that no listener hears more than one version of the same
text while at the same time each listener gets an equal number of different text versions).
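One way to block conditions over listeners is a rotation in the style of a Latin square. The sketch below is an assumption about how such a plan might be generated, not a procedure taken from the recommendation; the function name and the condition labels are hypothetical. It requires the number of texts to be a multiple of the number of conditions so that each listener hears equally many texts per condition.

```python
def assign_versions(n_listeners, texts, conditions):
    """Return {listener_index: {text: condition}} using a rotating scheme.

    Each listener hears every text exactly once, and the condition is
    rotated by listener index so that versions are balanced.
    """
    k = len(conditions)
    if len(texts) % k != 0:
        raise ValueError("number of texts must be a multiple of the number of conditions")
    plan = {}
    for listener in range(n_listeners):
        plan[listener] = {
            text: conditions[(listener + i) % k]
            for i, text in enumerate(texts)
        }
    return plan


# Example: four texts, a human topline and one synthetic version.
plan = assign_versions(4, ["t1", "t2", "t3", "t4"], ["human", "synthetic"])
```

If the number of listeners is also a multiple of the number of conditions, each text is heard equally often in each condition across the group.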
Recommendation 25
When interpreting comprehension results, look at
difference scores (synthetic
compared to human) rather
than at absolute scores to abstract
from the intrinsic difficulty of questions.
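As a small illustration of why difference scores help, the sketch below subtracts the human score from the synthetic score for each question. The per-question proportions are invented purely for the example, and the function name is hypothetical.

```python
def difference_scores(human_scores, synthetic_scores):
    """Per-question difference: synthetic minus human (negative = loss)."""
    return {
        question: synthetic_scores[question] - human_scores[question]
        for question in human_scores
    }


# Invented proportions correct: q1 is an easy question, q2 a hard one.
human = {"q1": 0.9, "q2": 0.5}
synthetic = {"q1": 0.7, "q2": 0.3}
diffs = difference_scores(human, synthetic)
```

In this made-up case the absolute scores for the two questions differ widely, yet the loss relative to human speech is about the same for both, which is the quantity of interest.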
Recommendation 26
Since there is no consensus on the most appropriate
judgment scales to evaluate overall
quality, choose between:
[...] (Van Bezooijen & Jongenburger 1993).
Recommendation 27
It is important that the scale positions have a clear
meaning for the subjects and that
the scale is wide enough to allow differentiation among the systems compared.
Use at least a 10-point scale.
Recommendation 28
Use the CLID Test for the evaluation of the segmental
intelligibility at the word level,
both for diagnostic and comparative purposes (in the latter
case the
stimulus set can be smaller).
Recommendation 29
Use the SUS Test to evaluate intelligibility for
comparative purposes at the sentence level.