The great majority of speech output assessment techniques use listening experiments involving human subjects, i.e. functional and/or judgment tests of speech output at the acoustic level. In the following subsections we will discuss a number of methodological issues that are relevant especially to this type of testing. The issues concern the choice of subjects, test procedures, benchmarks and reference conditions, and precautions to ensure cross-language comparability. Although there is no a priori reason why this should be so, no accepted methodology seems to exist for other types of speech output evaluation techniques. As will be obvious from later sections, for example, no accepted methodology can be identified in the field of output evaluation at the symbolic linguistic level. It is unclear in this area what kinds of textual materials should be used in tests, what error categories should be distinguished, and what scoring procedures should be used. We will therefore limit the methodological discussion to acoustic output testing techniques involving human listeners.
One of the most important aspects of a measuring instrument is its reliability. How reliable, for example, is subjects' performance in functional intelligibility tests when tested several times? Test/retest intrasubjective reliability of intelligibility was assessed by
Logan, Greene and Pisoni (1989) and by Van Bezooijen (1988); in both cases it was found to be good. More attention has been paid to subject dimensions systematically affecting intersubjective reliability. This research was motivated by the finding of large variance in the test scores, possibly obscuring effects of the synthesis systems compared. Most studies in this area examined variability in intelligibility scores. Subject dimensions considered relevant include: age, non-expert experience with synthetic speech, expert experience with synthetic speech, and analytic listening.
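Test/retest reliability of the kind mentioned above can be quantified as the correlation between two scoring sessions with the same subjects. The sketch below is a minimal illustration using a hand-rolled Pearson correlation; the subject scores are invented example data, not figures from the studies cited.

```python
# Hypothetical illustration of test/retest (intrasubjective) reliability:
# the Pearson correlation between two sessions of the same subjects.
# All scores are invented for illustration.

def pearson_r(x, y):
    """Pearson product-moment correlation between two equal-length lists."""
    n = len(x)
    mx = sum(x) / n
    my = sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

# Percent-correct identification scores for six subjects, tested twice.
session1 = [58, 63, 71, 66, 74, 60]
session2 = [61, 62, 73, 68, 72, 59]

r = pearson_r(session1, session2)
print(round(r, 2))  # a value near 1 indicates good test/retest reliability
```

A correlation close to 1 would correspond to the "good" reliability reported in the studies above, though the published work may of course have used other statistics.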
Within the ESPRIT-SAM project (Howard-Jones 1992a; Howard-Jones 1992b), the effect of age was examined with Italian VCV-items. Five age categories were distinguished (10--19, 20--29, 30--44, 45--59, over 60), with between 5 and 8 subjects per group. The group scores of percentage correct consonant identification ranged from 58% for the oldest group to 64% for the youngest group. So, little evidence was found for an effect of the subject dimension age.
Non-expert experience with synthetic speech was investigated in several studies.
Howard-Jones (1992a; 1992b) compared the performance of 8 subjects experienced with synthetic speech and 24 inexperienced subjects. German VCV-items were presented. The mean score for the experienced subjects was 79%, that for the inexperienced subjects 62%. There is further evidence that the intelligibility of synthetic speech increases as a result of non-expert experience with synthetic speech, both when acquired in the form of training with feedback (e.g. Greenspan, Nusbaum & Pisoni 1985; Schwab, Nusbaum & Greene 1985) and when acquired in a more natural way without feedback (Pisoni, Greene & Nusbaum 1985; Pisoni, Nusbaum & Greene 1985; Boogaart & Silverman 1992). The learning effect has been found to manifest itself after only a few minutes of exposure. However, there are indications that the effect of learning depends on the type of synthesis used. Jongenburger and Van Bezooijen (1992) assessed the intelligibility of two synthesis systems used by the visually handicapped for reading a digital daily newspaper, in a first confrontation and after one month of experience. An open response CVC identification test was used. For one system, which was allophone based, consonant intelligibility increased from 58% to 79%; for the other system, which was diphone based, intelligibility increased from 63% to 68%. It was hypothesised that the characteristics of allophone-based synthesis are easier to learn because they are rule-governed and therefore more invariant than those of diphone-based synthesis. Moreover, no transfer was found from experience with one type of synthesis to the understanding of the other type of synthesis. This suggests that there is no such thing as general experience in listening to synthetic speech.
The subject dimension expert experience with synthetic speech was examined by Howard-Jones (1992a) with English VCV-items. A percentage correct consonant identification of 30% was obtained for the inexperienced subjects versus 49% for the experts. So, again, improved performance was found as a function of increased exposure.
The last subject dimension we want to mention is experience in listening analytically to speech. On the basis of a reanalysis of the results of a number of their evaluation studies, Van Bezooijen and Pols (1993) conclude that the more ear-training subjects have, the higher the percentages correct they attain. Furthermore, ear-training was found to result in a reduction of intersubjective differences.
Apart from variance which can be attributed to particular subject dimensions, much apparently individual variability is found in test scores. Hazan and Shi (1993) examined the variance in subject scores in various tests, including intelligibility of meaningless VCV-items, intelligibility of Semantically Unpredictable Sentences (SUS, see Appendix 1 G), and speech pattern identification for plosive place and voicing contrasts. A homogeneous group of subjects was used.
Despite the homogeneity of the subject group, a sizeable degree of variability was found in all tests. For the SUS the range (i.e. the difference between the best and worst performing subject) was 28%; for the CVC-test the range was 47%. At the level of speech pattern processing, considerable differences were found in the perceptual weighting given to individual cues to plosive place and voicing contrasts. Hazan & Shi attribute the variability not to audiological differences among listeners, but to the development of different perceptual strategies during language acquisition. They distinguish two types of listeners: ``auditors'' (i.e. users of acoustic information) and ``comprehenders'' (i.e. users of global contextual information).
Having established that there is much variability in the scores obtained in speech output evaluation tests, part of which can be attributed to clearly identifiable subject dimensions such as previous experience with synthetic speech, one may wonder what implications this has for the selection of subjects in specific tests. We think that the implications for subject selection depend in part on the type of test administered: black box versus glass box, comparative versus diagnostic, application for the general public versus application for a specific group of users, etc. However, some general recommendations can be formulated as well.
Recommendation 4
Always exclude hearing-impaired subjects from speech output assessment. Within the SAM project (Howard-Jones 1992a; Howard-Jones 1992b) it is specified that subjects should pass a hearing screening test at 20 dB HL at all octave frequencies from 500 to 4000 Hz.
Recommendation 5
Never use the same subject more than once, unless, of
course, one is interested in
the effect of repeated exposure.
Recommendation 6
In
diagnostic testing only include subjects speaking
the same language (variety) as
the language (variety) tested.
Recommendation 7
In diagnostic testing, hire a trained phonetician (with a basic understanding of the relationships between articulation and acoustics) in the initial stages of development of a system in order to obtain subtle information (e.g. degree of voicing in plosives), or information that is usually not used for functional purposes in real-life communication (e.g. formal aspects of temporal organisation and intonation, cf. Terken 1993).
The following approach is recommended not only because of (possible) differences in the perception of the speech output, but also because motivation is known to play an important role in the effort people are willing to invest in order to understand suboptimal speech. If people have a choice between human and synthetic speech, the synthetic speech will have to be good if it is to have a chance of being accepted. However, if people do not have a choice, e.g. the visually handicapped who without synthesis (or braille) would not have access to a daily newspaper, synthesis will be accepted more easily.
Recommendation 8
In specialised applications, select subjects who are representative of the (prospective) users. For example, synthesis integrated in a reading machine for the blind should be tested with visually handicapped subjects. And synthesis for long-term use should be tested with subjects with different degrees of experience and familiarisation with the type of synthetic speech of interest.
Recommendation 9
Synthesis to be used by the general public
for
incidental purposes, i.e. which
should be functionally adequate in a first confrontation,
should be tested with a wide
variety of subjects, including people with a limited command
of the language, dialect
speakers, and people of different ages.
However, none of them
should have any
experience in listening to synthetic speech. In
telecommunications research, groups of
between 12 and 16 subjects (all with English as their primary
language) have been
found sufficient to obtain stable mean
values in judgment
tests.
As indicated in section , speech output assessment techniques
can be differentiated along a
number of parameters. No parameters related to the actual test
procedure were included
there. Test
procedures can vary with respect to subjects (see
section
), stimuli, and
response modality.
Stimuli can vary along a large number of parameters, the most important of which are listed below.
In Appendix 1, summary descriptions of tests are given where the stimuli have been categorised along these stimulus parameters.
Response modality can vary along a number of parameters as well. The choice seems to be mainly determined by three factors: comparative versus diagnostic, functional versus judgment, and TTS-development versus psycholinguistic interest. In the five types of response modalities listed below, 1 and 2 are mainly used within the glass box approach (1 in TTS-development, 2 in psycholinguistically oriented research), whereas 3, 4 and 5 are more common in the black box approach. The latter three response modalities can be further differentiated in that 3 and 4 are functional in nature (3 in TTS-development, 4 in psycholinguistically oriented research), whereas 5 represents judgment testing. In the list of response modalities a distinction is made between off-line tests, where subjects are given some time to reflect before responding, and on-line tests, where an immediate response is expected from the subjects, tapping the perception process before it is finished.
Transcription can be:
(cf. Pisoni, Greene & Nusbaum 1985; Pisoni, Nusbaum & Greene 1985; Manous et al. 1985; Pavlovic, Rossi & Espesser 1990; Delogu et al. 1991; ITU-T 1993).
The last response modality will be discussed in some more detail. Pavlovic and co-workers have conducted an extensive series of studies (cf. Pavlovic, Rossi & Espesser 1990) comparing different types of scaling methods that can be used in judgment tests to evaluate speech output. Much attention was paid to the differences between categorical estimation and magnitude estimation.
Pavlovic et al. stress that there are important differences between the two types of scaling methods, for example the fact that categorical estimation results in an interval scale, whereas magnitude estimation results in a ratio scale. The former leads to the use of raw ratings, the calculation of the arithmetic mean, and the comparison of conditions in terms of differences; the latter leads to the use of the logarithm of the ratings, the geometric mean, and comparison in terms of ratios. The differences also have implications for the type of conclusions to be drawn from the test results. Both the categorical estimation method (with a 20-point scale) and the magnitude estimation method have been included in SOAP as standard SAM Overall Quality test procedures (see K in Appendix 1).
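The two analysis styles described above can be sketched in a few lines of code. The ratings below are invented for illustration: a 20-point categorical scale analysed with the arithmetic mean and differences, versus unbounded magnitude estimates analysed with the geometric mean (via logarithms) and ratios.

```python
import math

# Sketch of the two scaling analyses (all ratings invented):
# categorical estimation -> raw ratings, arithmetic mean, differences;
# magnitude estimation  -> log ratings, geometric mean, ratios.

cat_ratings_a = [12, 14, 13, 15]   # 20-point categorical scale, system A
cat_ratings_b = [9, 10, 8, 11]     # same scale, system B

mag_ratings_a = [120, 150, 135, 140]  # unbounded magnitude estimates
mag_ratings_b = [60, 75, 70, 65]

arith_a = sum(cat_ratings_a) / len(cat_ratings_a)
arith_b = sum(cat_ratings_b) / len(cat_ratings_b)
diff = arith_a - arith_b   # interval scale: compare conditions by difference

geo_a = math.exp(sum(math.log(r) for r in mag_ratings_a) / len(mag_ratings_a))
geo_b = math.exp(sum(math.log(r) for r in mag_ratings_b) / len(mag_ratings_b))
ratio = geo_a / geo_b      # ratio scale: compare conditions by ratio

print(diff, round(ratio, 2))
```

The conclusions differ accordingly: with categorical estimation one may say "A is 4 scale points better than B", with magnitude estimation "A is judged about twice as good as B".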
Recommendation 10
Use categorical estimation in rapid judgment tests with a view to test-internal comparison, and when you do, use at least a 10-point scale.
Recommendation 11
Use magnitude estimation for test-external comparison, and when you do, use line length with imaginary ideal speech as a reference.
By a benchmark test we mean an efficient, easily administered test, or set of tests, that can be used to express the performance of a speech output system (or some module thereof) in numerical terms. The benchmark itself is the value that characterises some reference system, against which a newly developed system is (implicitly) set off. The benchmark is preferably chosen such that it represents a performance level that is known to guarantee user satisfaction. Consequently, if the performance of a new product exceeds the benchmark, its designer or prospective buyer is assured of at least a satisfactory product, and probably even better. Obviously, testing against a benchmark is more efficient than pairwise or multiple testing of competing products.
At this time it is too early to talk about either existing benchmarks or benchmark tests. It is clear, however, that the development of benchmarking deserves high priority in the speech output assessment field. As a first step, existing tests should be scrutinised for their potential use as benchmark tests. Choices should be made as to what aspects to include in benchmark tests (overall performance, composite performance by a number of crucial modules), and what system to adopt as the reference on which the benchmark value should be based. In this respect, it seems to us that one should not adopt the performance of human speech as the benchmark. Human speech, at least when produced by professional talkers, will simply be too good for the purpose of benchmarking. Since human speech will always be superior to synthetic speech, the quality of the latter will have to be expressed as a fraction, which makes it hard to compare the relative differences between different types of synthetic speech. What we need is a speech output system of proven, but still imperfect, quality. This is, quite probably, the reason why the quality of many speech output systems for English is often expressed relative to the `Paul' voice of MITalk/DECTalk, which has long served as the de facto standard in TTS.
Next to a widely accepted benchmark, it would appear to us that designers of speech output systems should want to know how well their systems perform relative to some optimum, and what performance could be expected of a system that contains no intelligence at all. In other words, the designer is looking for topline and baseline reference conditions. Reference conditions such as these do not yield diagnostic information in the strict sense of the word. However, they do provide the systems developer with an estimate of how much improvement can still be made to a system as a whole (in a black box approach) or to specific modules (in a glass box approach).
There has been no general practice to include
topline and
baseline reference conditions in
segmental quality testing (section ). Still, it seems
to
us that it is important to
reach consensus on a number of measures. If the output system
uses waveform
concatenation techniques, the designer will want to know how
well the synthesis performs
relative to the live human speaker or, to facilitate procedures, to some electronic
operationalisation of live speech (e.g. CD quality speech
recorded at a short distance from
the speaker's mouth in a quiet environment). However, if the
system's waveforms have
been coded with a lower bitrate
than CD quality, the designer
should determine to what
extent degradation of system performance is due to the
synthesis itself as opposed to the
non-optimal bitrate. An easy way to determine this, is to
adopt a second reference
condition using the same
(lower) bitrate as the synthesis.
This precaution is even more
necessary for parametric synthesis. Obviously, no type of
parametric synthesis can be
better than the maximum quality that is afforded by the
analysis-resynthesis coding scheme
adopted for
the synthesiser. This requirement can generally be
fulfilled when LPC
synthesis schemes are used. However, for a range of
synthesisers (e.g. the Klatt and the
JSRU synthesisers) no automatic parameter estimation for
straightforward analysis-resynthesis
is possible at this time. The optimal parametric
representation of human
reference materials will then have to be found by trial and
error, or else the attempt should
be abandoned.
The designer of an output system claims that the intelligence incorporated into the synthesis system (e.g. through rules) makes the system perform better than with no intelligence built in at all. In order to establish the extent to which this claim is true, a baseline condition is needed which consists of a type of synthetic speech that has no knowledge of speech processes at all.
Recommendation 12
Absolute segmental topline: In the case of allophone
synthesis, use human speech
produced by a designated talker, i.e.
the same individual
on whose speech the table
values and synthesis rules were based, or who, in the case
of concatenative synthesis,
provided the basic synthesis building blocks. The absolute
topline reference will then be
based on CD-quality
digital speech.
Recommendation 13
Relative segmental topline for parametric synthesis: A
second useful topline
reference is the human reference speech (see Recommendation
12) but analysed and (re-)synthesised using
exactly the same coding scheme that is
employed in the speech output
system to be tested.
Recommendation 14
Relative segmental topline for waveform concatenation:
Use the same (lower)
bitrate in the reference
condition as in the speech output
system.
Recommendation 15
Segmental baseline for allophone synthesis: Use speech in
which all segments retain
their table values and are strung together merely by
smoothing spectral
discontinuities
at segment boundaries.
Recommendation 16
Segmental baseline for concatenative synthesis: Use speech made by stringing together coarticulatory neutral phones (i.e. stressed vowels spoken between two /s/-es, or stressed consonants preceded by schwa and followed by an unrounded central vowel, cf. the `neutrone' condition in Van Bezooijen & Pols 1993). Minimal smoothing should be applied to avoid spectral jumps.
The need for suitable topline and baseline reference
conditions has clearly been recognised
in the field of prosody (i.e. temporal and
melodic
structure, cf. section ) testing.
The following are recommendations for prosodic topline and
baseline
conditions. Note
that, in contrast to segmental evaluation, listeners often
find it very difficult to differentiate
between different prosodic versions of an utterance. Therefore
testers often need examples
of `very bad' systems to check whether the
listeners are
indeed sensitive to prosodic
differences.
Recommendation 17
Temporal and melodic topline: Copy, as accurately as
possible within the
limitations of the synthesiser, the temporal structures and
speech melodies of
a single
designated professional human speaker onto the synthetic
speech output.
Recommendation 18
Temporal baseline: Use a condition in which the smallest synthesis building blocks (phoneme, diphone, demisyllable) retain their original, unmanipulated durations as they were copied from the human original from which they were extracted (or, in the case of allophone synthesis, the phoneme duration table values, cf. Carlson, Granström & Klatt 1979).
This baseline condition, then, contains no intelligence, so that any improvement in the target conditions with duration rules must be due to the added explicit knowledge of duration structure. A reference in which segment durations vary at random (within realistic bounds) can be included for validation purposes, as an example of a `very bad' system. Listeners should rate this condition as poorer than any other condition.
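The random-duration validation reference can be sketched as follows. The helper name, the maximum deviation of 40%, and the segment durations are illustrative assumptions, not prescriptions from the text; the only requirement stated above is that durations vary at random within realistic bounds.

```python
import random

# Hypothetical sketch of the random-duration validation reference:
# each segment duration is rescaled by a random factor within
# realistic bounds; listeners should rate the result below all
# other conditions.

def random_duration_reference(durations_ms, max_dev=0.4, seed=1):
    """Randomly rescale each segment duration by up to +/- max_dev."""
    rng = random.Random(seed)
    return [d * (1.0 + rng.uniform(-max_dev, max_dev)) for d in durations_ms]

segments = [70, 120, 55, 90, 140]  # invented segment durations in ms
jittered = random_duration_reference(segments)
print([round(d) for d in jittered])
```

Fixing the random seed makes the degraded reference reproducible across test sessions, which matters if the same stimuli are to be presented to several listener groups.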
Recommendation 19
Melodic baselines: Synthesise utterances on a
monotone, at a pitch level that coincides with the average pitch of the test items. Also,
include a random melodic reference
for the sake of validation, by introducing random pitch
variations (in terms of
excursion
size, rate of change, and segmental alignment), within
physiologically and linguistically
reasonable limits and with a mean pitch equal to the
average of the test items.
A practical problem is that not every synthesiser allows the generation of a monotonous pitch, so that some sort of waveform manipulation (e.g. pitch synchronous overlap and add, PSOLA) may have to be used in order to monotonise the synthetic melody.
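The two melodic reference conditions of Recommendation 19 can be sketched as target F0 contours. The function names, frame count, mean pitch, and the 40 Hz excursion bound are illustrative assumptions; the recommendation only requires a monotone at the average pitch and random variation within physiologically and linguistically reasonable limits.

```python
import random

# Hypothetical sketch of the two melodic reference conditions:
# a monotone contour at the mean pitch of the test items, and a
# random contour with bounded excursions around that same mean.

def monotone_contour(f0_hz, n_frames):
    """Flat F0 target at the utterance's average pitch."""
    return [f0_hz] * n_frames

def random_contour(f0_hz, n_frames, max_excursion_hz=40, seed=2):
    """Random pitch targets within +/- max_excursion_hz of the mean.

    The mean of the uniform draws is only approximately equal to
    f0_hz, which suffices for a validation reference.
    """
    rng = random.Random(seed)
    return [f0_hz + rng.uniform(-max_excursion_hz, max_excursion_hz)
            for _ in range(n_frames)]

mean_f0 = 110.0  # invented mean pitch of the test items, in Hz
flat = monotone_contour(mean_f0, 50)
rand = random_contour(mean_f0, 50)
print(flat[0], min(rand) >= 70.0, max(rand) <= 150.0)
```

Contours like these would then be imposed on the synthetic speech via whatever pitch manipulation the synthesiser (or a PSOLA post-processor) provides.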
In the area of voice characteristics (voice quality, section
), the problem of
reference conditions has not been recognised. Generally, there
seems to be little point in
laying down a baseline reference for voice quality. The
choice
of a suitable topline would
depend on the application of the speech output system. If the
goal is personalised speech
output (for the vocally handicapped) or automatic speaker
conversion (as in interpreting
telephony), the obvious topline is the
speaker who is being
modelled by the system, using
the same coding scheme when applicable. When a general purpose
(i.e. non-personalised)
speech output system is the goal, one would first need to know
the desired voice quality,
i.e. ideal voices should
be defined for specific applications,
and speakers should be located
who adequately represent the ideal voices. At this time we
will refrain from making any
further suggestions on this matter. The definition of `ideal'
voices and voice qualities,
and
the implementation of topline references should be a matter of
priority in the near future.
Given the existence of an overall quality topline
reference
condition, it would be advantageous to have a set of reference conditions that are poorer
than the optimum by a
number of calibrated steps until a quality equal to or less
than the baseline reference is
reached (see also section ). Such a set of reference
conditions would yield a grid within
which each type of speech, whether produced by
humans or by
machines, can be located
and compared with other types of speech. Recently, attempts
have been made at creating
such a continuum of reference conditions by taking
high-quality human speech and
applying some calibrated distortion to it, such
as
multiplicative white noise at various
signal-to-noise ratios (`Modulated Noise Reference Unit' or MNRU, cf. ITU-T Recommendation P.81), or time-frequency warping (TFW, ITU-T Recommendation P.85, cf. Burrell 1991; or T-reference, cf. Cartier et al. 1992).
TFW introduces greater or lesser (random) deviations from the
mean rate of a recorded
utterance (2.5% ... 20%) over successive stretches of
150 ms, so that the
speech contains potentially disturbing accelerations and
decelerations and associated
frequency shifts.
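The warping scheme just described can be sketched as follows. The sign-flip scheme (each 150 ms stretch randomly sped up or slowed down by the full warp depth) is a simplifying assumption for illustration; the actual P.85 procedure draws its deviations differently.

```python
import random

# Hypothetical sketch of time-frequency warping (TFW): assign each
# successive 150 ms stretch a random rate deviation of a given depth,
# producing alternating accelerations and decelerations.

def tfw_stretch_factors(n_stretches, depth_percent, seed=3):
    """Random +/- rate deviations of the given depth, one per stretch."""
    rng = random.Random(seed)
    return [1.0 + rng.choice([-1, 1]) * depth_percent / 100.0
            for _ in range(n_stretches)]

def warp_durations(stretch_ms, factors):
    """Apply per-stretch rate factors to equal-length input stretches."""
    return [stretch_ms * f for f in factors]

factors = tfw_stretch_factors(n_stretches=8, depth_percent=20)
warped = warp_durations(150.0, factors)
print([round(w) for w in warped])  # each stretch lasts 120 or 180 ms
```

Varying `depth_percent` over the 2.5% ... 20% series then yields the graded set of distorted reference conditions that makes up the grid.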
Fellbaum, Klaus and Sotscheck (1994) showed that the MNRU is not suitable for the evaluation of synthetic speech. TFW of natural speech, however, provided a highly sensitive reference grid within which TTS systems could be clearly differentiated from each other in terms of judged listening effort (Johnston 1993). Moreover, Johnston showed that the perceived quality ordering among a range of TTS systems interacts with the sound pressure level at which the speech output is presented.
Recommendation 20
Use time-frequency warping of optimal human speech to
create a grid of overall
quality reference conditions.
Although it is generally agreed that in the final analysis all languages are equally complex, it cannot be denied that phonetic and phonological complexity differs widely from one language to the next. Languages differ in the size of their vowel and consonant inventories, in the complexity of syllable structures, stress rules, reduction processes, and so forth.
A number of systems are (commercially) available that provide multi-lingual speech output (e.g. DECTalk, Infovox, Multivox, Apollo). Generally, such systems were primarily developed for one language (American English, Swedish, Hungarian, and British English, respectively), and additional language modules were derived from the original language by making minimal changes to the basic units and rules. As a result it is commonly observed that the derived languages of multi-lingual systems sound poorer than the original. Yet, it is very difficult to establish this convincingly, since the poorer performance may be due (completely or in part) to the greater intrinsic difficulty of the sound system of the new language. Ultimately one would like to develop speech output assessment techniques that allow us to determine the quality of a system speaking language A and to compare its quality to that of another system speaking language B. In order to reach this objective, we would have to know how to weight the scores obtained for a language for the intrinsic difficulty or complexity of the relevant aspects in that language.
Such goals will not easily be accomplished. However, steps have been taken in the SAM project to ensure optimal cross-language comparability in the construction of the test materials and administration procedures. For example, in the Semantically Unpredictable Sentence Test (SUS Test, see Appendix 1 G), the same five syntactic structures (defined as linear sequences of functional parts of speech, e.g. Subject--Verb--Direct Object) are used in all languages tested, and words are substituted in each of the designated syntactic slots that are selected from the same lexical categories, and with the shortest word length allowed by the language (see G in Appendix 1). It should be obvious, however, that complete congruence cannot be obtained in this fashion: the shortest content words in Italian and Spanish are typically disyllables, whilst they are monosyllabic in the Germanic languages. Similarly, although all five syntactic structures occur in each of the languages tested, certain structures will be more common in one language than in another. Given the existence of such intrinsic and unavoidable structural differences between languages, we recommend further research into the development of valid cross-language normalisation measures.
Especially when working within the European Union, with its
increasing number of partner countries and languages, speech output products are likely
to be produced on a multilingual basis. The further development of efficient
testing
procedures that can be validly
used for all relevant languages is a clear priority. Yet, as
we explained in section , we
should not raise our hopes too high in this matter, given the
existence of intrinsic and
unavoidable structural differences between languages. For this
reason we
recommend
parallel research into the development of valid cross-language
normalisation measures, that
will allow us to realistically compare speech output test
results across languages, if the
choice of test materials cannot be balanced in all
relevant
linguistic aspects.
In this effort, ITU recommendation P.85 has potential.
Following this procedure (see section
) a reference grid can be constructed for each (EU)
language. One possible outcome
could be that some languages prove more resistant to
time-frequency warping than
others
(although we hesitate to make any predictions). Be this as it
may, differences in
intelligibility between languages would be effectively
normalised out when we determine
the quality of an output system relative to the reference grid
that is
applicable for the
language being tested.
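One way such grid-relative normalisation could work is sketched below. Each language gets its own reference grid mapping TFW warp depth to a judged quality score for warped natural speech; a system's quality is then expressed as the warp depth whose reference quality it matches. The function name, grid values, and linear interpolation are all illustrative assumptions.

```python
# Hypothetical sketch: normalising a system's score against a
# per-language TFW reference grid. The grid maps warp depth (%) to
# a judged quality score for warped natural speech in that language;
# a system is located on the grid by linear interpolation.

def equivalent_warp(grid, system_score):
    """Return the TFW warp depth whose reference score equals system_score."""
    pts = sorted(grid.items())  # (warp_percent, score); score falls with warp
    for (w0, s0), (w1, s1) in zip(pts, pts[1:]):
        if s1 <= system_score <= s0:
            frac = (s0 - system_score) / (s0 - s1)
            return w0 + frac * (w1 - w0)
    raise ValueError("score outside reference grid")

# Invented reference grids for two languages (10-point quality scale).
grid_lang_a = {0: 9.0, 5: 7.5, 10: 6.0, 20: 3.5}
grid_lang_b = {0: 8.5, 5: 7.5, 10: 6.5, 20: 4.0}

# Systems with different raw scores can occupy the same grid position,
# so their qualities become comparable across languages.
print(round(equivalent_warp(grid_lang_a, 6.75), 1))
print(round(equivalent_warp(grid_lang_b, 7.0), 1))
```

In this invented example the two systems receive different raw scores in their respective languages, yet both are equivalent to the same warp depth on their language's grid, which is exactly the kind of normalisation the paragraph above envisages.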