To structure our overview of TTS assessment tests we will discuss a number of useful distinguishing parameters, which partly overlap with earlier attempted taxonomies (see e.g. Van Bezooijen & Pols 1990; Pols 1991; Jekosch & Pols 1994; Goldstein 1995), and
explain the relationships among them, before dealing with any specific assessment techniques. The figure below illustrates the relationships among the various dichotomies that make up our taxonomy. It will be apparent from the figure that the dichotomies are non-orthogonal. For instance, tests that have been developed to evaluate specific modules within a speech output system (the glass box approach) will only be used in a laboratory setting. The subdivision of human listener tasks will only be relevant when tests involve human listeners; therefore no task variables will be specified under automated test techniques. In the subsequent paragraphs we will outline and discuss the taxonomy of speech output assessment techniques, following as much as possible the structure of the figure. Note that the dichotomies used are intended as contrastive dimensions, so as to create a multi-dimensional space within which speech output tests can be located. The terms involved in any dichotomy should not be construed as labels identifying mutually exclusive approaches to speech output evaluation.
Figure: Relationships among dimensions involved in a taxonomy of speech output evaluation methods. Any path from the root down to any terminal that does not cross a horizontal gap constitutes a meaningful combination of test attributes.
The levels in the figure are dealt with in the following sections:
Text-to-speech systems generally comprise a range of modules that take care of specific tasks. The first module (or complex of modules) converts an orthographic input string to some abstract linguistic code that is explicit in its representation of sounds and prosodic markers. Various modules then act upon this symbolic representation. Typically, one module concatenates the primitive building blocks (phonemes, diphones) in their appropriate order, another implements whatever coarticulation is needed to obtain smooth, human-like transitions between successive building blocks. Prosodic modules, taking the positions of word stresses, sentence accents, phrasal and sentence boundaries into account, are then called upon in order to provide an appropriate temporal organisation (local accelerations and decelerations, pauses) and speech melody.
End users will typically be interested in the performance of a system as a whole. They will consider the system as a black box that accepts text and outputs speech, a monolith without any internal structure. For them it is only the quality of the output speech that matters. In this way systems developed by different manufacturers can be compared, or the improvement of one system relative to an earlier edition can be traced over time (comparative testing). However, if the output is less than optimal it will not be possible to pinpoint the exact module or modules that caused the problem. For diagnostic purposes, therefore, designers often set up their evaluations in a more experimental (``glass box'') way. If the effects of all modules but one are kept constant, and the characteristics of the free module are varied systematically, any difference in the assessment of the system's output must be caused by the variations in the target module. Glass box testing, of course, presupposes that the researcher has control over the input and output of each individual module.
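The glass box procedure described above can be caricatured in a few lines of code. Everything here is invented for illustration: the pipeline stands in for a real synthesiser, the scoring function for a real listening test, and the rule names are hypothetical. The point is only the experimental logic: all stages are held constant except one free module, so score differences can be attributed to that module.

```python
# Glass box sketch: vary only the prosody stage, keep the rest fixed.
# All names and the scoring rule are invented stand-ins.

def synthesise(text, prosody_rule):
    # Stand-in for a full TTS pipeline; only the prosody stage varies.
    # A real system would return a waveform; we return a symbolic trace.
    phonemes = text.lower().replace(" ", "|")    # fixed linguistic stage
    return f"{phonemes}+{prosody_rule}"          # free prosodic stage

def score(output):
    # Stand-in for a listening test or automatic quality metric.
    return 1.0 if output.endswith("declination") else 0.5

variants = ["flat", "declination"]
results = {rule: score(synthesise("Hello world", rule)) for rule in variants}
best_rule = max(results, key=results.get)
```

Because only the prosody rule differs between the two runs, the difference between `results["flat"]` and `results["declination"]` is, by construction, attributable to that module.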
Recommendation 1
Use a glass box approach if you want diagnostics in order to improve your speech output system.
Recommendation 2
Use a black box approach if you want to assess the overall performance of speech output systems.
The dichotomy between glass box and black box testing is basic to speech output testing, which has led some researchers to propose a strict terminological division whereby ``evaluation'' signifies glass box testing (or: diagnostic evaluation) only, and ``assessment'' is reserved exclusively for black box testing (or: performance evaluation). In this chapter we will use the terms ``evaluation'' and ``assessment'' indiscriminately, and use disambiguating adjectives whenever there is a risk of confusion.
Ideally, any speech output system should perform at the same level of adequacy as a human speaker. Such a system would be optimal for any application. However, given that systems available today are less than optimal, it is important to know which aspects of a system's performance are essential to a specific application. Speech output systems typically form an element of a larger human-machine interface in an application with a specific, dedicated task. In practice this means that, quite probably, the vocabulary and types of information exchanges are restricted and domain-specific, so that situational redundancy is likely to make up for poor intelligibility. On the other hand, speech output systems will often be used in complex information processing tasks, so that the listener has only limited resources available for attending to the speech input. Also, end users may have different attitudes towards, and motivations for, working with artificial speech than subjects in laboratory experiments, especially when the latter have not been explicitly selected so as to be fully representative of the end users. It is often hazardous, therefore, to predict beforehand, on the basis of laboratory tests, how successful a speech output system will be in the practical application. Generally, the more specific aspects an application situation contains, the less laboratory tests can predict field performance. Speech output systems will then have to be tested in the field, i.e. in the real situation, with the real users. The use of field tests will be limited to one system in one specific application; results of a field test cannot, as a rule, be generalised to other systems and/or other applications.
Recommendation 3
Do not rely on laboratory tests alone. As soon as there is a discrepancy between the laboratory setting and the true field situation (in terms of environment, tasks, type of listener), field testing is necessary.
The more complex TTS systems can roughly be divided into a linguistic interface that transforms spelling into an abstract phonological code, and an acoustical interface that transduces this symbolic representation to an audible waveform. The quality of the intermediary representation can be tested directly at the symbolic-linguistic level or indirectly at the level of the acoustic output. Testing the audio output has the advantage that only errors in the symbolic representation that have audible consequences will affect the evaluation. The disadvantage of audio testing is that it involves the use of human listeners, and is therefore costly and time-consuming. Moreover, the results of acoustic testing are unspecific in that the designer is not informed whether the problems originate at the linguistic or at the acoustic level. As an alternative, the intermediate representations in the linguistic interface are often evaluated at the symbolic level. It is, of course, a relatively easy task to compare the symbolic output of a linguistic module with some pre-stored key or model representation and determine the discrepancies, and this is what is normally done. The non-trivial problem is where to obtain the model representations. These will generally have to be compiled manually (or semi-automatically at best), and often involve multiple correct solutions.
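The symbolic comparison described above can be sketched as follows. The transcriptions and phoneme notation are invented for illustration, and taking the minimum error over variant transcriptions is one simple way, among others, of handling multiple correct solutions.

```python
# Symbolic-level evaluation sketch: compare the output of a hypothetical
# grapheme-to-phoneme module against manually compiled model
# transcriptions, using edit distance to get a phoneme error rate.

def edit_distance(ref, hyp):
    # Standard Levenshtein distance over phoneme symbols.
    d = [[i + j if i * j == 0 else 0 for j in range(len(hyp) + 1)]
         for i in range(len(ref) + 1)]
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1]))
    return d[len(ref)][len(hyp)]

def phoneme_error_rate(references, hypothesis):
    # references: all acceptable model transcriptions (variant solutions);
    # score against each and keep the best match.
    return min(edit_distance(r, hypothesis) / len(r) for r in references)

# "either" has two accepted pronunciations; the module produced one of them.
refs = [["iy", "dh", "er"], ["ay", "dh", "er"]]
per = phoneme_error_rate(refs, ["iy", "dh", "er"])
```

Such a comparison runs automatically once the model transcriptions exist; compiling those transcriptions remains the manual bottleneck noted above.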
In the large majority of test procedures human subjects are called upon in order to determine the quality of a speech output system. This should come as no surprise to us, since the end user of a speech output system is a human listener. However, there are certain drawbacks inherent to the use of human subjects. Firstly, humans, whether acting as single individuals or collectively as a group, are always somewhat noisy in their judgments or task performance, i.e. the results of tests involving human responses are never perfectly reliable in the statistical, psychometric sense of the word. Another drawback of tests involving human subjects is that they are time-consuming and therefore expensive to run.
Recent developments, which are still very much in the laboratory stage, seek to replace human evaluation by automatic assessment of speech output systems or modules thereof. Attempts can be (and in fact have been) made to automatically measure the discrepancy in acoustical terms between a system's output and the speech of the human speaker that serves as the model the system is intended to imitate. This is the type of evaluation technique that one would ultimately want to come up with: the use of human listeners is avoided, so that perfectly reproducible noiseless results can be obtained in as little time as it takes a computer to execute the program. At the same time, however, it will be clear that implementation of such techniques as a substitute for human listeners presupposes that we know exactly how human listeners evaluate differences between two realisations of the same linguistic message. Unfortunately, this type of knowledge is largely lacking at the moment; filling the gap would be a research priority. Nevertheless, preliminary automatic comparisons of synthetic and human speech output have been undertaken in the fields of melody and pause distribution (Barry et al. 1989), long-term average spectral characteristics (Pavlovic, Rossi & Espesser 1991), and dynamics of speech in the frequency and time domains (Houtgast & Verhave 1991; Houtgast & Verhave 1992).
Generally, the results obtained through these techniques show sufficient promise to warrant extension of their scope. We will come back to the possibilities of automated testing in a later section.
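As an illustration of the spectral branch of such automated comparisons, the following sketch computes a simple RMS log-spectral distance between the long-term average spectra of two signals. The sine mixtures stand in for a natural and a synthetic recording, and the distance measure is a generic one, not the specific measure used in the studies cited above.

```python
# Automated acoustic comparison sketch: long-term average spectra of a
# "natural" and a "synthetic" signal compared by RMS log-spectral
# distance. The signals are artificial stand-ins for real recordings.
import numpy as np

def long_term_average_spectrum(signal, frame=256):
    # Average power spectrum over consecutive non-overlapping frames.
    n = len(signal) // frame
    frames = signal[:n * frame].reshape(n, frame)
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2
    return power.mean(axis=0)

def log_spectral_distance(sig_a, sig_b):
    a = long_term_average_spectrum(sig_a)
    b = long_term_average_spectrum(sig_b)
    eps = 1e-12  # guard against log of zero
    return float(np.sqrt(np.mean((10 * np.log10(a + eps)
                                  - 10 * np.log10(b + eps)) ** 2)))

t = np.arange(8192) / 8000.0
natural = np.sin(2 * np.pi * 200 * t)
synthetic = np.sin(2 * np.pi * 200 * t) + 0.1 * np.sin(2 * np.pi * 1200 * t)
d_same = log_spectral_distance(natural, natural)
d_diff = log_spectral_distance(natural, synthetic)
```

The measure is zero for identical signals and grows with spectral mismatch; whether a given distance corresponds to a perceptually relevant difference is exactly the open question raised above.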
By judgment testing (also called opinion testing in telecommunication research) we mean a procedure whereby a group of listeners is asked to judge the performance of a speech output system along a number of rating scales. The scales are typically bi-polar adjectives that allow the listeners to express the quality of the output system with respect to more global or more specific aspects of its performance. Although the construction of an appropriate scaling instrument is by no means a trivial task, a scaling test can be administered with little effort and yields a lot of potentially useful information.
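The aggregation step of such a judgment test is straightforward; the sketch below simply averages listener ratings per scale. The scale labels and the ratings are invented, and real studies would of course also report variability and test reliability, which is omitted here.

```python
# Judgment test sketch: five (invented) listeners rate a system on
# bipolar scales from 1 (negative pole) to 5 (positive pole); per-scale
# means summarise the opinion data.

ratings = {
    "natural":      [4, 3, 5, 4, 4],
    "pleasant":     [3, 3, 4, 2, 3],
    "intelligible": [5, 4, 5, 5, 4],
}
mean_opinion = {scale: sum(r) / len(r) for scale, r in ratings.items()}
```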
At the other extreme the speech output can be assessed in terms of how well it actually performs its communicative purpose. This is called functional testing. For instance, if we want to know to what extent the output speech is intelligible, we may prefer to measure its intelligibility not by asking listeners how intelligible they think the speech is, but by determining, for instance, whether listeners correctly identify the sounds. Consider, as an example on a higher level of communication, the assessment of an information system using speech output. We may ask users to judge the output quality, but we may also functionally determine the system's adequacy by looking at task completion: how often and how efficiently do the users get the information from the system that they need?
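The two functional measures just mentioned can be sketched side by side: identification accuracy for individual sounds, and task completion rate for a hypothetical information-system dialogue log. Both the stimuli and the dialogue records are invented for illustration.

```python
# Functional testing sketch: score what listeners actually do, rather
# than what they say about the speech. Data are invented stand-ins.

def identification_accuracy(presented, responses):
    # Proportion of sounds that listeners identified correctly.
    correct = sum(p == r for p, r in zip(presented, responses))
    return correct / len(presented)

def task_completion_rate(dialogues):
    # Each dialogue record marks whether the user obtained the
    # information they needed from the system.
    return sum(d["completed"] for d in dialogues) / len(dialogues)

acc = identification_accuracy(["p", "t", "k", "b"], ["p", "t", "g", "b"])
rate = task_completion_rate([{"completed": True}, {"completed": True},
                             {"completed": False}, {"completed": True}])
```

A fuller functional analysis would also weigh efficiency (how many turns or how much time completion took), not just the binary outcome.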
One would hope that the results of judgment and functional assessments converge. Obviously, one would like to use the results of functional assessments in order to gauge the validity of judgments, rather than the other way about. As far as we have been able to ascertain, there has been little research into this matter. Yet, there is at least one set of intersubjective and functional data that was collected for the same group of listeners and stimuli, testing two different text-to-speech systems at three different points in time, from which it appeared that the scaling results were highly correlated with the corresponding functional test scores (Pavlovic, Rossi & Espesser 1990).
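Checking such convergence amounts to correlating mean rating-scale scores with functional test scores over the same conditions. The sketch below computes a Pearson correlation for six conditions (two systems at three points in time); the numbers are invented and do not reproduce the cited results.

```python
# Convergence check sketch: Pearson correlation between judgment and
# functional scores per condition. All data points are invented.
import math

def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Six conditions: two systems at three points in time (invented data).
judgment_scores = [3.1, 3.4, 3.9, 2.5, 2.8, 3.3]       # mean scale ratings
functional_scores = [0.62, 0.68, 0.78, 0.50, 0.56, 0.66]  # e.g. task scores
r = pearson(judgment_scores, functional_scores)
```

A correlation near 1 would support using the cheaper judgment test as a proxy for the functional one in comparable conditions; a low correlation would caution against it.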
In a sense there is only one ultimate criterion that determines the quality of a speech output system, viz. its overall quality within a given application. Judgment tests usually include one or more rating scales covering such global aspects as ``overall quality'', ``naturalness'' and ``acceptability''. A functional approach to global assessment would be to determine whether users of speech output, when given the choice, choose to work with a machine or with the human original the machine is intended to simulate. Or one may determine if the information exchange is as successful in machine-to-human as it is in human-to-human situations.
On the other hand, one may be interested in determining the quality of specific aspects of a speech output system, in an analytic listening mode, where listeners are requested to pay particular attention to selected aspects of the speech output. Again, both judgment and functional tests can be, and have been, designed addressing the quality of specific aspects of a speech output system. Listeners may be asked, for instance, to rate the clarity of vowels and consonants, the appropriateness of stresses and accents, pleasantness of voice quality, and tempo. Functional tests have been designed to test the intelligibility of individual sounds (phoneme monitoring), of combinations of sounds (syllable monitoring), and of whole words (word monitoring), in isolation as well as in various types of context (e.g. Nusbaum, Greenspan & Pisoni 1986; Ralston et al. 1991).