In this final section we will consider desired developments in the field of speech output testing, and discuss possibilities for further research on a more general level than we did in the preceding sections. The section consists of three parts. In the first part (section ) we will be concerned with possibilities of producing more efficient output testing techniques. The general, longer-term strategy proposed here is to replace expensive, time-intensive tests (involving human listeners in field situations) by cheaper, automated tests carried out in a laboratory setting. In order to make this feasible we will have to establish the predictability relationships between the various types of tests discussed in section . Next, we will discuss (section ) developments that we feel are needed in the assessment of linguistic interfaces of speech output systems. Finally, in section , we will propose research for the mid term aimed at improving speech output evaluation at the acoustic level in each of the four areas identified: segmental quality (section ), prosodic quality (section ), voice characteristics (section ), and overall quality (section ).
The ultimate criterion to decide on the quality of speech output resides with the human listener. Speech output assessment is therefore basically a matter of human perception research. It is commonly acknowledged that the human listener is a noisy measurement instrument, which causes output assessment to be a slow and (therefore) expensive undertaking. There are generally felt to be two ways out of this problem. One is to look for assessment procedures which are optimally efficient, i.e. use perception tasks that are least susceptible to observer noise, and that concentrate on a small set of representative materials from which valid generalisations to all other situations can be made. This line of development has been followed for some time, especially by the SAM consortium, and could fruitfully be extended into the next five years.
The second way out is to replace the human observer by a computer-simulated observer, i.e. to use automated assessment methods. Using automated methods presupposes that we know exactly how human listeners react to speech output. The development of objective methods is therefore necessarily subsequent to the development of human test methods. In those areas of auditory perception where sufficient, consolidated knowledge has been assembled, attempts at computer-simulation can be launched even today, and, in fact, pilot studies have recently been undertaken that show the feasibility of objective testing in selected areas (see section ). The field will have to reach agreement on what further aspects of human perception, relevant to speech output assessment, have evolved to the point that computer-simulation of the human listener can realistically be undertaken. Once such areas have been identified, the next step will be to go ahead and implement them. Candidates that present themselves for automated testing will be:
As a first approximation, such computer-simulations should be tried for single speaker situations. That is to say, speech output should be compared only with ideal human speech produced by the same talker, pronouncing the same materials. Note that we assume that even allophone systems are based on a single model talker, since it is generally ill-advised to try and find average values over a larger group of speakers to control the synthesiser's parameters (Loman & Boves 1993: 159).
Note that since there will always be (slight) differences in timing between speech output and ideal speech, both segmental and melodic assessment will necessarily involve temporal normalisation. The perceptual evaluation of the discrepancies between output speech and the ideal should therefore proceed in --- at least --- two separate stages: first the penalty that is incurred by deviating durations will have to be determined, and only then can we meaningfully consider the penalty for deviating segmental quality (likewise for melodic structure).
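The temporal normalisation step could, for instance, be carried out with dynamic time warping; the source prescribes no particular algorithm, so the following is purely an illustrative sketch. It aligns two invented one-dimensional feature contours that have the same shape but deviating timing, separating the warping effort (the duration penalty) from the residual feature distance.

```python
# Illustrative sketch only: temporal normalisation of output speech against
# an ideal reference via dynamic time warping (DTW). The feature contours
# are invented; real assessment would use frame-wise spectral or pitch
# features.

def dtw(ref, out):
    """Align two 1-D feature sequences; return total residual cost and path."""
    n, m = len(ref), len(out)
    INF = float("inf")
    cost = [[INF] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(ref[i - 1] - out[j - 1])
            cost[i][j] = d + min(cost[i - 1][j],      # reference frame skipped
                                 cost[i][j - 1],      # output frame stretched
                                 cost[i - 1][j - 1])  # one-to-one match
    # Backtrack to recover the warping path.
    path, i, j = [], n, m
    while i > 0 or j > 0:
        path.append((i - 1, j - 1))
        moves = []
        if i > 0 and j > 0:
            moves.append((cost[i - 1][j - 1], i - 1, j - 1))
        if i > 0:
            moves.append((cost[i - 1][j], i - 1, j))
        if j > 0:
            moves.append((cost[i][j - 1], i, j - 1))
        _, i, j = min(moves)
    return cost[n][m], list(reversed(path))

ref = [0.0, 0.0, 1.0, 1.0, 0.0]       # "ideal" contour
out = [0.0, 1.0, 1.0, 1.0, 0.0, 0.0]  # same shape, deviating timing
total, path = dtw(ref, out)
# The amount of warping in `path` reflects the duration penalty; `total` is
# the feature distance that remains after timing differences are factored out.
print(total, len(path))
```

In this toy case the residual distance is zero: the two contours differ only in timing, which is exactly the separation of penalties that the two-stage evaluation above calls for.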
We advocate a two-pronged approach here. The field should concentrate on developing optimally efficient tests involving human listeners, and at the same time begin to work on the development of perceptual distance estimation procedures that can be used later in automated assessment.
There is a paradox involved in the choice between judgment tasks and functional tests. On the one hand, it could well be argued that a speech output system is adequate if a representative user group judges the system to be adequate for its purpose. Why should the field go to more trouble to improve the system's quality if the users profess to be satisfied? On the other hand, we can predict with near certainty that the users will not be able to estimate precisely the level of adequacy needed for the output system to function smoothly in a concrete application. The relationship between judgments and functional test scores has been studied in the context of segmental quality, but so far no similar studies in the field of prosodic quality testing are extant. It would seem a point of immediate concern, therefore, to consider research into the interrelationship between judgments and functional test behaviour, with emphasis on prosodic quality. To what extent do orderings among competing speech output systems, as derived from judgment tests, correspond to orderings derived from functional tests? If we should be able to predict functional test behaviour from judgment test scores, the latter, as a cheaper alternative for functional testing, could be used in all initial stages of speech output assessment. The use of functional testing would then typically be restricted to diagnostic testing.
Generally, one would expect the global quality of a speech output system to be a function of the quality of the various system components. One would like to be able to predict and quantify the overall ratings and global performance measures from the scores on the components through some form of regression analysis. Obviously, if system designers have only limited resources available, they would direct their efforts toward improving the quality of those aspects that contribute most (in terms of regression coefficients) to the overall assessment of their systems. We suggest that research be undertaken in order to address this type of question.
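The kind of regression analysis meant here can be sketched as follows; the component labels, system scores, and ratings are all invented for illustration.

```python
# Illustrative sketch: predicting overall quality ratings from component
# scores with ordinary least squares. All data are invented; a real study
# would use ratings from listening experiments.
import numpy as np

# Rows: systems; columns: segmental, prosodic, voice-quality scores (1-10).
components = np.array([
    [7.0, 5.0, 6.0],
    [8.0, 6.0, 7.0],
    [5.0, 4.0, 5.0],
    [9.0, 7.0, 8.0],
    [6.0, 6.0, 6.0],
])
overall = np.array([6.1, 7.2, 4.6, 8.3, 6.0])  # overall quality ratings

X = np.column_stack([np.ones(len(overall)), components])  # add intercept
coef, *_ = np.linalg.lstsq(X, overall, rcond=None)

# The regression coefficients (coef[1:]) indicate which component
# contributes most to the overall judgment, and hence where limited
# development resources are best spent.
predicted = X @ coef
print(coef[1:], predicted)
```

With real data the fit would of course be noisier; the point is only that the relative size of the coefficients operationalises "contributes most to the overall assessment".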
There is general agreement by now that laboratory tests such as are available today do not allow a useful prediction of how well a speech output system will perform in a concrete application. A short-term recommendation is, therefore, to develop a field-test generator, along the same lines as the successful test generators for laboratory intelligibility tests (such as the CLID and SUS tests developed by the SAM consortium). The field-test generator should enable the fast compilation of test materials and adequate simulation of a range of application conditions. For this purpose, an adequate cross-section of applications for speech output has to be inventoried and parametrised along such dimensions as (1) type of users (non-cooperative, children, elderly people, non-native language users), (2) specific aspects of the situation in terms of, for instance, noise, reverberation, telephone channel, and (3) secondary tasks.
On a longer-term basis we advocate a more fundamental solution to the problem of field testing. Ideally, of course, one should not have to go into the field every time a new application presents itself. Rather, one would like to be able to predict accurately, on the basis of available results of standard laboratory tests (e.g. intelligibility scores and prosodic adequacy profiles), how a speech output system would perform in a concrete field situation. For this to be the case, it will be necessary to have a valid analysis of the field tasks that have to be accomplished. A task profile will have to be drawn up that analyses the demands that carrying out the task (including and excluding listening to speech output) makes on the user, such as attentional load of the primary task, environmental noise, negative influence of fatigue and boredom, physical strain, etc. Accomplishing this type of prediction calls for cooperation between experts in the field of speech quality assessment and in human factors studies. We recommend exploratory studies along the lines suggested above, based on quantitative task analyses of a few selected applications.
Generally, we feel that the development and testing of the higher-order linguistic modules of speech output systems should be left to language technology experts. A reasonable division of work would be for speech technology to deal with the linguistic modules that are specific to TTS-applications, i.e. text preprocessing and grapheme-phoneme conversion (including stress position, accent placement and boundary marking). Other linguistic tasks, such as morphological analysis and syntactic parsing, are common to other branches of linguistic engineering (e.g. grammar checking, automatic translation), which have far more resources and manpower available. However, even if this division of work could be effected, one would like to see attempts made towards early separation of consequential vs. inconsequential errors in word and sentence parsers. Consequential symbolic errors will audibly affect the (quality of the) acoustic output, whereas inconsequential errors are not reflected at the audio level. This means that part of speech output testing should still be concerned with the higher-order linguistic modules.
We would advocate a more detailed analysis of the various tasks a text-preprocessor has to perform, focussing on those classes of difficulties that crop up in any (European) language. Procedures should be devised that automatically extract representative items from large collections of recent text (newspapers) in each of the relevant error categories, so that multi-lingual tests can be set up efficiently. Once the test materials have been selected, the correct solutions to, for instance, expansion problems can be extracted from existing databases or, when missing there, will have to be entered manually.
A short-term recommendation is to develop multi-lingual machine-readable pronouncing dictionaries at the single word level which list permissible variations. Comparisons of algorithmic output with the model transcriptions require the development of adequate string alignment procedures. Moreover, not all discrepancies found contribute equally to the overall evaluation. Distance metrics should be developed that allow us to express the differences between two segmentally different phonemic transcriptions in terms of meaningful perceptual distance. Recent work by Cucchiarini (1993) could serve as a starting point.
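The combination of string alignment and a perceptual distance metric could take the form of a weighted edit distance, sketched below; the phone symbols and substitution costs are invented, whereas a real metric would be grounded in measured perceptual distances.

```python
# Illustrative sketch: aligning an algorithmic phonemic transcription with
# a model transcription via weighted edit distance. The symbols and costs
# are invented; perceptually similar phones get cheaper substitutions.

SUB_COST = {  # hypothetical perceptual substitution costs (symmetric)
    ("t", "d"): 0.3, ("s", "z"): 0.3, ("i", "I"): 0.2,
}

def sub_cost(a, b):
    if a == b:
        return 0.0
    return SUB_COST.get((a, b)) or SUB_COST.get((b, a)) or 1.0

def align_cost(model, output, indel=1.0):
    """Minimal weighted edit cost between two phone-symbol sequences."""
    n, m = len(model), len(output)
    d = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        d[i][0] = i * indel
    for j in range(1, m + 1):
        d[0][j] = j * indel
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d[i][j] = min(d[i - 1][j] + indel,      # phone deleted
                          d[i][j - 1] + indel,      # phone inserted
                          d[i - 1][j - 1] + sub_cost(model[i - 1], output[j - 1]))
    return d[n][m]

# A voicing confusion is penalised less than an arbitrary substitution:
print(align_cost(list("sit"), list("zit")))   # s/z: 0.3
print(align_cost(list("sit"), list("fit")))   # s/f: 1.0
```

The design point is that a plain symbol-match distance would score both discrepancies equally, whereas a perceptually weighted metric distinguishes audibly minor from audibly gross transcription errors.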
The correctness of most symbolic output can only be determined on the basis of connected text at the sentence level. What is sorely needed, therefore, is the availability of large, multi-lingual text corpora with full phonemic annotation, including not only the permissible pronunciation(s) of the words, together with the effects of assimilation across word boundaries and stress shifts, but also the indication of accent positions (and degrees of accent), prosodic boundaries (with break indices of various strengths), and some intonation transcription. Moreover, since these corpora will also have to be used for testing morphological and syntactic parsing, hierarchical word and sentence structure should be indicated; or at least provisions should be made for linguists to enter this type of information at a later stage. The development of such corpora, however, is beyond the competence of the speech output subgroup, and is best left to the text corpora subgroup. We refer to the relevant chapters on database development elsewhere in this volume for a report on what has recently been accomplished in this area, and what work has still to be done.
We recommend the development of procedures for strictly modular testing of linguistic interfaces. This means that test materials have to be made available that are specific to each individual module in the linguistic interface. Each module should be given correct input strings, and the correct output string(s) for only the module at hand should be provided. Only in this way can we eliminate the problem of percolating and compounding of errors made by earlier modules. Obviously, such procedures can only be effective if the databases referred to in the previous paragraph contain representations of the correct strings at each of the levels addressed by the various modules.
With the availability of cheap mass memory, the need for highly intricate linguistic interfaces is less strongly felt than some years ago. Rather than computing the phonological code that is to be fed to the acoustic modules, the correct code is simply looked up in large lexicons included in the speech output system. If this trend continues, the emphasis of our research efforts will shift from rule development (and testing) to collecting databases. Database collection and annotation will take place regardless of the direction that the field takes in this matter. If choices have to be made, money is spent most safely on the development of corpora, but only if a multi-lingual notation format can be found that can be used for the transcription of segments and prosodic features of all languages dealt with.
Although less important at the isolated word level, it will remain necessary to test grapheme-phoneme conversion, as well as the output of post-lexical rules (which change the pronunciation of words in connected speech, e.g. through assimilation). Also, testing grapheme-phoneme conversion will remain applicable in the development of cheap speech output systems (such as Multivox and Apollo), which neither access large lexicons nor perform sophisticated linguistic analyses of the input text.
With a few provisos (see below) there is general consensus that the procedures for testing segmental quality of speech output systems are more or less fully developed (cf. section under DRT/MRT, CLID and SUS Tests). Under the auspices of the SAM consortium, efficient test generators have been developed that enable the construction of a large variety of tests that allow quick standardised administration and data analysis of consonant and vowel intelligibility scores, both for isolated word intelligibility and for intelligibility of words in (semantically unpredictable) context. These tools will be very useful in the testing of even the latest generation of parametric synthesisers. However, the upcoming generation of waveform synthesisers (PSOLA based) will have segmental quality that will be tough to discriminate from human speech. Though it may be possible to further refine the discriminatory power of our test procedures, one may well wonder what purpose would be served by such endeavours. A reasonable alternative view would be to consider the quality of waveform concatenation speech output equivalent to the human ideal (if indeed the test shows that no intelligibility difference remains) and leave the matter at that.
A short-term recommendation concerns the quality of segments in unstressed syllables. It has rightfully been pointed out by, for instance, Van Santen (1993) that most segmental quality tests consider monosyllabic (or ``minisyllabic'') words only. There is a risk involved here that insufficient attention is being paid to the quality of unstressed syllables in longer words. The same, of course, holds true of the quality assessment of (unstressable) function words. Unstressed syllables are generally reduced in human speech, and synthesis-by-rule systems have often neglected to carefully model the reduction processes. In concatenative synthesis, the problem of unstressed syllables can be solved by enlarging the set of (normally unreduced) acoustic building blocks with a parallel set of reduced building blocks (cf. Drullman & Collier 1993). The testing problem that crops up in this connection presents an important perceptual question addressing the interaction between segmental and prosodic quality: if unstressed syllables are overarticulated, as would be the case when the reduction processes are not adequately modelled in our synthesis, does the resulting speech output get more intelligible, or does its intelligibility deteriorate? One might predict that, although the identifiability of each individual segment may decrease when reduction is truthfully mimicked, the overall intelligibility, in terms of word scores, will increase, reasoning that the rhythmic structure of words showing natural gradation of strong and weak syllables might be more important to word identifiability than optimal identifiability of each individual phoneme.
On a more general note, we suggest that serious attention be paid to differences in the contribution made to the overall intelligibility of words by the various constituent segments. It is important that we learn to what extent word intelligibility depends on identifying vowels versus consonants, in stressed versus unstressed syllables, in onset, medial, and final position, in short and longer words. Psycholinguistic studies on auditory word recognition have shown that, indeed, stressed segments --- because of their greater inherent loudness and duration --- have a better chance of contributing to the recognition process, as do segments early in the word. Ideally, we would like to be able to predict the intelligibility of an arbitrary selection of words from the lexicon of a language, just by looking at the identification scores of the constituent vowels and consonants in unpredictable words (i.e. segment strings that are phonotactically legal and may be lexical words or nonsense strings).
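The prediction envisaged here can be illustrated with a deliberately naive baseline model: word intelligibility as the product of independent per-segment identification probabilities. All scores are invented, and the independence assumption ignores exactly the positional, stress, and lexical-redundancy effects the paragraph argues we need to learn about; refining this baseline with such effects is the research question.

```python
# Naive illustrative baseline: estimating word intelligibility from the
# identification scores of the constituent segments, assuming independence.
# All numbers are invented (hypothetical scores from a CLID-style test).

SEGMENT_SCORE = {"b": 0.95, "I": 0.90, "t": 0.85, "s": 0.92}

def predict_word_score(phonemes):
    """Probability that every segment is identified (independence assumption)."""
    p = 1.0
    for ph in phonemes:
        p *= SEGMENT_SCORE[ph]
    return p

print(round(predict_word_score(["b", "I", "t"]), 3))       # 0.95 * 0.90 * 0.85
print(round(predict_word_score(["b", "I", "t", "s"]), 3))  # one more segment
```

Under this baseline longer words are always harder, and every segment counts equally; the psycholinguistic findings cited above (stressed and word-initial segments contributing more) are precisely what a serious predictive model would have to add as position- and stress-dependent weights.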
With the advent of high-quality segmental speech output (section ) a shift from segmental quality testing to prosody seems imminent. It is obvious that there is still a long way ahead of us before the evaluation of prosody will get full coverage. What is needed is a careful taxonomy of prosodic functions at all linguistic and pragmatic levels. We suggest, therefore, that the first priority should be to chart out all the prosodic functions relevant to human-machine communication. We need to know not only what functions are fulfilled by prosody, but also what the communicative importance of each specific function is (if any). Once a reasonably complete view of relevant prosodic functions has been obtained, attempts should be made at defining adequate tests in order to determine to what extent each function is expressed by the speech output system.
It will be difficult to separate the evaluation of prosodic forms from their communicative functions, and perhaps such a dissociation is not even necessary. It seems reasonable to assume that a prosodic feature fulfils its communicative function better as its formal properties are closer to the human model. If this relationship holds, we would not have to test the formal adequacy of speech timing and melody rules in abstraction from their communicative functions. Once we know the communicative function of each formal prosodic distinction, the prosodic quality of speech output systems can be measured by the effectiveness with which each of the communicative functions is signalled to the human listener. For these reasons we suggest that functional testing of prosody be given priority. Whatever audible flaws remain after the communicative functions have been shown to be signalled as effectively as in human speech, will have to be addressed in a later stage, using judgment tasks.
We propose that the emphasis should be on the functions of prosody, rather than on the details of prosodic form. Our point of departure, for the time being, is that the formal aspects of prosody cannot be too far off the mark if the prosodic functions are all adequately fulfilled. This should not be interpreted in the sense that we consider the details of prosodic form (such as exact pitch movements and timing) unimportant. In fact, there is every reason to believe that prosodic functions such as accentuation are only adequately expressed by very narrowly defined (in terms of direction, excursion size, and segmental alignment) language-specific pitch movements. In this context it seems obvious that adequate prosodic functioning can only be guaranteed if speech output systems are capable of synthesising not only accents and boundaries, but also more subtle degrees within such categories. For instance, the adequacy of prosodic boundary markings should be tested at least at four levels of depth: strong and weaker boundaries within the sentence, as well as sentence and paragraph boundaries, which are signalled in parallel by melody, temporal organisation, and (possibly even) intensity.
Generally, we believe that the identification of the prosodic functions (including the expression of emotion) to be tested presents a greater problem than devising tests to determine the functional adequacy of prosody once a particular function has been identified. Still, choices will have to be made as to what particular test methodology to adopt. We propose that a pilot study be initiated to examine the pros and cons of the various tests used in the experimental phonetic and psycholinguistic literature (as outlined in section ) that seem relevant to this matter.
As a consequence of claiming priority for prosodic functions, the development of (multi-lingual) prosodic form tests (and test generators) should be postponed until some later stage.
It would appear that the evaluation of voice quality is going to be a matter of increasing concern. Developers of personalised voice speech output will need test procedures in order to determine how convincingly their systems mimic the quality of the model's voice. Simple same-different testing (Is it Ella? Or is it Memorex?) will not do, since developers will need the evaluation as a diagnostic tool. We suggest that a test tool be developed that enables the efficient drawing up of voice quality profiles (cf. section ).
Apart from the development of personalised voice synthesis, the voice quality of general purpose speech output systems will get a lot more attention in the coming decade. With the improvement of segmental, and to a lesser extent, prosodic quality of speech output, the need for more natural and pleasant voice quality will be strongly felt. It will be a concern for the evaluation field to develop test procedures in order to determine the appropriateness of voice quality for speech output in general and for specific applications (e.g. alert messages).
Now that the quality of speech output systems gets closer to that of human speech, assessment should concentrate on other aspects of quality testing than linguistic functions. Synthetic speech may be virtually equivalent to human speech in all aspects, and still be lacking in certain subtle qualities. This aspect of speech output testing should be considered in a special study, looking at the effects of listening to synthetic speech in terms of fatigue and allocation of attention to secondary tasks (cf. section ). The development of efficient multi-lingual test generators addressing this aspect would be a welcome addition to our arsenal.