In this final section we will consider desired developments in the field of speech output testing, and discuss possibilities for further research on a more general level than we did in the preceding sections. The section consists of three parts. In the first part (section ) we will be concerned with possibilities of producing more efficient output testing techniques. The general, longer-term strategy proposed here is to replace expensive, time-intensive tests (involving human listeners in field situations) by cheaper, automated tests carried out in a laboratory setting. In order to make this feasible we will have to establish the predictability relationships between the various types of tests discussed in section . Next, we will discuss (section ) developments that we feel are needed in the assessment of linguistic interfaces of speech output systems. Finally, in section , we will propose research for the mid term aimed at improving speech output evaluation at the acoustic level in each of the four areas identified: segmental quality (section ), prosodic quality (section ), voice characteristics (section ), and overall quality (section ).
The ultimate criterion to decide on the quality of speech output resides with the human listener. Speech output assessment is therefore basically a matter of human perception research. It is commonly acknowledged that the human listener is a noisy measurement instrument, which causes output assessment to be a slow and (therefore) expensive undertaking. There are generally felt to be two ways out of this problem. One is to look for assessment procedures which are optimally efficient, i.e. use perception tasks that are least susceptible to observer noise, and that concentrate on a small set of representative materials from which valid generalisations to all other situations can be made. This line of development has been followed for some time, especially by the SAM consortium, and could fruitfully be extended into the next five years.
The second way out is to replace the human observer by a computer-simulated observer, i.e. to use automated assessment methods. Using automated methods presupposes that we know exactly how human listeners react to speech output. The development of objective methods is therefore necessarily subsequent to the development of human test methods. In those areas of auditory perception where sufficient, consolidated knowledge has been assembled, attempts at computer-simulation can be launched even today, and, in fact, pilot studies have recently been undertaken that show the feasibility of objective testing in selected areas (see section ). The field will have to reach agreement on what further aspects of human perception, relevant to speech output assessment, have evolved to the point that computer-simulation of the human listener can realistically be undertaken. Once such areas have been identified, the next step will be to go ahead and implement them. Candidates that present themselves for automated testing will be:
As a first approximation, such computer-simulations should be tried for single speaker situations. That is to say, speech output should be compared only with ideal human speech produced by the same talker, pronouncing the same materials. Note that we assume that even allophone systems are based on a single model talker, since it is generally ill-advised to try and find average values over a larger group of speakers to control the synthesiser's parameters (Loman & Boves 1993: 159).
Note that since there will always be (slight) differences in timing between speech output and ideal speech, both segmental and melodic assessment will necessarily involve temporal normalisation. The perceptual evaluation of the discrepancies between output speech and the ideal should therefore proceed in --- at least --- two separate stages: first the penalty that is incurred by deviating durations will have to be determined, and only then can we meaningfully consider the penalty for deviating segmental quality (likewise for melodic structure).
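The temporal normalisation step could, for instance, be carried out with dynamic time warping; the source prescribes no particular algorithm, so the following is purely an illustrative sketch. It aligns two invented one-dimensional feature contours that have the same shape but deviating timing, separating the warping effort (the duration penalty) from the residual feature distance.

```python
# Illustrative sketch only: temporal normalisation of output speech against
# an ideal reference via dynamic time warping (DTW). The feature contours
# are invented; real assessment would use frame-wise spectral or pitch
# features.

def dtw(ref, out):
    """Align two 1-D feature sequences; return total residual cost and path."""
    n, m = len(ref), len(out)
    INF = float("inf")
    cost = [[INF] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(ref[i - 1] - out[j - 1])
            cost[i][j] = d + min(cost[i - 1][j],      # reference frame skipped
                                 cost[i][j - 1],      # output frame stretched
                                 cost[i - 1][j - 1])  # one-to-one match
    # Backtrack to recover the warping path.
    path, i, j = [], n, m
    while i > 0 or j > 0:
        path.append((i - 1, j - 1))
        moves = []
        if i > 0 and j > 0:
            moves.append((cost[i - 1][j - 1], i - 1, j - 1))
        if i > 0:
            moves.append((cost[i - 1][j], i - 1, j))
        if j > 0:
            moves.append((cost[i][j - 1], i, j - 1))
        _, i, j = min(moves)
    return cost[n][m], list(reversed(path))

ref = [0.0, 0.0, 1.0, 1.0, 0.0]       # "ideal" contour
out = [0.0, 1.0, 1.0, 1.0, 0.0, 0.0]  # same shape, deviating timing
total, path = dtw(ref, out)
# The amount of warping in `path` reflects the duration penalty; `total` is
# the feature distance that remains after timing differences are factored out.
print(total, len(path))
```

In this toy case the residual distance is zero: the two contours differ only in timing, which is exactly the separation of penalties that the two-stage evaluation above calls for.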
We advocate a two-pronged approach here. The field should concentrate on developing optimally efficient tests involving human listeners, and at the same time begin to work on the development of perceptual distance estimation procedures that can be used later in automated assessment.
There is a paradox involved in the choice between judgment tasks and functional tests. On the one hand, it could well be argued that a speech output system is adequate if a representative user group judges the system to be adequate for its purpose. Why should the field go to more trouble to improve the system's quality if the users profess to be satisfied? On the other hand, we can predict with near certainty that the users will not be able to estimate precisely the level of adequacy needed for the output system to function smoothly in a concrete application. The relationship between judgments and functional test scores has been studied in the context of segmental quality, but so far no similar studies in the field of prosodic quality testing are extant. It would seem a point of immediate concern, therefore, to consider research into the interrelationship between judgments and functional test behaviour, with emphasis on prosodic quality. To what extent do orderings among competing speech output systems, as derived from judgment tests, correspond to orderings derived from functional tests? If we should be able to predict functional test behaviour from judgment test scores, the latter, as a cheaper alternative for functional testing, could be used in all initial stages of speech output assessment. The use of functional testing would then typically be restricted to diagnostic testing.
Generally, one would expect the global quality of a speech output system to be a function of the quality of the various system components. One would like to be able to predict and quantify the overall ratings and global performance measures from the scores on the components through some form of regression analysis. Obviously, if system designers have only limited resources available, they would direct their efforts toward improving the quality of those aspects that contribute most (in terms of regression coefficients) to the overall assessment of their systems. We suggest that research be undertaken in order to address this type of question.
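The kind of regression analysis meant here can be sketched as follows; the component labels, system scores, and ratings are all invented for illustration.

```python
# Illustrative sketch: predicting overall quality ratings from component
# scores with ordinary least squares. All data are invented; a real study
# would use ratings from listening experiments.
import numpy as np

# Rows: systems; columns: segmental, prosodic, voice-quality scores (1-10).
components = np.array([
    [7.0, 5.0, 6.0],
    [8.0, 6.0, 7.0],
    [5.0, 4.0, 5.0],
    [9.0, 7.0, 8.0],
    [6.0, 6.0, 6.0],
])
overall = np.array([6.1, 7.2, 4.6, 8.3, 6.0])  # overall quality ratings

X = np.column_stack([np.ones(len(overall)), components])  # add intercept
coef, *_ = np.linalg.lstsq(X, overall, rcond=None)

# The regression coefficients (coef[1:]) indicate which component
# contributes most to the overall judgment, and hence where limited
# development resources are best spent.
predicted = X @ coef
print(coef[1:], predicted)
```

With real data the fit would of course be noisier; the point is only that the relative size of the coefficients operationalises "contributes most to the overall assessment".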
There is general agreement by now that laboratory tests such as are available today do not allow a useful prediction of how well a speech output system will perform in a concrete application. A short-term recommendation is, therefore, to develop a field-test generator, along the same lines as the successful test generators for laboratory intelligibility tests (such as the CLID and SUS tests developed by the SAM consortium). The field-test generator should enable the fast compilation of test materials and adequate simulation of a range of application conditions. For this purpose, an adequate cross-section of applications for speech output has to be inventoried and parametrised along such dimensions as (1) type of users (non-cooperative, children, elderly people, non-native language users), (2) specific aspects of the situation in terms of, for instance, noise, reverberation, telephone channel, and (3) secondary tasks.
On a longer-term basis we advocate a more fundamental solution to the problem of field testing. Ideally, of course, one should not have to go into the field every time a new application presents itself. Rather, one would like to be able to predict accurately, on the basis of available results of standard laboratory tests (e.g. intelligibility scores and prosodic adequacy profiles), how a speech output system would perform in a concrete field situation. For this to be the case, it will be necessary to have a valid analysis of the field tasks that have to be accomplished. A task profile will have to be drawn up that analyses the demands that carrying out the task (including and excluding listening to speech output) makes on the user, such as attentional load of the primary task, environmental noise, negative influence of fatigue and boredom, physical strain, etc. Accomplishing this type of prediction calls for cooperation between experts in the field of speech quality assessment and in human factors studies. We recommend exploratory studies along the lines suggested above, based on quantitative task analyses of a few selected applications.
Generally, we feel that the development and testing of the higher-order linguistic modules of speech output systems should be left to language technology experts. A reasonable division of work would be for speech technology to deal with the linguistic modules that are specific to TTS-applications, i.e. text preprocessing and grapheme-phoneme conversion (including stress position, accent placement and boundary marking). Other linguistic tasks, such as morphological analysis and syntactic parsing, are common to other branches of linguistic engineering (e.g. grammar checking, automatic translation), which have far more resources and manpower available. However, even if this division of work could be effected, one would like to see attempts made towards early separation of consequential vs. inconsequential errors in word and sentence parsers. Consequential symbolic errors will audibly affect the (quality of the) acoustic output, whereas inconsequential errors are not reflected at the audio level. This means that part of speech output testing should still be concerned with the higher-order linguistic modules.
We would advocate a more detailed analysis of the various tasks a text-preprocessor has to perform, focussing on those classes of difficulties that crop up in any (European) language. Procedures should be devised that automatically extract representative items from large collections of recent text (newspapers) in each of the relevant error categories, so that multi-lingual tests can be set up efficiently. Once the test materials have been selected, the correct solutions to, for instance, expansion problems can be extracted from existing databases or, when missing there, will have to be entered manually.
A short-term recommendation is to develop multi-lingual machine-readable pronouncing dictionaries at the single word level which list permissible variations. Comparisons of algorithmic output with the model transcriptions require the development of adequate string alignment procedures. Moreover, not all discrepancies found contribute equally to the overall evaluation. Distance metrics should be developed that allow us to express the differences between two segmentally different phonemic transcriptions in terms of meaningful perceptual distance. Recent work by Cucchiarini (1993) could serve as a starting point.
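The combination of string alignment and a perceptual distance metric could take the form of a weighted edit distance, sketched below; the phone symbols and substitution costs are invented, whereas a real metric would be grounded in measured perceptual distances.

```python
# Illustrative sketch: aligning an algorithmic phonemic transcription with
# a model transcription via weighted edit distance. The symbols and costs
# are invented; perceptually similar phones get cheaper substitutions.

SUB_COST = {  # hypothetical perceptual substitution costs (symmetric)
    ("t", "d"): 0.3, ("s", "z"): 0.3, ("i", "I"): 0.2,
}

def sub_cost(a, b):
    if a == b:
        return 0.0
    return SUB_COST.get((a, b)) or SUB_COST.get((b, a)) or 1.0

def align_cost(model, output, indel=1.0):
    """Minimal weighted edit cost between two phone-symbol sequences."""
    n, m = len(model), len(output)
    d = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        d[i][0] = i * indel
    for j in range(1, m + 1):
        d[0][j] = j * indel
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d[i][j] = min(d[i - 1][j] + indel,      # phone deleted
                          d[i][j - 1] + indel,      # phone inserted
                          d[i - 1][j - 1] + sub_cost(model[i - 1], output[j - 1]))
    return d[n][m]

# A voicing confusion is penalised less than an arbitrary substitution:
print(align_cost(list("sit"), list("zit")))   # s/z: 0.3
print(align_cost(list("sit"), list("fit")))   # s/f: 1.0
```

The design point is that a plain symbol-match distance would score both discrepancies equally, whereas a perceptually weighted metric distinguishes audibly minor from audibly gross transcription errors.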
The correctness of most symbolic output can only be determined on the basis of connected text at the sentence level. What is sorely needed, therefore, is the availability of large, multi-lingual text corpora with full phonemic annotation, including not only the permissible pronunciation(s) of the words, together with the effects of assimilation across word boundaries and stress shifts, but also the indication of accent positions (and degrees of accent), prosodic boundaries (with break indices of various strengths), and some intonation transcription. Moreover, since these corpora will also have to be used for testing morphological and syntactic parsing, hierarchical word and sentence structure should be indicated; or at least provisions should be made for linguists to enter this type of information at a later stage. The development of such corpora, however, is beyond the competence of the speech output subgroup, and is best left to the text corpora subgroup. We refer to the relevant chapters on database development elsewhere in this volume for a report on what has recently been accomplished in this area, and what work has still to be done.
We recommend the development of procedures for strictly modular testing of linguistic interfaces. This means that test materials have to be made available that are specific to each individual module in the linguistic interface. Each module should be given correct input strings, and the correct output string(s) for only the module at hand should be provided. Only in this way can we eliminate the problem of percolating and compounding of errors made by earlier modules. Obviously, such procedures can only be effective if the databases referred to in the previous paragraph contain representations of the correct strings at each of the levels addressed by the various modules.
With the availability of cheap mass memory, the need for highly intricate linguistic interfaces is less strongly felt than some years ago. Rather than computing the phonological code that is to be fed to the acoustic modules, the correct code is simply looked up in large lexicons included in the speech output system. If this trend continues, the emphasis of our research efforts will shift from rule development (and testing) to collecting databases. Database collection and annotation will take place regardless of the direction that the field takes in this matter. If choices have to be made, money is spent most safely on the development of corpora, but only if a multi-lingual notation format can be found that can be used for the transcription of segments and prosodic features of all languages dealt with.
Although less important at the isolated word level, it will remain necessary to test grapheme-phoneme conversion, as well as the output of post-lexical rules (which change the pronunciation of words in connected speech, e.g. through assimilation). Also, testing grapheme-phoneme conversion will remain applicable in the development of cheap speech output systems (such as Multivox and Apollo), which neither access large lexicons nor perform sophisticated linguistic analyses of the input text.
With a few provisos (see below) there is general consensus that the procedures for testing segmental quality of speech output systems are more or less fully developed (cf. section under DRT/MRT, CLID and SUS Tests). Under the auspices of the SAM consortium, efficient test generators have been developed that enable the construction of a large variety of tests that allow quick standardised administration and data analysis of consonant and vowel intelligibility scores, both for isolated word intelligibility and for intelligibility of words in (semantically unpredictable) context. These tools will be very useful in the testing of even the latest generation of parametric synthesisers. However, the upcoming generation of waveform synthesisers (PSOLA based) will have segmental quality that will be tough to discriminate from human speech. Though it may be possible to further refine the discriminatory power of our test procedures, one may well wonder what purpose would be served by such endeavours. A reasonable alternative view would be to consider the quality of waveform concatenation speech output equivalent to the human ideal (if indeed the test shows that no intelligibility difference remains) and leave the matter at that.
A short-term recommendation concerns the quality of segments in unstressed syllables. It has rightfully been pointed out by, for instance, Van Santen (1993) that most segmental quality tests consider monosyllabic (or ``minisyllabic'') words only. There is a risk involved here that insufficient attention is being paid to the quality of unstressed syllables in longer words. The same, of course, holds true of the quality assessment of (unstressable) function words. Unstressed syllables are generally reduced in human speech, and synthesis-by-rule systems have often neglected to carefully model the reduction processes. In concatenative synthesis, the problem of unstressed syllables can be solved by enlarging the set of (normally unreduced) acoustic building blocks with a parallel set of reduced building blocks (cf. Drullman & Collier 1993). The testing problem that crops up in this connection presents an important perceptual question addressing the interaction between segmental and prosodic quality: if unstressed syllables are overarticulated, as would be the case when the reduction processes are not adequately modelled in our synthesis, does the resulting speech output get more intelligible, or does its intelligibility deteriorate? One might predict that, although the identifiability of each individual segment may decrease when reduction is truthfully mimicked, the overall intelligibility, in terms of word scores, will increase, reasoning that the rhythmic structure of words showing natural gradation of strong and weak syllables might be more important to word identifiability than optimal identifiability of each individual phoneme.
On a more general note, we suggest that serious attention be paid to differences in the contribution made to the overall intelligibility of words by the various constituent segments. It is important that we learn to what extent word intelligibility depends on identifying vowels versus consonants, in stressed versus unstressed syllables, in onset, medial, and final position, in short and longer words. Psycholinguistic studies on auditory word recognition have shown that, indeed, stressed segments --- because of their greater inherent loudness and duration --- have a better chance of contributing to the recognition process, as do segments early in the word. Ideally, we would like to be able to predict the intelligibility of an arbitrary selection of words from the lexicon of a language, just by looking at the identification scores of the constituent vowels and consonants in unpredictable words (i.e. segment strings that are phonotactically legal and may be lexical words or nonsense strings).
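The prediction envisaged here can be illustrated with a deliberately naive baseline model: word intelligibility as the product of independent per-segment identification probabilities. All scores are invented, and the independence assumption ignores exactly the positional, stress, and lexical-redundancy effects the paragraph argues we need to learn about; refining this baseline with such effects is the research question.

```python
# Naive illustrative baseline: estimating word intelligibility from the
# identification scores of the constituent segments, assuming independence.
# All numbers are invented (hypothetical scores from a CLID-style test).

SEGMENT_SCORE = {"b": 0.95, "I": 0.90, "t": 0.85, "s": 0.92}

def predict_word_score(phonemes):
    """Probability that every segment is identified (independence assumption)."""
    p = 1.0
    for ph in phonemes:
        p *= SEGMENT_SCORE[ph]
    return p

print(round(predict_word_score(["b", "I", "t"]), 3))       # 0.95 * 0.90 * 0.85
print(round(predict_word_score(["b", "I", "t", "s"]), 3))  # one more segment
```

Under this baseline longer words are always harder, and every segment counts equally; the psycholinguistic findings cited above (stressed and word-initial segments contributing more) are precisely what a serious predictive model would have to add as position- and stress-dependent weights.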
With the advent of high-quality segmental speech output (section ) a shift from segmental quality testing to prosody seems imminent. It is obvious that there is still a long way ahead of us before the evaluation of prosody will get full coverage. What is needed is a careful taxonomy of prosodic functions at all linguistic and pragmatic levels. We suggest, therefore, that the first priority should be to chart out all the prosodic functions relevant to human-machine communication. We need to know not only what functions are fulfilled by prosody, but also what the communicative importance of each specific function is (if any). Once a reasonably complete view of relevant prosodic functions has been obtained, attempts should be made at defining adequate tests in order to determine to what extent each function is expressed by the speech output system.
It will be difficult to separate the evaluation of prosodic forms from their communicative functions, and perhaps such a dissociation is not even necessary. It seems reasonable to assume that a prosodic feature fulfils its communicative function better as its formal properties are closer to the human model. If this relationship holds, we would not have to test the formal adequacy of speech timing and melody rules in abstraction from their communicative functions. Once we know the communicative function of each formal prosodic distinction, the prosodic quality of speech output systems can be measured by the effectiveness with which each of the communicative functions is signalled to the human listener. For these reasons we suggest that functional testing of prosody be given priority. Whatever audible flaws remain after the communicative functions have been shown to be signalled as effectively as in human speech, will have to be addressed in a later stage, using judgment tasks.
We propose that the emphasis should be on the functions of prosody, rather than on the details of prosodic form. Our point of departure, for the time being, is that the formal aspects of prosody cannot be too far off the mark if the prosodic functions are all adequately fulfilled. This should not be interpreted in the sense that we consider the details of prosodic form (such as exact pitch movements and timing) unimportant. In fact, there is every reason to believe that prosodic functions such as accentuation are only adequately expressed by very narrowly defined (in terms of direction, excursion size, and segmental alignment) language-specific pitch movements. In this context it seems obvious that adequate prosodic functioning can only be guaranteed if speech output systems are capable of synthesising not only accents and boundaries, but also more subtle degrees within such categories. For instance, the adequacy of prosodic boundary markings should be tested at least at four levels of depth: strong and weaker boundaries within the sentence, as well as sentence and paragraph boundaries, which are signalled in parallel by melody, temporal organisation, and (possibly even) intensity.
Generally, we believe that the identification of the prosodic functions (including the expression of emotion) to be tested presents a greater problem than devising tests to determine the functional adequacy of prosody once a particular function has been identified. Still, choices will have to be made as to what particular test methodology to adopt. We propose that a pilot study be initiated to examine the pros and cons of the various tests used in the experimental phonetic and psycholinguistic literature (as outlined in section ) that seem relevant to this matter.
As a consequence of claiming priority for prosodic functions, the development of (multi-lingual) prosodic form tests (and test generators) should be postponed until some later stage.
It would appear that the evaluation of voice quality is going to be a matter of increasing concern. Developers of personalised voice speech output will need test procedures in order to determine how convincingly their systems mimic the quality of the model's voice. Simple same-different testing (Is it Ella? Or is it Memorex?) will not do, since developers will need the evaluation as a diagnostic tool. We suggest that a test tool be developed that enables the efficient drawing up of voice quality profiles (cf. section ).
Apart from the development of personalised voice synthesis, the voice quality of general purpose speech output systems will get a lot more attention in the coming decade. With the improvement of segmental, and to a lesser extent, prosodic quality of speech output, the need for more natural and pleasant voice quality will be strongly felt. It will be a concern for the evaluation field to develop test procedures in order to determine the appropriateness of voice quality for speech output in general and for specific applications (e.g. alert messages).
Now that the quality of speech output systems gets closer to that of human speech, assessment should concentrate on other aspects of quality testing than linguistic functions. Synthetic speech may be virtually equivalent to human speech in all aspects, and still be lacking in certain subtle qualities. This aspect of speech output testing should be considered in a special study, looking at the effects of listening to synthetic speech in terms of fatigue and allocation of attention to secondary tasks (cf. section ). The development of efficient multi-lingual test generators addressing this aspect would be a welcome addition to our arsenal.