
Introduction

What are speech output systems?

By a speech output system we mean some artifact, whether a dedicated machine or a computer programme, that produces signals that are intended to be functionally equivalent to speech produced by humans. In the present state of affairs speech output systems generally produce audio signals only, but laboratory systems are being developed that supplement the audio signal with the visual image of the (artificial) talker's face (cf. Benoît 1991; Benoît et al. 1992). Audio-visual (or: bi-modal) speech output is more intelligible than audio-only output, especially when the audio channel is of degraded quality. In the body of this chapter we will not be concerned with bi- or multi-modal speech output systems, and concentrate on audio-only output instead. However, comments on the assessment of the visual component of bi-modal speech output have been provided by Benoît for the present chapter, which will be included in Appendix 2.

We exclude from the domain of speech output systems such devices as tape recorders and other, more advanced, systems that output speech on the basis of complete, pre-stored messages ("canned speech" or "copy synthesis"), irrespective of the type of coding or information compression used to save storage space. We crucially limit our definition to systems that allow the generation of novel messages, either from scratch (i.e. entirely by rule) or by recombining shorter pre-stored units. This definition also includes hybrid synthesis systems where individually stored words (e.g. digits) are substituted in information slots in a carrier sentence (e.g. in timetable consultation services).
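The slot-filling idea behind such hybrid systems can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the carrier sentence, the file names, the slot names and the function `build_message` are hypothetical and do not describe any particular system. The sketch only shows how a novel message arises from recombining a fixed carrier with individually stored words.

```python
from typing import Dict, List

# Hypothetical carrier sentence: fixed recorded phrases plus named slots.
CARRIER: List[str] = ["train_to.wav", "{city}", "departs_at.wav", "{hour}", "{minute}"]

# Hypothetical inventory of individually stored words for each slot.
SLOT_UNITS: Dict[str, Dict[str, str]] = {
    "city": {"Utrecht": "utrecht.wav", "Leiden": "leiden.wav"},
    "hour": {str(h): f"hour_{h}.wav" for h in range(24)},
    "minute": {f"{m:02d}": f"minute_{m:02d}.wav" for m in range(0, 60, 5)},
}


def build_message(values: Dict[str, str]) -> List[str]:
    """Return the sequence of stored units to play for one novel message."""
    sequence: List[str] = []
    for item in CARRIER:
        if item.startswith("{") and item.endswith("}"):
            slot = item[1:-1]
            # A KeyError here means the requested word was never recorded.
            sequence.append(SLOT_UNITS[slot][values[slot]])
        else:
            # Fixed part of the carrier sentence, used unchanged.
            sequence.append(item)
    return sequence
```

A call such as `build_message({"city": "Leiden", "hour": "9", "minute": "05"})` yields a playback list that was never stored as a complete message, which is what places this kind of system inside the definition given above.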

It seems to us that two basic types of speech output systems have to be distinguished on the basis of their input, namely text-to-speech (TTS) and concept-to-speech (CTS). Other, more complex, systems combine characteristics of these two.

Text-to-speech. The majority of speech output systems are driven by text input. These systems convert text printed in normal orthography (generally stored in a computer memory as ASCII codes) to speech. Conventional spelling provides a reasonable indication of what sounds and words have to be output, but typically underrepresents prosodic properties of the message, such as the positions of accents, melody, and temporal organisation. The prosody of an utterance reflects the communicative intentions of the writer of the input text, which by present-day standards are virtually impossible to fully reconstruct from the text (cf. the title of a much-cited article: 'Accent is predictable, if you're a mind reader'; Bolinger 1972). The reconstruction of the writer's intentions is an implicit part of the so-called linguistic interface, i.e. the first part of most advanced text-to-speech systems. All errors in the linguistic interface may detract from the quality of the output speech, and are therefore a legitimate object of evaluation.
Concept-to-speech. In other types of speech output systems, especially dialogue systems, the communicative intentions are fully specified at the input stage: the system itself determines what message it wants to get across. It may still be the case, of course, that the dialogue system has misconstrued a user's request, and consequently issues an inappropriate response message, but this should not be considered an error on the part of the output system.
Interpreting (or translating) telephony (SL-TRANS, cf. Morimoto et al. 1990; JANUS, cf. Waibel et al. 1991) and face-to-face spoken dialogue translation (Verbmobil, cf. Wahlster 1993) combine characteristics of both TTS and CTS. In interpreting telephony, for instance, a spoken utterance in one language (e.g. Japanese) is decomposed into its linguistic message and its speaker-specific properties (e.g. voice characteristics, speed, pitch range). The linguistic message is converted to text, and transmitted. At the receiver end the text is automatically translated into another language (e.g. German) and then converted back to speech in the target language, with the synthesiser's speaker-specific parameters set such that the personal characteristics of the source speaker are approximated in the output signal. Crucially, the sender's intentions do not have to be inferred from the textual representation of the message; the intended focus distribution can be reconstructed directly from the properties of the source language speech signal.

Why speech output assessment?

In spite of the rapid progress that is being made in the field of speech technology, any speech output system available today can still be spotted for what it is: non-human, a machine. Most older systems give themselves away immediately through their robot-like melody and garbled vowels and consonants. Other, more recently developed synthesis methods using short-segment waveform concatenation techniques such as PSOLA (Moulines & Charpentier 1990) yield segmental quality that is very close to human speech (Portele et al. 1994), but still suffer from noticeable defects in matters of melody and timing.

As long as synthetic speech is inferior to human speech, speech output assessment will be a major concern. Speech technology development today is typically evaluation-driven. Large scale speech technology programmes have been launched both in the United States and in Europe (for overviews see O'Malley & Caisse 1987; Van Bezooijen & Pols 1989; Pols 1991). Especially in the European Union, with its many official languages, a strong need was felt for output quality assessment methods and standards that can be applied across languages. With this goal in mind the multi-national EU-ESPRIT SAM-project was set up (Fourcin et al. 1989). Later, the EU Expert Advisory Group on Language Engineering Standards (EAGLES) programme started, and included a working group on speech output assessment.

Speech output assessment may be of crucial importance to two interested parties: the systems designers on the one hand, and the prospective buyers and end users (possibly represented by consumer organisations) of the system on the other.

Designers are intent on improving their speech output systems. However, designers who have grown up with their system are used to all its habits; they are likely to understand its output much better than first-time users, and will overrate its performance level. Less subjective quality assessment techniques are needed in order to determine how well a system performs relative to a benchmark test, or how favourably it compares with a previous edition of the system or with other designers' systems (comparative testing or performance evaluation). To the extent that a system performs less than perfectly, the designer will have to learn which aspects and/or components of the system are flawed. Designers will therefore also be interested in diagnostic evaluation, either by doing detailed error analyses on the test results, or by running component-specific tests.
The needs of systems users (end users and/or systems providers) are different from those of designers, but they, too, rely heavily on assessment techniques. Prospective buyers will always have a specific use of their speech output system in mind. Understandably, they will want the simplest, and therefore cheapest, system that satisfies their needs. The buyer (or the buyer's consumer organisation) will therefore need an absolute yardstick in order to determine beforehand if the output speech is good enough to get the message across in the given application.

Users of this chapter

This chapter serves the potential needs of several disparate groups of readers. Since not all parts of this chapter will be equally relevant to every reader group, we will identify reader groups, and point out which parts of this chapter, after sections 1 to 3, have particular relevance for each group. The reader groups, of course, overlap to a large degree with the parties interested in speech output evaluation discussed above, but a more refined classification seems to be in order. We will distinguish the following groups of readers:

Developers of speech output tests
Speech output assessment is an expanding field. New tests become available in rapid succession, so that test developers want to keep abreast of what is new. Test developers will want to know the advantages and disadvantages of tests proposed in the literature, and need to know what requirements the next generation of tests will have to meet. This chapter discusses many alternative tests and testing methodologies, and makes recommendations as to what types of test are more suited for a specific purpose. The chapter will also indicate what direction speech output testing should take in order to meet the testing requirements of the next generation of speech output systems (section ).
Readers in this group will be interested in the section on glass box (diagnostic) testing as well as in the section on black box testing. When systems are in their early developmental stages, glass box testing will be most relevant; when systems have sufficiently matured, black box tests are in order. Field testing (section ) will generally be deferred until the end user groups and their specific applications are known. Field testing will often be conducted by, or at least in close cooperation with, systems procurers for end users (see below).
The chapter will be especially useful to those readers who are not test developers at this time but aspire to become test developers in the near future. These readers include, of course, students at the Ph.D. level who want to make a career in speech output testing. A second group of readers who are new to the field are developers of speech technology in Eastern Europe and in certain third world countries where computer technology is now widely available at affordable prices, generating an immediate need for the development of speech output systems and tests in the languages of the areas concerned.
Developers of speech output systems
A lot of speech technology research and development takes place in small high-tech companies. The research staffs are often too small to warrant the appointment of a full-time test evaluation expert, so that a lot of diagnostic do-it-yourself testing is going on. Developers who are not evaluation experts themselves will find this chapter a useful source of information. It identifies standard tests and test suites that are readily available for a range of diagnostic purposes (with addresses, contact persons, and literature references listed in appendices at the end of this chapter). The remarks made above with respect to newcomers to the evaluation field apply here as well. This type of reader should concentrate on the glass box approach (section ).
Procurers of speech output systems
At the most user-oriented end of the spectrum, procurers of systems will find our chapter of interest. Procurers, who themselves will more often than not be naive to the field of speech output technology, will not normally be interested in diagnostic testing. They will be looking for a single figure of merit on the basis of which to decide which system is best for a given range of applications. This reader group is the most difficult to deal with since its needs are the most divergent. There are no off-the-shelf tests that satisfy these needs. Rather, we will provide numerous examples which may serve as guidelines on how to go about field testing speech output systems for specific applications. This type of reader should concentrate on those parts of this chapter dealing with black box output testing and field tests (section ).

Given this division of reader groups we will present two types of recommendation, if and when we can.

The first type suggests what decisions can be made in the present situation with what is available today, or can be made available with little effort in the immediate future. These recommendations will be found throughout this chapter, in concise format and numbered.
The second type of recommendation outlines possible courses of test development for the mid and long term. Such recommendations, predominantly aimed at the evaluation experts, will be presented in the section at the end of this chapter, in less explicit format; they will not be numbered.


WWW Administrator
Fri May 19 11:53:36 MET DST 1995