
Introduction

What are speech output systems?

By a speech output system we mean some artifact, whether a dedicated machine or a computer programme, that produces signals that are intended to be functionally equivalent to speech produced by humans. In the present state of affairs speech output systems generally produce audio signals only, but laboratory systems are being developed that supplement the audio signal with the visual image of the (artificial) talker's face (cf. Benoît 1991; Benoît et al. 1992). Audio-visual (or: bi-modal) speech output is more intelligible than audio-only output, especially when the audio channel is of degraded quality. In the body of this chapter we will not be concerned with bi- or multi-modal speech output systems, and concentrate on audio-only output instead. However, comments on the assessment of the visual component of bi-modal speech output have been provided by Benoît for the present chapter; these will be included in Appendix 2.

We exclude from the domain of speech output systems such devices as tape recorders and other, more advanced, systems that output speech on the basis of complete, pre-stored messages ("canned speech" or "copy synthesis"), irrespective of the type of coding or information compression used to save storage space. We crucially limit our definition to systems that allow the generation of novel messages, either from scratch (i.e. entirely by rule) or by recombining shorter pre-stored units. This definition also includes hybrid synthesis systems where individually stored words (e.g. digits) are substituted in information slots in a carrier sentence (e.g. in timetable consultation services).
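
The slot-filling idea behind such hybrid systems can be illustrated with a minimal sketch. It is not taken from any actual system: the carrier sentence, the unit names and the function fill_carrier are all hypothetical, and unit names stand in for what would in practice be stored waveforms.

```python
# Hypothetical illustration only: the carrier template, the unit names and
# the inventory below are invented for this sketch.
CARRIER = ["the_train_to", "{destination}", "departs_at", "{hour}", "{minute}"]

# Inventory of pre-stored units (a real system would store recorded speech).
STORED_UNITS = {
    "the_train_to", "departs_at",
    "utrecht", "amsterdam",
    "nine", "ten", "fifteen", "thirty",
}


def fill_carrier(carrier, slots):
    """Return the sequence of stored unit names that makes up one message."""
    sequence = []
    for item in carrier:
        if item.startswith("{") and item.endswith("}"):
            # Information slot: look up the word chosen for this message.
            unit = slots[item[1:-1]]
        else:
            # Fixed part of the carrier sentence.
            unit = item
        if unit not in STORED_UNITS:
            raise KeyError(f"no stored unit for {unit!r}")
        sequence.append(unit)
    return sequence


message = fill_carrier(
    CARRIER, {"destination": "utrecht", "hour": "nine", "minute": "fifteen"}
)
print(" ".join(message))
```

Each combination of slot values yields a message that was never stored as a whole, which is why such systems fall within our definition, whereas playback of a complete pre-stored message does not.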

It seems to us that two basic types of speech output systems have to be distinguished on the basis of their input, namely text-to-speech (TTS) and concept-to-speech (CTS). Other, more complex, systems combine characteristics of these two.

Why speech output assessment?

In spite of the rapid progress that is being made in the field of speech technology, any speech output system available today can still be spotted for what it is: non-human, a machine. Most older systems give themselves away immediately through their robot-like melody and garbled vowels and consonants. Other, more recently developed synthesis methods using short-segment waveform concatenation techniques such as PSOLA (Moulines & Charpentier 1990) yield segmental quality that is very close to human speech (Portele et al. 1994), but still suffer from noticeable defects in matters of melody and timing.

As long as synthetic speech is inferior to human speech, speech output assessment will be a major concern. Speech technology development today is typically evaluation-driven. Large scale speech technology programmes have been launched both in the United States and in Europe (for overviews see O'Malley & Caisse 1987; Van Bezooijen & Pols 1989; Pols 1991). Especially in the European Union, with its many official languages, a strong need was felt for output quality assessment methods and standards that can be applied across languages. With this goal in mind the multi-national EU-ESPRIT SAM-project was set up (Fourcin et al. 1989). Later, the EU Expert Advisory Group on Language Engineering Standards (EAGLES) programme started, and included a working group on speech output assessment.

Speech output assessment may be of crucial importance to two interested parties: the system designers on the one hand, and the prospective buyers and end users of the system (possibly represented by consumer organisations) on the other.

Users of this chapter

This chapter serves the potential needs of several disparate groups of readers. Since not all parts of this chapter will be equally relevant to every reader group, we will identify reader groups, and point out which parts of this chapter, after sections 1-3, have particular relevance for each group. The reader groups, of course, overlap to a large degree with the parties interested in speech output evaluation discussed above, but a more refined classification seems to be in order. We will distinguish the following groups of readers:

Given this division of reader groups we will present two types of recommendation, if and when we can.






WWW Administrator
Fri May 19 11:53:36 MET DST 1995