By a speech output system we mean some artifact, whether a dedicated machine or a computer programme, that produces signals intended to be functionally equivalent to speech produced by humans. At present, speech output systems generally produce audio signals only, but laboratory systems are being developed that supplement the audio signal with the visual image of the (artificial) talker's face (cf. Benoît 1991; Benoît et al. 1992). Audio-visual (or bi-modal) speech output is more intelligible than audio-only output, especially when the audio channel is of degraded quality. In the body of this chapter we will not be concerned with bi- or multi-modal speech output systems, and will concentrate on audio-only output instead. However, comments on the assessment of the visual component of bi-modal speech output have been provided by Benoît for the present chapter; these are included in Appendix 2.
We exclude from the domain of speech output systems such devices as tape recorders and other, more advanced, systems that output speech on the basis of complete, pre-stored messages ("canned speech" or "copy synthesis"), irrespective of the type of coding or information compression used to save storage space. We crucially limit our definition to systems that allow the generation of novel messages, either from scratch (i.e. entirely by rule) or by recombining shorter pre-stored units. This definition also includes hybrid synthesis systems where individually stored words (e.g. digits) are substituted in information slots in a carrier sentence (e.g. in timetable consultation services).
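The slot-filling principle behind such hybrid systems can be illustrated with a minimal sketch. It is not taken from any particular system: the carrier sentence, the slot names and the inventory of stored words are all hypothetical, and text labels stand in for the stored recordings.

```python
from typing import Dict, List

# Hypothetical carrier sentence for a timetable service. Items in braces are
# information slots; all other items are fixed, pre-stored carrier segments.
CARRIER: List[str] = [
    "the train departs at", "{hour}", "{minute}", "from platform", "{platform}",
]

# Hypothetical inventory of individually stored words (labels stand in for audio).
STORED_WORDS = {"two", "four", "seven", "nine", "fifteen", "thirty"}


def fill_carrier(carrier: List[str], slots: Dict[str, str]) -> List[str]:
    """Return the ordered sequence of stored units that make up the message."""
    units = []
    for item in carrier:
        if item.startswith("{") and item.endswith("}"):
            word = slots[item[1:-1]]
            # Only words that were stored in advance can be substituted.
            if word not in STORED_WORDS:
                raise KeyError(f"no stored unit for slot word: {word!r}")
            units.append(word)
        else:
            units.append(item)
    return units


message = fill_carrier(
    CARRIER, {"hour": "nine", "minute": "fifteen", "platform": "four"}
)
print(" ".join(message))
```

The message as a whole is novel, although every unit in it was stored beforehand; a slot word that is missing from the inventory cannot be produced at all, which is what separates this approach from synthesis entirely by rule.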
It seems to us that two basic types of speech output systems have to be distinguished on the basis of their input, namely text-to-speech (TTS) and concept-to-speech (CTS). Other, more complex, systems combine characteristics of these two.
(Bolinger 1972). The reconstruction of the writer's intentions is an implicit part of the so-called linguistic interface, i.e. the first part of most advanced text-to-speech systems. All errors in the linguistic interface may detract from the quality of the output speech, and are therefore a legitimate object of evaluation.
Morimoto et al. 1990; JANUS, cf. Waibel et al. 1991) and face-to-face spoken dialogue translation (Verbmobil, cf. Wahlster 1993) combine characteristics of both TTS and CTS. In interpreting telephony, for instance, a spoken utterance in one language (e.g. Japanese) is decomposed into its linguistic message and its speaker-specific properties (e.g. voice characteristics, speed, pitch range). The linguistic message is converted to text and transmitted. At the receiver end the text is automatically translated into another language (e.g. German) and then converted back to speech in the target language, with the synthesiser's speaker-specific parameters set such that the personal characteristics of the source speaker are approximated in the output signal. Crucially, the sender's intentions do not have to be inferred from the textual representation of the message; the intended focus distribution can be reconstructed directly from the properties of the source language speech signal.
In spite of the rapid progress that is being made in the field of speech technology, any speech output system available today can still be spotted for what it is: non-human, a machine. Most older systems give themselves away immediately through their robot-like melody and garbled vowels and consonants. More recently developed synthesis methods that use short-segment waveform concatenation techniques such as PSOLA (Moulines & Charpentier 1990) yield segmental quality that is very close to human speech (Portele et al. 1994), but still suffer from noticeable defects in melody and timing.
As long as synthetic speech is inferior to human speech, speech output assessment will be a major concern. Speech technology development today is typically evaluation-driven. Large scale speech technology programmes have been launched both in the United States and in Europe (for overviews see O'Malley & Caisse 1987; Van Bezooijen & Pols 1989; Pols 1991). Especially in the European Union, with its many official languages, a strong need was felt for output quality assessment methods and standards that can be applied across languages. With this goal in mind the multi-national EU-ESPRIT SAM-project was set up (Fourcin et al. 1989). Later, the EU Expert Advisory Group on Language Engineering Standards (EAGLES) programme started, and included a working group on speech output assessment.
Speech output assessment may be of crucial importance to two interested parties: the system designers on the one hand, and the prospective buyers and end users of the system (possibly represented by consumer organisations) on the other.
This chapter serves the potential needs of several disparate groups of readers. Since not all parts of this chapter will be equally relevant to every reader group, we will identify reader groups, and point out which parts of this chapter, after sections 1-3, have particular relevance for each group. The reader groups, of course, overlap to a large degree with the parties interested in speech output evaluation discussed above, but a more refined classification seems to be in order. We will distinguish the following groups of readers:
Readers in this group will be interested in the section on glass box (diagnostic) testing as well as in the section on black box testing. When systems are in their early developmental stages, glass box testing will be most relevant; when systems have sufficiently matured, black box tests are in order. Field testing (see the section on field testing) will generally be deferred until the end user groups and their specific applications are known. Field testing will often be conducted by, or at least in close cooperation with, systems procurers for end users (see below).
The chapter will be especially useful to those readers who are not test developers at this time but aspire to become test developers in the near future. These readers include, of course, students at the Ph.D. level who want to make a career in speech output testing. A second group of readers who are new to the field are developers of speech technology in Eastern Europe and in certain third world countries where computer technology is now widely available at affordable prices, generating an immediate need for the development of speech output systems and tests in the languages of the areas concerned.
Given this division of reader groups we will present two types of recommendation, if and when we can.