
Applications of spoken language corpora

As mentioned earlier, speech corpora are always designed for specific purposes. These purposes determine the content and make-up of a corpus. Thus, a speech therapist interested in pathological speech will collect a completely different corpus than a designer of a telephone response application. For example, in the first case hi-fi speech recordings are most probably needed in order to study properties of voice quality, whereas in the latter case one should collect speech over the telephone, which will result in rather poor speech quality.
In this section we will present a --- non-exhaustive --- list of possible users of speech corpora together with the specific speech corpora they would need. A distinction will be made between corpora for research purposes and those meant for technological applications. Of course, this does not mean that corpora gathered in the one field cannot be used in the other, although how readily a corpus can be reused will differ from corpus to corpus. It must be clear that we cannot cover all the details of specific corpora, and that we will indicate only some general properties.

Speech corpora for research purposes

The speech corpora needed for scientific purposes can be very diverse. Some researchers may need carefully pronounced lists of words to study a specific hypothesis about speech production; others may want to study samples of the vernacular, the way people speak in their everyday life. In the following paragraphs some of the major scientific fields with an interest in spoken language corpora are mentioned.

Phonetic research

In phonetic research all aspects of speech are studied. Phonetic experiments often require carefully controlled speech data, especially when basic phenomena, such as coarticulation, have to be studied in a systematic way. In this type of research the researcher will more often than not have no alternative to collecting new data, specifically designed for the investigation at hand. However, in recent years more and more attention has been paid to uncontrolled (or less controlled) forms of speech as well, because it has become clear that results obtained for carefully pronounced speech cannot simply be generalised to more casual speech. This type of research, which requires different experimental designs and different statistical test procedures as well, will profit considerably from existing corpora. Moreover, since the corpora that can support this type of research must of necessity be very large, it is very unlikely that a researcher will have the opportunity to collect new, project-specific corpora.

Sociolinguistic research

In sociolinguistic research variation in language use is studied in heterogeneous communities, especially urban ones. Variables of interest include, among others, age, sex, and social status. Several methods are commonly used to gather data in this research field.

Dialect research is closely related to sociolect research. In dialect studies variation in language use due to differences in geographical background of speakers is investigated. Since the methods of data collection are similar to the ones used in sociolect research, the remarks made above also apply to dialect research.

Psycholinguistic research

Psycholinguistics is a very broad scientific field in which the psychology of language is studied, including language acquisition by children, the mental processes underlying adult comprehension and production of speech, language disorders, etc. Psycholinguistic experiments sometimes involve carefully controlled speech material, for instance in phoneme monitoring or gating experiments. In phoneme monitoring experiments subjects are asked to spot the first occurrence of a specific phoneme in a spoken utterance, and to press a button as soon as they have spotted it. The reaction time between the actual occurrence of the phoneme and the subject's response is used to form hypotheses about underlying mental processes. In gating experiments a progressively larger portion of a word is presented to listeners, who are asked to predict what the ending will be. Both techniques can be useful to gain more insight into the organisation of the mental lexicon (Aitchison 1994).
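As a minimal illustration of the phoneme monitoring paradigm, the sketch below computes reaction times from (phoneme onset, button press) timestamps. The trial data and all names are invented for the example; real experiments would read such timestamps from an experiment log.

# Minimal sketch (hypothetical data): reaction times in a phoneme
# monitoring experiment. Times are assumed to come from an event log.

trials = [
    # (utterance id, target phoneme onset in s, button press in s)
    ("utt01", 1.42, 1.93),
    ("utt02", 0.87, 1.31),
    ("utt03", 2.05, 2.71),
]

for utt, phoneme_onset, button_press in trials:
    rt_ms = (button_press - phoneme_onset) * 1000.0  # reaction time in ms
    print(f"{utt}: RT = {rt_ms:.0f} ms")

mean_rt = sum(bp - po for _, po, bp in trials) / len(trials) * 1000.0
print(f"mean RT = {mean_rt:.0f} ms")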
Another way to obtain information about the mental lexicon and speech production processes is to study the disfluencies in spontaneous speech. For example, false starts tell us something about the way in which speech is planned and articulated. Repetitions of words or word fragments also give information about the production and representation of speech. For this type of research, spontaneous speech corpora are very interesting. For more information about the planning processes of speech see Levelt (1989).
Yet another way to gather cues about the mental lexicon is to study `slips of the tongue'. Many tongue-slip collectors carry round a small notebook in which they write down errors whenever they hear them, on a bus, at parties, etc. As mentioned in the previous section, data acquired in this way are subjective and unreliable. The use of speech corpora containing spontaneous speech samples would be the answer to this problem, but investigations in this research area would only benefit from extremely large spontaneous speech corpora, because the number of slips of the tongue produced in any one hour of speech is fairly small.

First language acquisition

Language acquisition by children is a subject of investigation in many (sub-)disciplines of, among others, linguistics and psychology. For example, speech of (young) children can be used to investigate (ir)regularities in language (linguistics); it can also be used to learn more about the mental organisation of language (psycholinguistics); it can be studied in relation to the sociolinguistic background of children; or it can be used to gain more insight into basic phonetic processes. All these scientific fields would thus benefit from extensive corpora containing speech of children.
Collecting language acquisition corpora is extremely time consuming and expensive, because of the difficulty of transcribing the speech, especially the speech of very young children. In (psycho-)linguistics a considerable amount of work has been done to collect and transcribe corpora, and to make them available to the research community. At present, only transcriptions are readily accessible (CHILDES, **** add reference ****).
In the case of toddlers only `spontaneous' speech samples can be obtained. As soon as children get somewhat older, more controlled forms of speech can also be obtained, such as naming pictures or reading texts. Game playing is another way of eliciting quasi-controlled speech.
Speech acquisition corpora must be longitudinal, i.e. the same person must be recorded repeatedly at successive stages of the acquisition process.

Second language acquisition

Migration between language areas is as old as history, and probably much older. Until recently, migrants were not hindered too much by their lack of (adequate) knowledge and command of the majority language in their new home countries. Now that low-education jobs are becoming increasingly rare in First World countries this situation is changing. Because command of the language is a prerequisite for education, the study of how immigrants learn to master the language of the host country (the `second' language) has become an important topic in sociolinguistic research. The European Science Foundation, for instance, has sponsored a large scale project on second language learning in several Western European countries. The research was corpus-based: large numbers of migrants were recorded every fortnight for more than a year. The transcripts and audio tapes comprising this corpus are maintained by the Max Planck Institute for Psycholinguistics in Nijmegen.
It is especially important to study the second language acquisition of immigrant children in order to find out how it might influence their educational progress. In a similar vein, research is needed into the acquisition of the majority language by `second generation' children, who grow up in families which still use the language of their country of origin.
Since immigrants form a minority group in the country they reside in, their native language can be strongly influenced by the second language. For the investigation of these so-called language attrition processes special purpose corpora must be (and have been) collected. In this context one must not only think of African and Asian migrants who are living in the U.S.A. or Western Europe, but also of non-Anglo Europeans who moved to the U.S.A., Canada or Australia.
From a psycholinguistic point of view, it is interesting to study how the different lexicons are organised in the minds of bilingual (and multilingual) speakers. For example, the occurrence of `blends' (combinations of two words, in this case from different languages) shows that words are subconsciously activated in both languages (Green 1986). Up to now, most of the research into bilingual lexicons has taken the form of controlled experiments (e.g. cross-language priming in lexical decision tasks). It is conceivable, though, that large corpora of spontaneous speech of bilinguals could be used to study lexical and syntactic interferences between the languages.

General linguistic research

A substantial part of modern linguistic research is based on Chomsky's `generative paradigm'. The goal of this (so-called mentalistic) research programme is to eventually understand the competence of language users, i.e. their abstract knowledge of the language system. What speakers and hearers actually do, i.e. their performance, is usually of less interest to linguists in Chomsky's tradition. The construction of competence models is generally based on introspection and impressionistic ideas about language use. So, in its strictest form mentalistic linguistic research cannot benefit much from speech corpora that contain samples of the performance of language users.
However, most linguists no longer believe that performance can be neglected completely. For one thing, it has been noted that spontaneous speech corpora often contain utterances which would seem implausible (if not impossible) from introspection, but which are perfectly natural and acceptable in context. And conversely, sentences invented to illustrate grammatical points may be implausible as actual utterances, because it is extremely difficult to imagine a situation in which they would not violate discourse constraints, aspectual perspectives taken on events, etc. (Chafe 1992). Moreover, only an integrated theory of competence and performance would ultimately be able to account for actual language phenomena. In this respect speech corpora are indispensable to fill the gap between a competence grammar and actual language use.
Presently, more and more linguists are starting to realise the importance of linguistic analysis of structures larger than isolated sentences or utterances. Discourse Analysis is the branch of Linguistics which is concerned with the analysis of naturally occurring connected spoken or written discourse (Stubbs 1984). Obviously, Discourse Analysis will profit very much from large corpora of meaningful speech, whether it is conversational or more formal, e.g. in information seeking dialogues.
In Edwards & Lampert (1993) a comprehensive methodology is presented for the transcription and coding of discourse data from various perspectives. This book also contains a list of language corpora that might be useful in discourse research.

Audiology

Audiology is the scientific study of hearing, often including the treatment of persons with hearing defects. A conventional audiometer can be used to test the intensity and frequency range of pure tones that the human ear can detect. This instrument can give a rough indication of the degree of hearing loss in hearing-impaired persons. Present-day evaluation of hearing includes the use of controlled speech samples to assist in the determination of a patient's communicative capabilities.
Interest in the use of speech to measure hearing has centered on a research orientation and a practical-clinical orientation. The first orientation has resulted in research areas such as experimental phonetics, the effects of various types of distortion on human speech recognition and speaker identification, etc. The second orientation has led to research in areas such as the effects of hearing loss on the reception of speech, auditory processing, and the effects of modifications in the range of reception of speech. The second area more or less grew out of the research in the first area (O'Neill 1975).
For speech corpora to be useful in audiology they must be carefully calibrated, establishing the performance (e.g. in terms of recognition scores) of normal-hearing reference subjects. Audiological test corpora may contain various types of speech stimuli to evaluate normal and disordered hearing acuity. The speech stimuli can consist of isolated phonemes, nonsense words or real words, and also of connected forms of speech. [Reference must be made here to Harvard sentences, Haskins sentences, as well as to existing CD-ROMs containing calibrated speech and audio material (e.g. the CD produced by TNO in Soesterberg).]
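To make the notion of calibration concrete, the sketch below computes percent-correct word recognition scores of normal-hearing reference listeners at several presentation levels and interpolates the level at which 50% of the words are recognised (the speech reception threshold). All numbers are invented for the illustration.

import numpy as np

# Sketch: calibrating a speech audiometry word list. Scores of
# reference listeners are tabulated per presentation level, and the
# 50%-correct level is found by linear interpolation.

levels_db = np.array([10, 20, 30, 40, 50])   # presentation level (dB)
correct = np.array([4, 11, 27, 44, 49])      # words correct out of 50
score = 100.0 * correct / 50.0               # percent correct

srt = np.interp(50.0, score, levels_db)      # level at 50% correct
for level, s in zip(levels_db, score):
    print(f"{level} dB: {s:.0f}% correct")
print(f"speech reception threshold ~ {srt:.1f} dB")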

Speech pathology

In this scientific field various types of pathological speech are studied, ranging from mild disorders such as hoarseness to severe disorders such as aphasia. The aim of most studies of pathological speech is to find therapies that can alleviate or cure the speech disorder of interest. However, phenomena like aphasia can also be the subject of psycholinguistic studies, because such language disorders can shed some light on underlying mental processes (Aitchison 1994). Corpora of pathological speech can thus be very useful.

Speech corpora for technological applications

Technological applications for which speech corpora are needed can be roughly divided into four major classes: speech synthesis, speech recognition, spoken language systems, and speaker recognition/verification. Depending on the specific application, the speech corpora can be very diverse. For example, speech synthesis usually requires a large amount of speech data from one or two speakers, whereas speech recognition often requires a smaller amount of speech data from many speakers. In the following paragraphs the four domains of speech research for technological applications and the speech corpora they need are discussed.

Speech synthesis

The seemingly most natural way to synthesise speech is to model human speech production directly by simulating lung pressure, vocal fold vibration, articulatory gestures, etc. However, it turns out to be extremely difficult to determine and control the details of the model parameters in computer simulations. This is the reason that articulatory synthesisers have only been moderately successful in generating perceptually important acoustic features. Yet, modern measurement techniques have allowed the collection of substantial amounts of articulatory data. Most of these data are now being made available to the research community (ACCOR, *** add reference ***).
A relatively simple way to build a speech synthesiser is through concatenation of stored human speech components. In order to achieve natural coarticulation in the synthesised speech, it is necessary to include transition regions in the building blocks. Often-used transition units are diphones, which represent the transition from one phone to another. Since diphone inventories are derived directly from human utterances, diphone synthesis might be expected to be inherently natural sounding. However, this is not completely true, because the diphones have to be concatenated, and in practice there will be many diphone junctions that do not fit together properly. In order to be able to smooth these discontinuities the waveform segments have to be converted to a convenient form, such as some form of LPC parameters, often with some inherent loss of auditory quality. Until recently it was believed that a parametric representation was mandatory to be able to change the pitch and timing of utterances without disturbing the spectral envelope pattern. Since the invention of PSOLA-like techniques, high quality pitch and time changes can be effected in the time domain.
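The sketch below illustrates the concatenation idea in its simplest form: stored diphone waveforms are joined with a short linear crossfade at each junction to mask discontinuities. The inventory here consists of random stand-in signals rather than segments cut from human utterances, and the sampling rate and crossfade length are assumptions.

import numpy as np

# Sketch, not a real synthesiser: concatenate diphone waveforms with a
# short linear crossfade at each junction.

SR = 16000                       # sampling rate in Hz (assumed)
XFADE = int(0.005 * SR)          # 5 ms crossfade at each junction

rng = np.random.default_rng(0)
inventory = {                    # stand-in waveforms; a real inventory
    "h-e": rng.standard_normal(2000) * 0.1,   # would be cut from
    "e-l": rng.standard_normal(1800) * 0.1,   # human utterances
    "l-o": rng.standard_normal(2200) * 0.1,
}

def concatenate(diphones):
    out = inventory[diphones[0]].copy()
    fade = np.linspace(0.0, 1.0, XFADE)
    for name in diphones[1:]:
        nxt = inventory[name]
        # overlap the tail of `out` with the head of the next diphone
        out[-XFADE:] = out[-XFADE:] * (1.0 - fade) + nxt[:XFADE] * fade
        out = np.concatenate([out, nxt[XFADE:]])
    return out

speech = concatenate(["h-e", "e-l", "l-o"])
print(len(speech), "samples ~", len(speech) / SR, "s")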
Another important means of generating computerised speech is synthesis by rule. The usual approach is to input a string of allophones to some form of formant synthesiser. Target formant values for each allophone are derived from human utterances and stored in large tables. With an additional set of rules these target values can be adapted to account for all kinds of phonological and phonetic phenomena and to generate a proper prosody.
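A minimal sketch of this table-plus-rules idea: per-allophone formant targets are looked up and then adjusted by a context rule. The formant values are rough textbook figures and the nasalisation rule is a hypothetical example, not a rule from any actual synthesiser.

# Sketch of rule-based formant control (illustration only).

TARGETS = {                 # allophone -> (F1, F2, F3) targets in Hz
    "i": (280, 2250, 2900),
    "a": (710, 1100, 2540),
    "u": (310, 870, 2250),
}

def apply_rules(targets, context):
    f1, f2, f3 = targets
    # Hypothetical rule: lower F1 and F2 slightly in nasal context.
    if "nasal" in context:
        f1, f2 = f1 * 0.95, f2 * 0.97
    return (f1, f2, f3)

utterance = [("a", {"nasal"}), ("i", set()), ("u", set())]
for allophone, context in utterance:
    print(allophone, apply_rules(TARGETS[allophone], context))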
More detailed accounts of speech synthesis systems can be found in, for instance, Klatt (1987) and Holmes (1988).
For all three types of speech synthesis systems corpora are needed to determine the model parameters. If the user wants many different types of voice, the speech corpus should contain speech from various speakers for the extraction of speaker-specific model parameters. In particular, the user might want to be able to generate both male and female speech. Transformations to convert rule systems between male and female speech have had only limited success, so it seems more convenient to include both sexes in the speech corpus. Application-specific corpora are needed to investigate issues related to prosody.

Speech recognition

There are several types of speech recognition systems, which can differ in three important ways: they can be knowledge-based or stochastic, speaker-dependent or speaker-independent, and designed for isolated words or for continuous speech. These different aspects will be discussed below.

Knowledge-based vs. stochastic systems

With respect to the strategies they use, speech recognition systems can be roughly divided into two classes: knowledge-based systems and stochastic systems. All state-of-the-art systems belong to the second category. In the --- now essentially abandoned --- knowledge-based approach an attempt was made to specify explicit acoustic-phonetic rules that are robust enough to allow recognition of linguistically meaningful units and that ignore irrelevant variation in these units. Stochastic systems, such as Hidden Markov Models (HMMs) or neural networks, do not use explicit rules for speech recognition. Instead, they rely on stochastic models which are estimated or trained with (very) large amounts of speech, using some statistical optimisation procedure (e.g. the Expectation-Maximisation or Baum-Welch algorithm).
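As an illustration of what such a stochastic model computes, the sketch below implements the forward algorithm, which yields the likelihood of an observation sequence under an HMM; this quantity underlies both recognition (choosing the word model with the highest likelihood) and Baum-Welch training. All parameter values are toy numbers.

import numpy as np

# Forward algorithm for a 2-state HMM over 2 discrete symbols.

A = np.array([[0.7, 0.3],        # state transition probabilities
              [0.2, 0.8]])
B = np.array([[0.9, 0.1],        # P(symbol | state)
              [0.3, 0.7]])
pi = np.array([0.6, 0.4])        # initial state distribution

obs = [0, 1, 1, 0]               # a toy observation sequence

alpha = pi * B[:, obs[0]]        # initialisation
for o in obs[1:]:
    alpha = (alpha @ A) * B[:, o]   # induction step
print("P(observations | model) =", alpha.sum())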
Higher-level linguistic knowledge can be used to constrain the recognition hypotheses generated at the acoustic-phonetic level. This knowledge can be represented by explicit rules, for example syntactic constraints on word order. More often it is represented by stochastic language models, for example bigrams or trigrams that reflect the likelihood of a sequence of two or three words, respectively (see also Ney's chapter).
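A minimal bigram language model sketch along these lines: conditional word probabilities are estimated from a toy corpus (with add-one smoothing, an assumption of this example) and used to score a word sequence.

from collections import Counter

# Bigram language model estimated from a toy corpus.

corpus = [
    "show me flights to boston".split(),
    "show me flights to denver".split(),
    "list flights to boston".split(),
]

unigrams, bigrams = Counter(), Counter()
for sent in corpus:
    tokens = ["<s>"] + sent + ["</s>"]
    unigrams.update(tokens[:-1])
    bigrams.update(zip(tokens[:-1], tokens[1:]))

vocab = {w for s in corpus for w in s} | {"</s>"}

def p(w2, w1):
    # add-one smoothed conditional probability P(w2 | w1)
    return (bigrams[(w1, w2)] + 1) / (unigrams[w1] + len(vocab))

hyp = ["<s>"] + "show me flights to boston".split() + ["</s>"]
prob = 1.0
for w1, w2 in zip(hyp[:-1], hyp[1:]):
    prob *= p(w2, w1)
print(f"P(hypothesis) = {prob:.3e}")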

Speaker-independent vs. speaker-dependent systems

Speech recognition systems can be either speaker-dependent or speaker-independent. In the former case the recognition system is designed to recognise the speech of just a single person, whereas in the latter case the recognition system should be able to recognise the speech of a variety of speakers. All other things being equal, the performance of speaker-independent systems is likely to be worse than that of speaker-dependent systems, because speaker-independent systems have to deal with a considerable amount of inter-speaker variability. It is often sensible to train separate recognition models for specific subgroups of speakers, such as men and women, or speakers with different dialects (Van Compernolle et al. 1991).
Some systems can to some extent adapt to new speakers by adjusting the parameters of their models. This can be done in a separate training session with a set of predetermined utterances of the new speaker, or it can be done on-line as the recognition of the new speaker's utterances gradually proceeds.
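The text does not name a specific adaptation method; one classical option is MAP adaptation of Gaussian means, sketched below, in which a speaker-independent mean is shifted towards the new speaker's data as evidence accumulates. The relevance factor and all data are assumptions of the example.

import numpy as np

# Sketch of MAP adaptation of a Gaussian mean. `tau` controls how
# strongly the speaker-independent prior resists the new data.

tau = 16.0                               # relevance factor (assumed)
prior_mean = np.array([0.0, 0.0])        # speaker-independent mean
rng = np.random.default_rng(3)
new_data = rng.normal(loc=[1.0, -0.5], scale=0.3, size=(50, 2))

n = len(new_data)
adapted = (n * new_data.mean(axis=0) + tau * prior_mean) / (n + tau)
print("adapted mean:", adapted)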
Most recognition systems are very sensitive to the recording environment. In the past, the speakers used to train and develop a system were often recorded under `laboratory' conditions, for instance in an anechoic room. It appears that the performance of speech recognisers trained with such high quality recordings degrades severely if they are tested with some form of `noisy' speech (Gong 1995). The use of different microphones during training and test sessions also has a considerable impact on recognition performance.

Isolated words vs. continuous speech

The third possible distinction between speech recognition systems is based on the type of speech they have to recognise. The system can be designed for isolated word recognition or for continuous speech recognition. In the latter case word boundaries have to be established, which can be extremely difficult. Nevertheless, continuous speech recognition systems are nowadays reasonably successful, although their performance of course strongly depends on the size of their vocabulary.
Word spotting can be regarded as a special form of isolated word recognition: the recogniser is `listening' for a limited number of words, which may come embedded in background noise, possibly consisting of speech (of competing speakers, but also of the target speaker, who produces the target word embedded in extraneous speech).

Corpora for speech recognition research

In general, two speech corpora are needed for the development of speech recognition systems: one for the training phase and one for the testing phase. The training material is used to determine the model parameters of the recognition system. The testing material is used to determine the performance of the trained system. It is necessary to use different speech data for training and testing in order to get a fair evaluation of the system performance.
For speaker-dependent systems, obviously the same speaker is used for the training and testing phase. For speaker-independent systems, the corpora for training and testing could contain the same speakers (but different speech data), or they could contain different speakers to determine the system's robustness for new speakers.
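The sketch below illustrates the second option: a speaker-disjoint train/test split, so that the test set measures robustness for speakers the system has never seen. The file list is a hypothetical stand-in for a corpus inventory.

import random

# Speaker-disjoint train/test split for a speaker-independent system.

recordings = [(f"spk{i:02d}", f"spk{i:02d}_utt{j}.wav")
              for i in range(20) for j in range(5)]

speakers = sorted({spk for spk, _ in recordings})
random.seed(1)
random.shuffle(speakers)

test_speakers = set(speakers[:4])          # held-out speakers
train = [r for r in recordings if r[0] not in test_speakers]
test = [r for r in recordings if r[0] in test_speakers]

# no speaker may occur in both sets
assert not ({s for s, _ in train} & {s for s, _ in test})
print(len(train), "training files,", len(test), "test files")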
When a system is designed for isolated word recognition, it should be trained and tested with isolated words. Similarly, when a system is designed for telephone speech, it should be trained and tested with telephone speech. The design of corpora for speech recognition research thus strongly depends on the type of recognition system that one wants to develop. Several large corpora for isolated word recognition (e.g. TIDIGITS) and continuous speech recognition (e.g. Resource Management, ATIS, and Wall Street Journal) have been collected and made available, especially in the American (D)ARPA programmes.

Spoken language systems

Speech synthesis and speech recognition systems can be combined with Natural Language Processing and Dialogue Management systems to form a Spoken Language System (SLS) that allows interactive communication between man and machine. A spoken language system should be able to recognise a person's speech, interpret the sequence of words to obtain a meaning in terms of the application, and provide an appropriate response to the user.
Apart from speech corpora needed to design the speech synthesis and the speech recognition part of the spoken language system, speech corpora are also needed to model relevant features of spontaneous speech (pauses, hesitations, turn-taking behaviour, etc.) and to model dialogue structures for a proper man-machine interaction.
An excellent overview of spoken language systems and their problems is given in Cole (1995). The ATIS corpora mentioned above exemplify the type of corpus needed for the development of an SLS.

Speaker recognition/verification

The task of automatic speaker recognition is to determine the identity of a speaker by machine. Speaker recognition (usually called speaker identification) can be divided into two categories: closed-set and open-set problems. The closed-set problem is to identify a speaker from a group of known speakers, whereas the open-set problem is to decide whether a speaker belongs to a group of known speakers. Speaker verification is a special case of the open-set problem and refers to the task of deciding whether a speaker is who he claims to be.
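The closed-set/open-set distinction can be made concrete with a small sketch: a test utterance, represented here by an invented feature vector, is scored against stored speaker templates by cosine similarity. Closed-set identification picks the best-scoring speaker, while verification accepts a claimed identity only if its score clears a threshold. Feature vectors, similarity measure, and threshold are all assumptions of the example.

import numpy as np

# Closed-set identification vs. open-set verification, sketched with
# cosine similarity between a test vector and speaker templates.

rng = np.random.default_rng(2)
templates = {f"spk{i}": rng.standard_normal(32) for i in range(5)}
test_vec = templates["spk3"] + 0.1 * rng.standard_normal(32)

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

scores = {spk: cosine(test_vec, t) for spk, t in templates.items()}

# Closed-set identification: pick the best-matching known speaker.
best = max(scores, key=scores.get)
print("identified as", best)

# Verification: accept a claimed identity only above a threshold
# (the value here is arbitrary).
THRESHOLD = 0.7
claimed = "spk3"
print("verified" if scores[claimed] >= THRESHOLD else "rejected")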
Speaker recognition can be text-dependent or text-independent. In the former case the text in both the training phase and the testing phase is known, i.e. the system employs a sort of password procedure. Knowledge of the text enables the use of systems which combine speech and speaker recognition. However, password systems are vulnerable to fraud using recordings of the passwords spoken by a customer. One way of making fraud with recordings much more difficult is the use of text-prompted techniques, whereby the customer is asked to repeat one or more sentences randomly drawn from a very large set of possible sentences. In the case of text-independent speaker verification the acceptance procedure should work for any text in both the training and the testing phase.
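A text-prompted challenge is easy to sketch: the system draws a random sentence from a large prompt set, so that a pre-recorded password is of no use to an impostor. The prompt list below is a toy stand-in.

import secrets

# Text-prompted challenge: draw a random sentence from a large set.
prompts = [f"please read sentence number {i}" for i in range(10000)]
challenge = secrets.choice(prompts)
print("say:", challenge)
# The reply is then checked both for the right words (speech
# recognition) and for the right voice (speaker verification).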
There are various possible applications of speaker recognition, for instance helping to identify suspects in forensic cases, or controlling access to buildings or bank accounts. As with speech recognition, the corpora needed for speaker recognition or speaker verification depend on the specific application. In any case a speech corpus for the training and testing of speaker recognition/verification systems has to contain some stretch of speech from a number of speakers. For a text-dependent system a fixed text is read out by the speakers; for a text-independent system the speech corpus can contain any kind of speech, including samples of spontaneous conversational speech. Furthermore, it is very important that speech is recorded in different types of environments (quiet ones and noisy ones) and that different types of microphones are used. Especially in the case of speaker verification one has to be sure that it is the speaker that is recognised and not the environment. More detailed accounts of speaker recognition can be found in O'Shaughnessy (1986) and Gish & Schmidt (1994). The LDC offers a number of corpora designed with speaker recognition research in mind, e.g. the very large Switchboard corpus.


