As mentioned earlier, speech corpora are always designed for specific purposes. These
purposes determine the content and make-up of a
corpus. Thus, a speech therapist interested in pathological speech will
collect a completely different
corpus than a designer of a telephone response application. For example, in the first case hi-fi speech
recordings are most probably needed in order to study properties of voice
quality, whereas in the latter case one
should collect speech over the telephone, which will result in
rather poor speech quality.
In this section we will present a --- non-exhaustive --- list of possible users of speech corpora together
with the specific speech corpora they would
need.
A distinction will be made between corpora for research purposes and those
meant for technological applications. Of course, this does not mean that corpora
gathered in one field cannot be used in the other, although the extent to which
a corpus can be reused elsewhere differs from corpus to corpus.
It must be clear that we cannot cover all the details of specific corpora, and that we will indicate
only some general properties.
The speech corpora needed for scientific purposes can be very diverse. Some researchers may need carefully pronounced lists of words to study a specific hypothesis about speech production; others may want to study samples of the vernacular, the way people speak in their everyday life. In the following paragraphs some of the major scientific fields with an interest in spoken language corpora are mentioned.
In phonetic research all aspects of speech are studied. Phonetic experiments often require carefully controlled speech data, especially when basic phenomena, such as coarticulation, have to be studied in a systematic way. In this type of research more often than not the researcher will have no alternative but to collect new data, specifically designed for the investigation at hand. However, in recent years more and more attention has been paid to uncontrolled (or less controlled) forms of speech as well, because it has become clear that results obtained for carefully pronounced speech cannot simply be generalised to more casual speech. This type of research, which also requires other experimental designs and other statistical test procedures, will profit considerably from existing corpora. Moreover, since the corpora that can support this type of research must by necessity be very large, it is very unlikely that a researcher will have the opportunity to collect new, project-specific corpora.
In sociolinguistic research variation in language use is studied in heterogeneous communities, especially urban ones. Variables of interest are, among others, age, sex, and social status. Three common methods are used to gather data in this research field. One of them is direct observation, the method used by Labov in his well-known study of /r/ (Labov 1972; see also Labov 1994). He simply wrote down whether his informants pronounced an /r/ or not in specific words. The major drawback of this method is that the data collection is based on the subjective (and possibly biased) observations of a single person. In addition, the phenomenon of interest is only heard once, at a possibly unexpected moment and in a possibly noisy environment (Labov, for instance, did his investigation in department stores).
Psycholinguistics is a very broad scientific field in which the psychology of language is studied, including language acquisition by children, the mental processes underlying adult comprehension and production of speech, language disorders, etc. Psycholinguistic experiments sometimes involve carefully controlled speech material, for instance in phoneme monitoring or gating experiments. In phoneme monitoring experiments subjects are asked to spot the first occurrence of a specific phoneme in a spoken utterance, and press a button as soon as they have spotted it. The reaction time between the actual occurrence of the phoneme and the subject's response is used to form hypotheses about underlying mental processes. In gating experiments a progressively larger portion of a word is presented to listeners, who are asked to predict what the ending will be. Both techniques can be useful to gain more insight into the organisation of the mental lexicon (Aitchison 1994).
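As an illustration of how the stimuli for a gating experiment can be prepared from recorded material, the following sketch truncates a recorded word at progressively later points and stores each gate as a separate audio file. It is a minimal example, assuming the third-party Python package soundfile; the file naming scheme and the 50 ms gate step are illustrative choices rather than established conventions.

```python
# Minimal sketch of generating gating stimuli: a recorded word is truncated
# at progressively later points ("gates") and each gate is saved separately.
# The 50 ms step and the file names are illustrative assumptions.
import numpy as np
import soundfile as sf  # third-party package, assumed to be installed

def make_gates(word_wav, out_prefix, gate_ms=50):
    audio, sr = sf.read(word_wav)
    step = int(sr * gate_ms / 1000)            # samples per gate increment
    n_gates = int(np.ceil(len(audio) / step))
    for i in range(1, n_gates + 1):
        gated = audio[: i * step]              # progressively larger portion
        sf.write(f"{out_prefix}_gate{i:02d}.wav", gated, sr)

# Example: make_gates("captain.wav", "captain")  # captain_gate01.wav, ...
```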
Another way to obtain information about the mental
lexicon and speech
production processes is to study the disfluencies in spontaneous speech. For
example, false starts tell us something about the way in which speech is
planned and articulated. Also repetitions of words or word fragments
give
information about the production and representation of speech. For this type
of research, spontaneous speech corpora are very interesting. For more
information about planning processes of speech, see Levelt (1989).
Yet
another way to gather cues about the mental lexicon is to study `slips of
the tongue'. Many tongue-slip collectors carry
round a small notebook in which they write down errors whenever they hear
them, on a bus, at parties, etc. As mentioned earlier, data
acquired in this way are subjective and unreliable. The use of speech corpora
containing spontaneous speech samples would be the answer to this problem, but
investigations in this research area would only benefit from extremely large
spontaneous speech corpora, because the number of slips of the tongue produced in any one hour of speech is fairly small.
Language acquisition by children is the subject of investigation in many
(sub-)disciplines of, among others, linguistics and psychology. For example, speech of
(young) children can be used to investigate (ir)regularities in language
(linguistics); it can also be used to learn more about the mental organisation
of language (psycholinguistics); it can be studied in relation to the
sociolinguistic background of children; or it can be used to gain more
insight into basic phonetic processes. All these scientific fields would
thus benefit from extensive corpora containing speech of children.
Collecting language acquisition corpora is extremely time consuming and
expensive, because
of the difficulty in transcribing the speech, especially
speech of very young children. In (psycho-)linguistics a considerable amount
of work has been done to collect and transcribe corpora, and to make them
available to the research community.
Presently, only transcriptions are readily accessible (CHILDES, **** add reference ****).
In the case of toddlers only `spontaneous' speech samples can be obtained. As
soon as children get somewhat older, more controlled forms of speech can
also
be obtained, such as naming pictures or reading texts. Game playing is another
way of eliciting quasi-controlled speech.
Speech acquisition corpora must be longitudinal, i.e. the same person must be
recorded repeatedly at subsequent stages
in his acquisition process.
Migration between language areas is as old as history, and probably much older.
Until recently, migrants were not hindered too much by
their lack of (adequate)
knowledge and command of the majority language in their new home countries.
Now that low-education jobs are becoming increasingly rare in First World
countries this situation is changing. Because command of the language is a
prerequisite to education, the study of how immigrants learn to control the
language of the host country (the `second' language) has become an important
topic in sociolinguistic research. The European Science Foundation, for
instance, has sponsored a
large scale project on second language learning in
several Western European countries. The research was corpus based: large
numbers of migrants were recorded every fortnight for more than a year.
Transcripts and audio tapes comprising this corpus are maintained by the
Max Planck Institute for Psycholinguistics in Nijmegen.
It is especially important to study second
language acquisition of immigrant children in
order to find out how this might influence their educational progress. In
a similar vein, research is needed into the acquisition of
the majority language by ``second generation children'' who grow up in
families which still use the language of their country of origin.
Since immigrants form a minority group in the
country they reside in, their
native language can be strongly influenced by the second language. For the
investigation of these so-called language attrition processes special purpose
corpora must be (and have been) collected. In this context one must
not only
think of African and Asian migrants who are living in the U.S.A. or Western
Europe, but also of non-Anglo Europeans who moved to the U.S.A., Canada or
Australia.
From a psycholinguistic point of view, it is interesting to study how
the
different lexicons are organised in the minds of bilingual (and multilingual)
speakers. For example, the occurrence of `blends' (combinations of two words,
in this case from different languages) shows that words are subconsciously
activated in both
languages (Green 1986). Up to now, most of the research into bilingual lexicons has taken the form of controlled experiments (e.g. cross-language priming in lexical decision tasks). It is conceivable, though, that large corpora of spontaneous speech of bilinguals could be used to study lexical and syntactic interferences between the languages.
A substantial part of modern linguistic
research is based on Chomsky's
`generative paradigm'. The goal of this (so-called mentalistic) research
programme is to eventually understand the competence of language users,
i.e. their abstract knowledge of the language system.
What speakers and
hearers actually do, i.e. their performance, is usually of less interest
to linguists in Chomsky's tradition. The construction of competence models is
generally based on introspection and impressionistic ideas about language
use.
So, in its strictest form mentalistic linguistic research cannot benefit much
from speech corpora that contain samples of the performance of language users.
However, most linguists no longer believe that performance can be
neglected
completely. For one thing, it has been noted that spontaneous speech corpora
often contain utterances which would seem implausible (if not impossible) from
introspection, but which are perfectly natural and acceptable in context.
And
conversely, sentences invented to illustrate grammatical points may be
implausible as actual utterances, because it is extremely difficult to imagine
a situation in which they would not violate discourse constraints,
aspectual perspectives taken on
events, etc. (Chafe 1992). Moreover, only an
integrated theory of competence and performance would ultimately be able
to account for actual language phenomena. In this respect speech corpora are
indispensable to fill the
gap between a competence grammar and actual language
use.
Presently, more and more linguists are starting to realise the importance of
linguistic analysis of constructs of larger size than isolated sentences or
utterances. Discourse
Analysis is the branch of Linguistics which is
concerned with the analysis of naturally occurring connected
spoken or written discourse (Stubbs 1984). Obviously, Discourse Analysis
will profit very much from large corpora
of meaningful speech, whether it is
conversational or more formal, e.g. in information seeking dialogues.
In Edwards and Lampert (1993) a comprehensive methodology is presented for the transcription and coding of discourse data from various perspectives. This book also contains a list of language corpora that might be useful in discourse research.
Audiology is the scientific study of hearing, often
including the
treatment of persons with hearing defects. A conventional audiometer can
be used to test the intensity and frequency range of pure tones that the human
ear can detect. This instrument can give a rough indication of the degree
of
hearing loss in hearing-impaired persons. Present day evaluation of hearing
includes the use of controlled speech samples to assist in the determination
of a patient's communicative capabilities.
Interest in the use of speech to measure hearing has centred on a
research orientation and a practical-clinical orientation. The first
orientation has resulted in research areas such as experimental phonetics, the
effects of various types of distortion on human speech recognition and
speaker identification, etc. The second orientation has led to research in
areas such as the effects of hearing loss on the reception of speech, auditory
processing, and the effects of modifications in the range of reception of
speech. The second area more or less grew out of the research in the first
area (O'Neill 1975).
For speech corpora to be useful in audiology they must be carefully
calibrated, establishing performance (e.g. in terms of recognition scores)
of
non-hearing impaired reference subjects. Audiological test corpora may contain
various types of speech stimuli to evaluate normal and disordered hearing
acuity. The speech stimuli can consist of isolated phonemes, nonsense words or
real words, and
also of connected forms of speech. Well-known examples are the Harvard
sentences and the Haskins sentences, as well as existing CD-ROMs
containing calibrated speech and audio material (e.g. the CD produced by TNO
in Soesterberg).
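A very simple illustration of the kind of calibration measure mentioned above is the word recognition score, the percentage of stimulus words a listener repeats correctly, which can then be compared with the scores of non-hearing-impaired reference subjects. The sketch below assumes plain orthographic responses; real audiological tests use calibrated, phonemically balanced material.

```python
# Minimal sketch of scoring an audiological speech test: the percentage of
# stimulus words repeated correctly. Word lists and responses are placeholders.
def recognition_score(presented, responses):
    """Percentage of words repeated correctly (case-insensitive)."""
    correct = sum(1 for target, resp in zip(presented, responses)
                  if target.strip().lower() == resp.strip().lower())
    return 100.0 * correct / len(presented)

presented = ["boat", "mouse", "ring", "fan"]    # hypothetical test list
responses = ["boat", "mouth", "ring", "van"]    # listener's answers
print(f"Word recognition score: {recognition_score(presented, responses):.0f}%")
```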
In speech pathology, various types of pathological speech are studied, ranging from mild disorders such as hoarseness to severe disorders such as aphasia. The aim of most studies of pathological speech is to find therapies that can alleviate or cure the speech disorder of interest. However, phenomena like aphasia can also be the subject of psycholinguistic studies, because such language disorders can shed some light on underlying mental processes (Aitchison 1994). Corpora of pathological speech can thus be very useful.
Technological applications for which speech corpora are needed can be roughly divided into four major classes: speech synthesis, speech recognition, spoken language systems, and speaker recognition/verification. Depending on the specific application, the speech corpora can be very diverse. For example, speech synthesis usually requires a large amount of speech data from one or two speakers, whereas speech recognition often requires a smaller amount of speech data from many speakers. In the following paragraphs the four domains of speech research for technological applications and the speech corpora they need are discussed.
The seemingly most natural way to synthesise speech is to model human
speech
production directly by simulating lung pressure, vocal fold vibration,
articulatory gestures, etc. However, it turns out to be extremely difficult to
determine and control the details of the model parameters in computer
simulations. This is the
reason that articulatory synthesisers have only been
moderately successful in generating perceptually important acoustic
features. Yet, modern measurement techniques have allowed the collection of
substantial amounts of measurement data. Most of these data are now being made
available to the research community (ACCOR, *** add reference ***).
A relatively simple way to build a speech synthesiser is through concatenation
of stored human speech components. In order to achieve natural
coarticulation
in the synthesised speech, it is necessary to include transition regions in
the building blocks. Often-used transition units are diphones, which
represent the transition from one phone to another. Since diphone inventories
are
derived directly from human utterances, diphone synthesis might be
expected to be inherently natural sounding. However, this is not completely
true, because the diphones have to be concatenated and in practice there will
be many diphone junctions that
do not fit properly together. In order to be
able to smooth these discontinuities, the waveform segments have to be converted
to a convenient form, such as some form of LPC parameters, often with some
inherent loss of auditory quality. Until recently it was believed that a
parametric representation was mandatory to be able to change the pitch and
timing of utterances without disturbing the spectral envelope pattern. Since
the invention of PSOLA-like techniques, high-quality pitch and time changes can
be effected in the time domain.
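The following sketch illustrates the principle of concatenative synthesis discussed above: diphone waveforms are joined with a short linear cross-fade at each junction, one simple way of smoothing discontinuities directly in the time domain. The diphone inventory, its naming scheme, and the 10 ms cross-fade are illustrative assumptions; this is not a description of any particular synthesiser.

```python
# Minimal sketch of diphone concatenation with a linear cross-fade at each
# junction. The inventory is assumed to be a dict mapping diphone names to
# numpy arrays sampled at a common rate.
import numpy as np

def concatenate_diphones(diphones, names, sr=16000, xfade_ms=10):
    xfade = int(sr * xfade_ms / 1000)
    out = diphones[names[0]].astype(float)
    ramp = np.linspace(0.0, 1.0, xfade)
    for name in names[1:]:
        nxt = diphones[name].astype(float)
        # overlap-add the junction: fade the old tail out and the new head in
        out[-xfade:] = out[-xfade:] * (1.0 - ramp) + nxt[:xfade] * ramp
        out = np.concatenate([out, nxt[xfade:]])
    return out

# Example (with a hypothetical inventory):
# wave = concatenate_diphones(inventory, ["#-h", "h-e", "e-l", "l-o", "o-#"])
```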
Another important means to generate computerised speech is through
synthesis by rule. The usual approach is to input a string of allophones to
some form of formant synthesiser. Target formant values
for each allophone are
derived from human utterances and these values are stored in large tables.
With an additional set of rules these target values can be adapted to account
for all kinds of phonological and phonetic phenomena and to generate a
proper
prosody.
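As a minimal illustration of the table-plus-rules idea behind synthesis by rule, the sketch below stores target values for the first two formants of a few allophones and interpolates linearly between successive targets to obtain smooth formant tracks that could drive a formant synthesiser. The target values and the fixed segment duration are rough, illustrative figures only.

```python
# Minimal sketch of rule-based formant control: stored allophone targets plus
# a simple interpolation rule yielding frame-by-frame formant tracks.
import numpy as np

TARGETS = {            # allophone: (F1 Hz, F2 Hz), illustrative values only
    "a": (700, 1200),
    "i": (300, 2300),
    "u": (320, 800),
}

def formant_track(allophones, frames_per_segment=20):
    points = np.array([TARGETS[a] for a in allophones], dtype=float)
    track = []
    for start, end in zip(points[:-1], points[1:]):
        steps = np.linspace(0.0, 1.0, frames_per_segment, endpoint=False)
        track.append(start + np.outer(steps, end - start))
    track.append(points[-1][None, :])
    return np.vstack(track)        # shape: (n_frames, 2)

print(formant_track(["a", "i", "u"]).shape)   # (41, 2)
```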
More detailed accounts of speech synthesis systems can be found in, for
instance,
Klatt (1987) and Holmes (1988).
For all three types of speech synthesis systems corpora are
needed
to determine the model parameters. If the user wants many different
types of voice, the speech corpus should contain various speakers for the
extraction of speaker-specific model parameters. In particular, the user might
want to be able to generate both
male and female speech. Transformations to
convert rule systems between male and female speech have had only limited
success, so it seems more convenient to include both sexes in the speech
corpus. Application specific corpora are needed to investigate
issues related
to prosody.
There are several types of speech recognition systems, which can differ in three important ways: in the recognition strategy they employ, in whether they are speaker-dependent or speaker-independent, and in the type of speech they have to recognise. These different aspects will be discussed below.
With
respect to the strategies they use, speech
recognition systems can be roughly divided into two
classes: knowledge-based systems and stochastic systems. All
state-of-the-art systems belong to the second category. In the --- now
essentially abandoned --- knowledge-based approach an attempt was made to
specify explicit acoustic-phonetic rules that are robust enough to allow
recognition of linguistically meaningful units and that
ignore irrelevant variation in these units.
Stochastic
systems, such as Hidden Markov Models (HMMs) or
neural networks, do not use explicit rules for speech
recognition. Instead, they rely on stochastic models which are
estimated or trained with (very) large amounts of speech,
using some statistical optimisation procedure (e.g. the Expectation-Maximisation or the
Baum-Welch algorithm).
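The sketch below illustrates the stochastic approach for the simple case of isolated word recognition: one Gaussian HMM per word is trained on acoustic feature vectors (e.g. MFCCs) with Baum-Welch re-estimation, and an unknown utterance is assigned to the word whose model yields the highest likelihood. It assumes the third-party hmmlearn package and pre-computed feature matrices; the number of states and iterations are illustrative.

```python
# Minimal sketch of isolated word recognition with one Gaussian HMM per word,
# assuming the hmmlearn package and feature matrices of shape (frames, dims).
import numpy as np
from hmmlearn import hmm

def train_word_models(training_data, n_states=5, n_iter=20):
    """training_data: dict mapping word -> list of feature matrices."""
    models = {}
    for word, utterances in training_data.items():
        X = np.vstack(utterances)
        lengths = [len(u) for u in utterances]
        model = hmm.GaussianHMM(n_components=n_states,
                                covariance_type="diag", n_iter=n_iter)
        model.fit(X, lengths)          # Baum-Welch (EM) re-estimation
        models[word] = model
    return models

def recognise(models, features):
    # score() returns the log likelihood of the observation sequence
    return max(models, key=lambda w: models[w].score(features))
```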
Higher level linguistic knowledge can be used to constrain the recognition
hypotheses generated at the acoustic-phonetic level. Higher
level knowledge
can be represented by knowledge-based explicit rules, for example syntactic
constraints on word order. More often it is represented by stochastic language
models, for example bigrams or trigrams that reflect
the
likelihood of a sequence of two or three words, respectively (see also Ney's
chapter).
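A bigram language model of the kind mentioned above can be estimated from a text corpus by simple counting. The sketch below uses add-one smoothing so that unseen word pairs still receive a small, non-zero probability; the two-sentence corpus is purely illustrative.

```python
# Minimal sketch of an add-one smoothed bigram language model.
from collections import Counter

corpus = [["show", "me", "flights", "to", "boston"],
          ["show", "me", "fares", "to", "denver"]]

unigrams, bigrams = Counter(), Counter()
vocab = set()
for sent in corpus:
    tokens = ["<s>"] + sent + ["</s>"]     # explicit sentence boundaries
    vocab.update(tokens)
    unigrams.update(tokens[:-1])
    bigrams.update(zip(tokens[:-1], tokens[1:]))

def p_bigram(w1, w2):
    # add-one (Laplace) smoothed conditional probability P(w2 | w1)
    return (bigrams[(w1, w2)] + 1) / (unigrams[w1] + len(vocab))

print(p_bigram("show", "me"))      # frequent pair -> relatively high
print(p_bigram("me", "denver"))    # unseen pair -> small but non-zero
```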
Speech recognition systems can be either speaker-dependent or speaker-independent. In the former case the recognition system is designed to recognise the speech of just a single person, whereas in the latter case the recognition system should be able to recognise the speech of a variety of speakers. All other things being equal, the performance of speaker-independent systems is likely to be worse than that of speaker-dependent systems, because speaker-independent systems have to deal with a considerable amount of inter-speaker variability. It is often sensible to train separate recognition models for specific subgroups of speakers, such as men and women, or speakers with different dialects (Van Compernolle et al. 1991).
Some systems can to some
extent adapt to new speakers by adjusting the
parameters of their models. This can be done in a separate training session
with a set of predetermined utterances of the new speaker, or it can be done
on-line as the recognition of the new speaker's
utterances gradually
proceeds.
Most recognition systems are very sensitive to the recording environment.
In the past, speakers used to train and develop a system were often recorded
under `laboratory' conditions, for instance in an anechoic room.
It appears that the performance of speech recognisers that are trained on such
high-quality recordings degrades severely if they are tested with some form of
`noisy' speech (Gong 1995). The use of different microphones during training sessions and test sessions also has a considerable impact on recognition performance.
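One common way to quantify this sensitivity is to degrade clean test recordings to a chosen signal-to-noise ratio and measure how much recognition performance drops. The sketch below mixes a noise sample into clean speech at a target SNR; it assumes speech and noise are available as numpy arrays at the same sampling rate.

```python
# Minimal sketch of adding noise to clean speech at a target SNR (in dB),
# e.g. to build a `noisy' test set for a recogniser trained on clean speech.
import numpy as np

def add_noise(speech, noise, snr_db):
    noise = np.resize(noise, speech.shape)          # loop/trim noise to length
    p_speech = np.mean(speech.astype(float) ** 2)
    p_noise = np.mean(noise.astype(float) ** 2)
    # scale the noise so that 10*log10(p_speech / p_noise_scaled) == snr_db
    scale = np.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10)))
    return speech + scale * noise

# Example: noisy_test = [add_noise(x, babble, snr_db=10) for x in clean_test]
```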
The third possible distinction between speech
recognition
systems is based on the type of speech they
have to recognise. The system can be designed for
isolated word recognition or for continuous
speech recognition. In the latter case word boundaries
have to be established, which can be
extremely difficult.
Nevertheless, continuous speech recognition systems
are nowadays reasonably successful, although their
performance of course strongly depends on the size of
their vocabulary.
Word Spotting can be regarded as a
special form of isolated word
recognition: the recogniser is `listening' for a limited number of words,
and these words may come embedded in background noise, possibly consisting of
speech (of competing speakers, but also of the target speaker who is producing
the word embedded in extraneous speech).
In general, two speech corpora are needed for the development of speech
recognition systems: one for the
training phase and one for the testing phase.
The training material is used to determine the model parameters of the
recognition system. The testing material is used to determine the performance
of the trained system. It is necessary to use different
speech data for
training and testing in order to get a fair evaluation of the system
performance.
For speaker-dependent systems, obviously the same speaker is used for the
training and testing phase. For
speaker-independent systems, the corpora
for training and testing could
contain the same speakers (but different speech data), or they could contain
different speakers to determine the system's robustness for new speakers.
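A speaker-disjoint division of a corpus, in which no speaker occurs in both the training and the test set, can be obtained as in the following sketch. The corpus is assumed to be a list of (speaker, utterance) pairs; the 80/20 split and the fixed random seed are illustrative choices.

```python
# Minimal sketch of a speaker-disjoint train/test split, so that the test set
# measures robustness to speakers unseen during training.
import random

def split_by_speaker(corpus, test_fraction=0.2, seed=0):
    speakers = sorted({spk for spk, _ in corpus})
    random.Random(seed).shuffle(speakers)
    n_test = max(1, int(len(speakers) * test_fraction))
    test_speakers = set(speakers[:n_test])
    train = [item for item in corpus if item[0] not in test_speakers]
    test = [item for item in corpus if item[0] in test_speakers]
    return train, test
```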
When a system is designed for isolated word recognition, it
should be trained
and tested with isolated words. And similarly, when a system is designed for
telephone speech, it should be trained and tested with telephone speech. The
design of corpora for speech recognition research thus strongly depends on
the
type of recognition system that one wants to develop. Several large corpora
for isolated word (e.g. TIDIGITS) and continuous speech recognition (e.g.
Resource Management, ATIS, and Wall Street Journal) have been
collected and made available,
especially in the American (D)ARPA
programmes.
Speech synthesis and speech recognition systems can be combined with Natural
Language Processing and Dialogue Management
systems to form a Spoken Language
System (SLS) that allows interactive communication between man and machine. A
spoken language system should be able to
recognise a person's speech, interpret the sequence of words to obtain
a
meaning in terms of the application, and provide an appropriate response
to the user.
Apart from speech corpora needed to design the speech synthesis and the
speech recognition part of the spoken language system,
speech corpora are also needed
to model relevant features of spontaneous
speech (pauses, hesitations, turn-taking behaviour, etc.) and to model
dialogue structures for a proper man-machine interaction.
An excellent
overview of spoken language systems and their problems is
given in Cole (1995). The ATIS corpora mentioned above exemplify the type of corpus needed for the development of an SLS.
The task of automatic
speaker recognition is to determine the identity of a
speaker by machine. Speaker recognition (usually called speaker
identification) can be divided into two
categories: closed-set and open-set
problems. The closed-set
problem is to identify a speaker from a group of known speakers, whereas the
open-set problem is to decide whether a speaker belongs to a group of known
speakers. Speaker verification is a special case of the open-set
problem
and refers to the task of deciding whether a speaker is who he claims
to be.
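The difference between the closed-set and the open-set problem can be made concrete with a deliberately simplified sketch: each known speaker is represented by the mean of his or her feature vectors, a test utterance is scored by cosine similarity against these models, and the open-set variant additionally applies a threshold to decide whether the best match is good enough. The threshold value is illustrative and would have to be tuned on development data; state-of-the-art systems use far more sophisticated speaker models.

```python
# Minimal sketch contrasting closed-set and open-set speaker identification
# with mean feature vectors and cosine similarity.
import numpy as np

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def closed_set_id(speaker_models, test_vec):
    # closed-set: always pick the best-matching known speaker
    return max(speaker_models, key=lambda s: cosine(speaker_models[s], test_vec))

def open_set_id(speaker_models, test_vec, threshold=0.85):
    # open-set: additionally decide whether the speaker is known at all
    best = closed_set_id(speaker_models, test_vec)
    score = cosine(speaker_models[best], test_vec)
    return best if score >= threshold else None     # None: unknown speaker

# speaker_models could be built as:
# {spk: feats.mean(axis=0) for spk, feats in enrolment_data.items()}
```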
Speaker recognition can be text-dependent or it can be
text-independent. In the former case the text in both the training phase
and the testing
phase is known, i.e., the system employs a sort of password
procedure. Knowledge of the text enables the use of systems which combine
speech and speaker recognition. However, password systems are vulnerable to
fraud using recordings of the passwords spoken by a customer. One way of
making fraud with recordings much more difficult is the use of text-prompted
techniques, whereby the customer is asked to repeat one or more
sentences randomly drawn from a very large set of possible
sentences.
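The text-prompted idea can be summarised in a few lines: at every verification attempt the system draws a fresh, unpredictable sentence from a large prompt set, so that a pre-recorded password is of little use to an impostor. The short prompt list below merely stands in for the very large set of sentences mentioned above.

```python
# Minimal sketch of text-prompted verification: draw an unpredictable prompt
# at every attempt. The prompt list is a placeholder for a very large set.
import secrets

PROMPTS = [
    "the boat sailed slowly past the harbour wall",
    "please read the third line of the blue card",
    "seven green bottles stood on the window sill",
    # ... many more sentences in a real system
]

def next_prompt():
    return secrets.choice(PROMPTS)    # cryptographically strong random choice

print("Please repeat:", next_prompt())
```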
In the case of text-independent speaker verification the acceptance
procedure should work for any text in both the training and the testing phase.
There are various possible applications of speaker recognition, for instance
helping to
identify suspects in forensic cases, or controlling access to
buildings or bank accounts. As with speech recognition, the corpora needed for speaker recognition or speaker
verification are dependent on the
specific
application. In any case a speech corpus for the training and testing
of speaker recognition/verification systems has to contain some stretch of
speech from a number of speakers. For a text-dependent system a fixed text is
read out by the
speakers; for a text-independent system the speech corpus can
contain any kind of speech, including samples of spontaneous conversational
speech. Furthermore, it is very important that speech is recorded in different
types of environments (quiet ones
and noisy ones) and that different types of
microphones are used. Especially in the case of speaker verification one has
to be sure that the speaker is recognised and not the environment. More
detailed accounts of speaker recognition can be found in
O'Shaughnessy (1986) and Gish and Schmidt (1994). LDC offers a number of corpora designed with speaker recognition research in mind, e.g. the very large Switchboard corpus.