The speech material in a corpus can vary from isolated sounds to complete conversations. In
general, the extent to which the experimenter has
control over the speech material decreases as it
becomes more and more spontaneous and natural.
The term natural refers to a rather intuitive concept that can be interpreted in different ways.
We regard speech as maximally natural when two or more speakers have a conversation in a familiar environment about a subject they themselves choose to talk about, since this is the situation for which speech was `invented'.
Although read-aloud speech is a commonly used speaking style (and may be regarded as a natural speaking style from a sociolinguistic point of view), we regard this style as derived from the most natural style mentioned above. When reading a text, people tend to speak more formally and to articulate more carefully than when they are involved in free conversation.
Thus, in our opinion the naturalness of speech should be
judged on a gradual scale.
It should be noted
that control
over the speech material is not always necessary and may
even be counterproductive, especially when one wants to study the variation
of speech as a function of communicative context.
However, strict control over the speech material is required for some
applications, such as the
development of speech synthesis systems.
In the following paragraphs eight types of speech data
will be distinguished.
Vowels pronounced in isolation (or in a `neutral' context, such as /hVt/) are often used as a frame of reference for experiments in which vowels from connected speech are investigated. Continuant consonants, e.g. /l, r, w, j, n, m, s, f/, can also be pronounced in isolation. Non-continuants, e.g. /p, t, k, b, d, g/, must be followed or preceded by a vowel, e.g. the `neutral' schwa /ə/.
Isolated words can be either `nonsense' words
or
existing
words. In the case of nonsense words the experimenter
can
create all possible kinds of phonotactically correct sound sequences.
This gives the opportunity to
study coarticulation in a systematic way. Nonsense words are also
used to extract models for a dictionary of phonetic elements
when a synthesis system is developed.
When
existing words are used, the number of possible sound sequences is
restricted to what is phonotactically appropriate in the lexicon of a given
language. It must be realised that
control over the sounds produced by the speakers may not be perfect,
because the pronunciation of
polysyllabic
words can be influenced by the stress pattern, which may
be ambiguous (cf. words like `record' in English).
When speakers have to read aloud a list of isolated words,
their pronunciation may be influenced
by the orthographic
representation of the words, a phenomenon known as spelling pronunciation.
Spelling pronunciation is
especially apparent in languages which form nominal compounds; if sound sequences occur across morpheme boundaries that would lead to assimilation and degemination in connected speech, one should still anticipate that in reading aloud all sounds are realised. This
phenomenon
can be circumvented by
having the speakers name the words through the presentation of pictures, but this
method can only be applied to a very limited number of
words. It is, for instance, not suitable for
abstract concepts.
The carrier sentence is one type of isolated sentence.
Carrier sentences are often used when one
wants to
get a
somewhat more natural pronunciation of (nonsense)
words in comparison with
words spoken in isolation,
especially with respect to speech rate. The test words are embedded in the
carrier
sentence, as illustrated
by the example
``I will say -- a test word -- again''. The same carrier
sentence is
used repeatedly for all occurring test words, so that the influence of the acoustic and
linguistic context
on the test words is controlled.
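As a simple illustration of how such prompting material can be prepared, the sketch below (the carrier text and word list are invented for the example) embeds each test word in one fixed carrier sentence, so that every target word appears in an identical acoustic and linguistic context.

```python
# Minimal sketch: embedding test words in a fixed carrier sentence.
# The carrier text and the word list are illustrative only.

CARRIER = "I will say {word} again"

def make_prompts(test_words):
    """Return one reading prompt per test word, all sharing the same carrier."""
    return [CARRIER.format(word=w) for w in test_words]

for prompt in make_prompts(["beet", "bit", "bet", "bat"]):
    print(prompt)
```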
More natural speech material can be obtained when `normal' (linguistically meaningful)
sentences are constructed by the experimenter. Such sentences can be used to train
phoneme based recognisers and to study,
for
instance, word stress or coarticulation in a relatively
natural linguistic context. One should note that an isolated
sentence may be interpreted by a speaker in a wider semantic context, which can
influence the pronunciation of the sentence, especially with respect to the position of
sentence accent(s).
Sometimes a semantic relation between subsequent `isolated' sentences may arise as a
result of the specific ordering
of
the speech material. Since individual speakers may imagine a different semantic context
for a specific sentence, variability in the suprasegmental
features of the test
sentences can occur. If desired,
this variability can be
reduced by using punctuation and other typographical means (for
instance, capitals or boldface characters) to indicate words that should have a sentence
accent.
A more natural way of doing this is to let each sentence be
preceded by a question that evokes sentence accents
at the
desired positions. It should be clear, however, that neither practice can be recommended in the collection
of
large corpora of telephone speech.
For many
purposes, such as the development of a phoneme-based recogniser, it is
crucial that all phonemes are represented
in
the speech corpus in sufficiently high numbers. Due to the large differences in
frequency of occurrence of the
phonemes in
the
language in general, uniform phoneme frequencies
will not be found in randomly chosen sentence material: such material will,
instead, reflect the differences in phoneme
frequencies. It is
proposed to reserve the term phonetically balanced for
speech material containing phonemes according to their
frequency
of occurrence in the general language.
Phonetically
balanced sentences may be used for speech audiometry and for testing the transmission
characteristics of communication
channels or public address systems.
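The notion of phonetic balance can be made operational by comparing the phoneme distribution of candidate material with reference frequencies for the language. The sketch below is a minimal illustration under that assumption (the figures and transcriptions are invented); it reports the total variation distance between the two distributions, where values near zero indicate well-balanced material.

```python
from collections import Counter

def phoneme_distribution(transcriptions):
    """Relative phoneme frequencies over a list of phoneme-sequence lists."""
    counts = Counter(p for trans in transcriptions for p in trans)
    total = sum(counts.values())
    return {p: n / total for p, n in counts.items()}

def balance_distance(material, reference):
    """Total variation distance between material and reference frequencies."""
    phonemes = set(material) | set(reference)
    return 0.5 * sum(abs(material.get(p, 0.0) - reference.get(p, 0.0))
                     for p in phonemes)

# Toy demo over a three-phoneme inventory (figures invented for illustration).
reference = {"a": 0.5, "t": 0.3, "s": 0.2}
material = phoneme_distribution([["a", "t", "a"], ["s", "a", "t"]])
print(balance_distance(material, reference))  # ~0.03: close to balanced
```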
Approximately uniform phoneme frequency
distributions can be achieved by using
phonetically rich sentences. For that purpose
greedy algorithms (Van Santen 1992) can be used. Suppose you want to have a set of sentences in which each phoneme of the language of interest occurs at least once. Of course, you could try to create this set of sentences yourself, but this would be difficult and time-consuming. Furthermore, you might end up with sentences that look rather `constructed'. An alternative is to search for an appropriate set of sentences in a sufficiently large text corpus, for instance, a large amount of newspaper data on CD-ROM. A greedy algorithm can then be used to obtain an approximately minimal number of sentences containing all phonemes. The following steps yield the desired test set: start from an empty selection; repeatedly add the corpus sentence that contains the largest number of phonemes not yet covered; stop as soon as every phoneme is represented.
The naturalness of the produced speech may increase even more when speakers read aloud a series of sentences that are semantically related, provided that the subjects are able and accustomed to reading aloud paragraph-length material. The prompting material can consist of a text fragment taken from, for instance, a newspaper or a book. But the text fragment can also be created by the experimenter when it is necessary to impose specific restrictions on the speech material, for instance with respect to phonemic structure, word structure, or syntactic structure. Reading aloud a text fragment is more difficult than reading aloud a list of isolated sentences. It is very likely that the speech produced by different speakers who are asked to read a text fragment will vary considerably, especially with regard to aspects like vividness, speech rate, omitted speech segments, prosody, etc. The preferred position of sentence accents in a text fragment can be indicated with capitals or boldface characters, but this is not recommended if one is interested in more natural speech.
When speech corpora are gathered for commercial applications, a common task of speakers is to read numbers or alpha-numerical expressions, such as ZIP codes. Speakers have, to some extent, the freedom to pronounce these numbers or alpha-numerical expressions as they like. For example, there appear to be substantial differences between the ways in which subjects express telephone numbers. Some may read the telephone number as a string of digits, whereas others may read it as a string of numbers containing two or more digits. In addition, it may make a difference whether the telephone number is familiar (for instance, a friend's number) or unfamiliar. The POLYPHONE corpora are good examples of corpora that contain such semi-spontaneous speech.
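To see how quickly this variability grows, the sketch below (purely illustrative) enumerates the ways a digit string can be grouped into single digits and two-digit numbers, each grouping corresponding to one possible reading of the telephone number.

```python
def readings(digits, max_group=2):
    """All ways to split a digit string into groups of at most `max_group`
    digits; each grouping corresponds to one way of reading the number."""
    if not digits:
        return [[]]
    result = []
    for size in range(1, min(max_group, len(digits)) + 1):
        for rest in readings(digits[size:], max_group):
            result.append([digits[:size]] + rest)
    return result

# A six-digit number already has 13 digit/pair readings, e.g.
# ['12', '34', '56'] or ['1', '2', '3', '4', '5', '6'].
print(len(readings("123456")))
```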
The previous types of speech
material were all concerned with the reading
aloud of some piece of text by one speaker at a time (disregarding
the naming of words through the presentation of pictures). In the present
section we will discuss spontaneous
speech from one or more
speakers. The major difference between read speech and spontaneous
speech is that the former fixes vocabulary and
syntax, whereas the latter leaves speakers free to choose their own vocabulary and syntax. The naturalness of
the produced speech increases when speakers are allowed to choose their own
words. In order to keep some control over the speech material, the
experimenter
can determine the subject the speaker has to talk about.
The subject of conversation is relatively fixed when speakers are asked to
retell a story that they heard or read shortly before. Since it is likely that
speakers will use at least some of
the words that occurred in the story, this
method allows the experimenter to gather `spontaneously' spoken versions of
specific words of interest. In a variant of this method, speakers can be asked
to invent a story based on a cartoon (without text
balloons), or on some
complex picture that is bound to evoke the words of interest. In all these
designs, monologues are involved, although a session manager
may try to guide the discourse in the desired direction. However, one
should
be aware that many naive subjects do not feel at ease in a situation in which
they must maintain a monologue for an extended period of time. Most people
feel much more comfortable in a dialogue situation. Moreover,
interview situations provide some additional control over subjects' speech,
because the interviewer determines the subject of conversation, and
subsequently guides the conversation in the desired direction.
Another kind of guided spontaneous
speech is the information dialogue, in which people attempt to obtain information about, for instance, train or plane schedules. Speakers request information from an information agent or a computer system about time and place of departure, destination, etc. In this way spontaneous speech can
be obtained, even if it concerns a very restricted subject. This paradigm is
used in the (D)ARPA Air Travel Information System (ATIS) task. Train timetable information dialogues are now being recorded in several languages, e.g. German, Dutch, and Italian, in the MLAP projects MAIS and RAILTEL.
Although a speech situation with two or more people is more natural than a
monologue, overlapping acoustic material may result from
several people speaking simultaneously. For some applications, such as
research on basic speech processes, overlapping acoustic material is difficult
or impossible to use. Of course, once dialogues have been recorded, one could try to extract speech fragments in which only a single speaker is talking. The study
of simultaneous speech from two or more speakers is important for research on
dialogue/discourse analysis, intention analysis, and
spoken language understanding. The gathering of multiple simultaneous speaker
corpora is still in its infancy. Such corpora seem indispensable to study
speech in all its relevant aspects. In addition, speech recognisers, which
are
up to now only able to deal with one speaker at a time, would eventually also
have to be able to deal with different speakers talking simultaneously. Speech
corpora containing dialogues could supply the training and testing data for
such advanced
recognisers. To make such corpora useful for research and
development purposes each individual speaker should be recorded on a separate
track, using a microphone array with very high directional sensitivity.
Additional tracks could then be
synthesised, simulating less perfect
directional sensitivity. Alternatively, subjects could be recorded in a
teleconference, although such distributed recordings would require extensive
precautions to allow one to synchronise the tracks originating
from completely
independent recorders.
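The text leaves open how such synchronisation might be done; one common approach, sketched below purely as an assumption, is to estimate the constant offset between two recorders from the cross-correlation of their signals around a shared event (such as a clap at the start of the session) and to pad the earlier track accordingly.

```python
import numpy as np

def estimate_offset(track_a, track_b):
    """How many samples later a shared event occurs in track_a than in
    track_b, estimated from the peak of the full cross-correlation."""
    corr = np.correlate(track_a, track_b, mode="full")
    return int(np.argmax(corr)) - (len(track_b) - 1)

def align(track_a, track_b):
    """Zero-pad the earlier track so both tracks start at the same instant."""
    lag = estimate_offset(track_a, track_b)
    if lag > 0:   # the event occurs later in track_a: delay track_b
        track_b = np.concatenate([np.zeros(lag), track_b])
    else:
        track_a = np.concatenate([np.zeros(-lag), track_a])
    return track_a, track_b
```

Note that this only corrects a constant offset; completely independent recorders also drift apart over time, which is one reason the precautions mentioned above would have to be extensive.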
A special type of information seeking dialogue, which is becoming increasingly
important, is the one between a human and a computer. In order to gain a clear
insight into the way people behave when they interact with computers, in the absence of computers that can actually sustain such a conversation, the Wizard of Oz technique was invented. This
technique will be briefly described in the next paragraph.
In the children's novel The Wonderful Wizard of Oz (Baum 1900) a young girl is bullied by an oracle called the Wizard of Oz. The crux of the story is that the Wizard of Oz turns out to be nothing more than a device operated by a man. In the
Wizard of Oz technique a human plays the role of
the computer in a simulated human-computer interaction. Of course, the easiest
way to learn about
the way humans behave when they have to interact with
computers would be to actually have them interact with a computer. However, in
order to be able to build a computer system that can participate in a dialogue
with a human, one has to know how a
human-computer interaction is likely to
proceed. The Wizard of Oz technique can be seen as an intermediate step in the
design of such a computer system. Because the subjects who participate in a
Wizard of Oz experiment have to be convinced
that they are
actually talking to a computer, some precautions must be taken.
For example, the wizard simulating the computer should speak with a `computer voice' (in case spoken output is required), and the wizard
should also make deliberate errors
similar to the ones that a computer could
be expected to make in the application of interest.
As Spoken Language Systems are rapidly approaching a performance level
that is acceptable for an increasing range of applications, it seems likely
that
man-machine dialogue systems will be used more and more in the near future.
For the development of such systems speech data gathered in Wizard of
Oz experiments will be indispensable, as long as
at least one part of the
system is not yet good enough for experiments with
large groups of users. A more comprehensive discussion of the Wizard of Oz
technique is given in chapter 14.
As the performance of SLSs improves, the development of new applications will
be
increasingly based on pilot experiments with a system in the loop,
i.e., with test versions of the application in which the wizard is replaced by
a computer system which has enough functionality to support the human-machine
interaction.
This type of speech material, in which speakers are allowed to choose both their own words and their own subject of conversation freely, is the most natural, especially in a dialogue situation. Most remarks made in the previous paragraph also apply to the present one. As with all natural processes, the observer's paradox can play a role in the recording of spontaneous speech: in order to obtain speech that is as natural as possible, the researcher has to observe how people speak when they are not being observed (Labov 1972). To overcome this methodological paradox, several techniques have been proposed throughout the development of sociolinguistic research (Argente 1991).
Experimental speech research has traditionally focussed on factorial experiments,
that is, experiments in which a
number of factors are defined that are hypothesised to influence some aspects
of the speech behaviour, in production or in perception. The amount of speech
in these experiments has typically been small, if only because it
was
practically impossible to record large amounts of speech in production
experiments or to generate large amounts for perception experiments. The major
causes of the limitations were the tight control of the speech needed for well-designed factorial experiments and the time required from the subjects. Tight control is necessary to prevent the outcome of factorial experiments from being meaningless: this type of experiment requires that all
conceivable factors
different from the small number under study be kept constant, whereas the
experimental factors are being varied over a limited range. It is not our
intention to criticise factorial experiments, if only because
they have contributed to virtually all the knowledge we have
about speech and because until recently there was hardly an alternative. Yet,
it must be acknowledged that, exactly due to the tight control, the speech
used in the older experiments may not
have been exactly `communicative'. In
the majority of the cases the subjects performed in situations which are quite
remote from normal communicative behaviour; therefore, some caution should be
exercised in generalising the results of
controlled
experiments to `normal communicative' speech.
Another reason to be careful in interpreting results of factorial
experiments is the possibility that the
experimenter did not completely
succeed in keeping all non-experimental factors constant: it may be the case that non-experimental factors did co-vary
with experimental ones, thereby being responsible for at least part of the
effects attributed to the experimental factor(s). One case in point is intonation research, which has focussed largely on pitch and duration effects. There is, however,
increasing evidence that other factors, like spectral structure, spectral slope, spectral dynamics, etc., also play a role, and perhaps one that
is quite important. In short: there is a danger that factorial
experiments lead to overestimating the impact of
the factors under investigation, at the cost of factors that were supposed to
be constant, but that actually co-varied so as to enforce the effects of the
experimental factors.
Now that very large corpora are becoming available, it is possible to set up another
type of experiment, in which the behaviour of one or more specific factors is
investigated in a very large, perhaps comprehensive number of different
contexts. Instead of trying to
neutralise the effect of concomitant factors by
trying to keep them constant (which will normally mean that one of the many different levels of such factors is selected, e.g. a voiceless stop as the right neighbour of the phonemes under study, or only syllables which have a prominence-lending High-Low pitch contour), one may try instead to sample many
different contexts. Of course, in
order to make this type of research
feasible, one has to assume that subject effects can be treated in exactly the
same way as context effects, because it will still be extremely difficult to
have subjects perform for very long periods of time. In
designing corpus-based
experiments one must be aware of the extreme skewing of many frequency
distributions observed in spoken language. For
instance, in all languages for which data on phoneme frequencies are available, it appears that within a system some phonemes occur much more often than others. Random sampling would leave
one with a very high likelihood of missing
infrequent phonemes
and of missing possible contexts, unless the total corpus is made excessively
large. Greedy algorithms (cf. Van Santen 1992) can be used to find the minimum amount of linguistic material that covers a maximum number of phenomena, but even with the use of greedy algorithms it cannot be guaranteed that all possibly relevant conditions are indeed covered: conditions which are not formulated as targets for the search will only be present by chance. Since complete coverage is not practically attainable, corpus research must deal with missing data in one way or another. Attempts have been made to handle missing data by means of knowledge-based arithmetic models, including all relevant parameters; alternatively, `blind' statistical modelling techniques like CART can be used. There seems to be some preference for arithmetic models, unless one can guarantee that the missing data are not concentrated in a few subspaces (cf. Van Santen 1994).
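The force of this skewing argument can be checked with a one-line calculation. Under the simplifying assumption that phoneme tokens are sampled independently at their corpus frequency, a phoneme (or context) with relative frequency p is absent from a random sample of n tokens with probability (1-p)^n, which falls off very slowly for rare events:

```python
def p_missing(freq, n_tokens):
    """Probability that an event with relative frequency `freq` is absent
    from a random sample of `n_tokens` tokens (independence assumed,
    which is of course a simplification for real speech)."""
    return (1.0 - freq) ** n_tokens

# A context occurring once per 100,000 tokens is still missing from a
# random 100,000-token sample about 37% of the time (~ 1/e).
print(round(p_missing(1e-5, 100_000), 2))
```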