The speech material in a corpus can vary from isolated sounds to complete conversations. In
general, the extent to which the experimenter has
control over the speech material decreases as it
becomes more and more spontaneous and natural.
The term natural refers to a rather intuitive concept that can be interpreted in different ways.
We regard speech as maximally natural when two or more speakers have a conversation in a familiar environment about a subject they themselves choose to talk about, since this is the situation for which speech was `invented'.
Although read-aloud speech is a commonly used speaking style (and may be regarded as a natural speaking style from a sociolinguistic point of view), we regard this style as derived from the most natural style mentioned above. When reading a text, people tend to speak more formally and to articulate more carefully than when they are involved in free conversation.
Thus, in our opinion the naturalness of speech should be
judged on a gradual scale.
It should be noted
that control
over the speech material is not always necessary and may
even be counterproductive, especially when one wants to study the variation
of speech as a function of communicative context.
However, strict control over the speech material is required for some
applications, such as the
development of speech synthesis systems.
In the following paragraphs eight types of speech data
will be distinguished.
Vowels pronounced in isolation (or in a `neutral' context, such as /hVt/) are often used as a frame of reference for experiments in which vowels from connected speech are investigated. Continuant consonants, e.g. /l, r, w, j, n, m, s, f/, can also be pronounced in isolation. Non-continuants, e.g. /p, t, k, b, d, g/, must be followed or preceded by a vowel, e.g. the `neutral' schwa /ə/.
Isolated words can be either `nonsense' words
or
existing
words. In the case of nonsense words the experimenter
can
create all possible kinds of phonotactically correct sound sequences.
This gives the opportunity to
study coarticulation in a systematic way. Nonsense words are also
used to extract models for a dictionary of phonetic elements
when a synthesis system is developed.
When
existing words are used, the number of possible sound sequences is
restricted to what is phonotactically appropriate in the lexicon of a given
language. It must be realised that
control over the sounds produced by the speakers may not be perfect,
because the pronunciation of
polysyllabic
words can be influenced by the stress pattern, which may
be ambiguous (cf. words like `record' in English).
When speakers have to read aloud a list of isolated words,
their pronunciation may be influenced
by the orthographic
representation of the words, a phenomenon known as spelling pronunciation.
Spelling pronunciation is
especially apparent in languages which form nominal compounds; if sound sequences occur across morpheme boundaries that would lead to assimilation and degemination in connected speech, one should still anticipate that in reading aloud all sounds are realised. This
phenomenon
can be circumvented by
having the speakers name the words through the presentation of pictures, but this
method can only be applied to a very limited number of
words. It is, for instance, not suitable for
abstract concepts.
The carrier sentence is one type of isolated sentence.
Carrier sentences are often used when one
wants to
get a
somewhat more natural pronunciation of (nonsense)
words in comparison with
words spoken in isolation,
especially with respect to speech rate. The test words are embedded in the
carrier
sentence, as illustrated
by the example
``I will say -- a test word -- again''. The same carrier
sentence is
used repeatedly for all occurring test words, so that the influence of the acoustic and
linguistic context
on the test words is controlled.
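As a simple illustration of how such prompting material can be prepared, the sketch below (the carrier text and word list are invented for the example) embeds each test word in one fixed carrier sentence, so that every target word appears in an identical acoustic and linguistic context.

```python
# Minimal sketch: embedding test words in a fixed carrier sentence.
# The carrier text and the word list are illustrative only.

CARRIER = "I will say {word} again"

def make_prompts(test_words):
    """Return one reading prompt per test word, all sharing the same carrier."""
    return [CARRIER.format(word=w) for w in test_words]

for prompt in make_prompts(["beet", "bit", "bet", "bat"]):
    print(prompt)
```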
More natural speech material can be obtained when `normal' (linguistically meaningful)
sentences are constructed by the experimenter. Such sentences can be used to train
phoneme based recognisers and to study,
for
instance, word stress or coarticulation in a relatively
natural linguistic context. One should note that an isolated
sentence may be interpreted by a speaker in a wider semantic context, which can
influence the pronunciation of the sentence, especially with respect to the position of
sentence accent(s).
Sometimes a semantic relation between subsequent `isolated' sentences may arise as a
result of the specific ordering
of
the speech material. Since individual speakers may imagine a different semantic context
for a specific sentence, variability in the suprasegmental
features of the test
sentences can occur. If desired,
this variability can be
reduced by using punctuation and other typographical means (for
instance, capitals or boldface characters) to indicate words that should have a sentence
accent.
A more natural way of doing this is to let each sentence be
preceded by a question that evokes sentence accents
at the
desired positions. It should be clear, however, that neither practice can be recommended in the collection
of
large corpora of telephone speech.
For many
purposes, such as the development of a phoneme-based recogniser, it is
crucial that all phonemes are represented
in
the speech corpus in sufficiently high numbers. Due to the large differences in
frequency of occurrence of the
phonemes in
the
language in general, uniform phoneme frequencies
will not be found in randomly chosen sentence material: such material will,
instead, reflect the differences in phoneme
frequencies. It is
proposed to reserve the term phonetically balanced for
speech material containing phonemes according to their
frequency
of occurrence in the general language.
Phonetically
balanced sentences may be used for speech audiometry and for testing the transmission
characteristics of communication
channels or public address systems.
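The notion of phonetic balance can be made operational by comparing the phoneme distribution of candidate material with reference frequencies for the language. The sketch below is a minimal illustration under that assumption (the figures and transcriptions are invented); it reports the total variation distance between the two distributions, where values near zero indicate well-balanced material.

```python
from collections import Counter

def phoneme_distribution(transcriptions):
    """Relative phoneme frequencies over a list of phoneme-sequence lists."""
    counts = Counter(p for trans in transcriptions for p in trans)
    total = sum(counts.values())
    return {p: n / total for p, n in counts.items()}

def balance_distance(material, reference):
    """Total variation distance between material and reference frequencies."""
    phonemes = set(material) | set(reference)
    return 0.5 * sum(abs(material.get(p, 0.0) - reference.get(p, 0.0))
                     for p in phonemes)

# Toy demo over a three-phoneme inventory (figures invented for illustration).
reference = {"a": 0.5, "t": 0.3, "s": 0.2}
material = phoneme_distribution([["a", "t", "a"], ["s", "a", "t"]])
print(balance_distance(material, reference))  # ~0.03: close to balanced
```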
Approximately uniform phoneme frequency
distributions can be achieved by using
phonetically rich sentences. For that purpose
greedy algorithms (Van Santen 1992) can be used. Suppose you want to have a set of sentences in which each phoneme of the language of interest occurs at least once. Of course, you could try to create this set of sentences yourself, but this would be difficult and time-consuming. Furthermore, you might end up with sentences that look rather `constructed'. An alternative is to search for an appropriate set of sentences in a sufficiently large text corpus, for instance, a large amount of newspaper data on CD-ROM. A greedy algorithm can then be used to obtain an approximately minimal number of sentences containing all phonemes. The following steps yield the desired test set: start from an empty selection; repeatedly add the corpus sentence that contains the largest number of phonemes not yet covered; stop as soon as every phoneme is represented.
The naturalness of the produced speech may increase even more when speakers read aloud a series of sentences that are semantically related, provided that the subjects are able and accustomed to reading aloud paragraph-length material. The prompting material can consist of a text fragment taken from, for instance, a newspaper or a book. But the text fragment can also be created by the experimenter when it is necessary to impose specific restrictions on the speech material, for instance with respect to phonemic structure, word structure, or syntactic structure. Reading aloud a text fragment is more difficult than reading aloud a list of isolated sentences. It is very likely that the speech produced by different speakers who are asked to read a text fragment will vary considerably, especially with regard to aspects like vividness, speech rate, omitted speech segments, prosody, etc. The preferred position of sentence accents in a text fragment can be indicated with capitals or boldface characters, but this is not recommended if one is interested in more natural speech.
When speech corpora are gathered for commercial applications, a common task of speakers is to read numbers or alpha-numerical expressions, such as ZIP codes. Speakers have, to some extent, the freedom to pronounce these numbers or alpha-numerical expressions as they like. For example, there appear to be substantial differences between the ways in which subjects express telephone numbers. Some may read the telephone number as a string of digits, whereas others may read it as a string of numbers containing two or more digits. In addition, it may make a difference whether the telephone number is familiar (for instance, a friend's number) or unfamiliar. The POLYPHONE corpora are good examples of corpora that contain such semi-spontaneous speech.
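To see how quickly this variability grows, the sketch below (purely illustrative) enumerates the ways a digit string can be grouped into single digits and two-digit numbers, each grouping corresponding to one possible reading of the telephone number.

```python
def readings(digits, max_group=2):
    """All ways to split a digit string into groups of at most `max_group`
    digits; each grouping corresponds to one way of reading the number."""
    if not digits:
        return [[]]
    result = []
    for size in range(1, min(max_group, len(digits)) + 1):
        for rest in readings(digits[size:], max_group):
            result.append([digits[:size]] + rest)
    return result

# A six-digit number already has 13 digit/pair readings, e.g.
# ['12', '34', '56'] or ['1', '2', '3', '4', '5', '6'].
print(len(readings("123456")))
```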
The previous types of speech
material were all concerned with the reading
aloud of some piece of text by one speaker at a time (disregarding
the naming of words through the presentation of pictures). In the present
section we will discuss spontaneous
speech from one or more
speakers. The major difference between read speech and spontaneous
speech is that the former fixes vocabulary and
syntax, whereas the latter leaves speakers free to choose their own vocabulary and syntax. The naturalness of
the produced speech increases when speakers are allowed to choose their own
words. In order to keep some control over the speech material, the
experimenter
can determine the subject the speaker has to talk about.
The subject of conversation is relatively fixed when speakers are asked to
retell a story that they heard or read shortly before. Since it is likely that
speakers will use at least some of
the words that occurred in the story, this
method allows the experimenter to gather `spontaneously' spoken versions of
specific words of interest. In a variant of this method, speakers can be asked
to invent a story based on a cartoon (without text
balloons), or on some
complex picture that is bound to evoke the words of interest. In all these
designs, monologues are involved, although a session manager
may try to guide the discourse in the desired direction. However, one
should
be aware that many naive subjects do not feel at ease in a situation in which
they must maintain a monologue for an extended period of time. Most people
feel much more comfortable in a dialogue situation. Moreover,
interview situations provide some additional control over subjects' speech,
because the interviewer determines the subject of conversation, and
subsequently guides the conversation in the desired direction.
Another kind of guided spontaneous
speech is the information dialogue, in which people attempt to obtain information about, for instance, train or plane schedules. Speakers request information from an information agent or a computer system about time and place of departure, destination, etc. In this way spontaneous speech can
be obtained, even if it concerns a very restricted subject. This paradigm is
used in the (D)ARPA Air Travel Information System (ATIS) task. Train timetable information dialogues are now being recorded in several languages, e.g. German, Dutch, and Italian, in the MLAP projects MAIS and RAILTEL.
Although a speech situation with two or more people is more natural than a
monologue, overlapping acoustic material may result from
several people speaking simultaneously. For some applications, such as
research on basic speech processes, overlapping acoustic material is difficult
or impossible to use. Of course, once dialogues have been recorded, one could try to extract speech fragments in which only a single speaker is talking. The study
of simultaneous speech from two or more speakers is important for research on
dialogue/discourse analysis, intention analysis, and
spoken language understanding. The gathering of multiple simultaneous speaker
corpora is still in its infancy. Such corpora seem indispensable to study
speech in all its relevant aspects. In addition, speech recognisers, which
are
up to now only able to deal with one speaker at a time, would eventually also
have to be able to deal with different speakers talking simultaneously. Speech
corpora containing dialogues could supply the training and testing data for
such advanced
recognisers. To make such corpora useful for research and
development purposes each individual speaker should be recorded on a separate
track, using a microphone array with very high directional sensitivity.
Additional tracks could then be
synthesised, simulating less perfect
directional sensitivity. Alternatively, subjects could be recorded in a
teleconference, although such distributed recordings would require extensive
precautions to allow one to synchronise the tracks originating
from completely
independent recorders.
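The text leaves open how such synchronisation might be done; one common approach, sketched below purely as an assumption, is to estimate the constant offset between two recorders from the cross-correlation of their signals around a shared event (such as a clap at the start of the session) and to pad the earlier track accordingly.

```python
import numpy as np

def estimate_offset(track_a, track_b):
    """How many samples later a shared event occurs in track_a than in
    track_b, estimated from the peak of the full cross-correlation."""
    corr = np.correlate(track_a, track_b, mode="full")
    return int(np.argmax(corr)) - (len(track_b) - 1)

def align(track_a, track_b):
    """Zero-pad the earlier track so both tracks start at the same instant."""
    lag = estimate_offset(track_a, track_b)
    if lag > 0:   # the event occurs later in track_a: delay track_b
        track_b = np.concatenate([np.zeros(lag), track_b])
    else:
        track_a = np.concatenate([np.zeros(-lag), track_a])
    return track_a, track_b
```

Note that this only corrects a constant offset; completely independent recorders also drift apart over time, which is one reason the precautions mentioned above would have to be extensive.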
A special type of information seeking dialogue, which is becoming increasingly
important, is the one between a human and a computer. In order to gain a clear
insight into the way people behave when they interact with computers, in the absence of computers that can actually sustain such a conversation, the Wizard of Oz technique was invented. This
technique will be briefly described in the next paragraph.
In the children's novel The Wonderful Wizard of Oz (Baum 1900) a young girl is bullied by an oracle called the Wizard of Oz. The crux of the story is that the Wizard of Oz turns out to be nothing more than a device operated by a man. In the
Wizard of Oz technique a human plays the role of
the computer in a simulated human-computer interaction. Of course, the easiest
way to learn about
the way humans behave when they have to interact with
computers would be to actually have them interact with a computer. However, in
order to be able to build a computer system that can participate in a dialogue
with a human, one has to know how a
human-computer interaction is likely to
proceed. The Wizard of Oz technique can be seen as an intermediate step in the
design of such a computer system. Because the subjects who participate in a
Wizard of Oz experiment have to be convinced
that they are
actually talking to a computer, some precautions must be taken.
For example, the wizard simulating the computer should speak with a `computer voice' (in case spoken output is required), and the wizard
should also make deliberate errors
similar to the ones that a computer could
be expected to make in the application of interest.
As Spoken Language Systems are rapidly approaching a performance level
that is acceptable for an increasing range of applications, it seems likely
that
man-machine dialogue systems will be used more and more in the near future.
For the development of such systems speech data gathered in Wizard of
Oz experiments will be indispensable, as long as
at least one part of the
system is not yet good enough for experiments with
large groups of users. A more comprehensive discussion of the Wizard of Oz
technique is given in chapter 14.
As the performance of SLSs improves, the development of new applications will
be
increasingly based on pilot experiments with a system in the loop,
i.e., with test versions of the application in which the wizard is replaced by
a computer system which has enough functionality to support the human-machine
interaction.
This type of speech material, in which speakers are allowed to choose both their own words and their own subject of conversation freely, is the most natural, especially in a dialogue situation. Most remarks made in the previous paragraph also apply to the present one. As with all natural processes, the observer's paradox can play a role in the recording of spontaneous speech: in order to obtain speech that is as natural as possible, the researcher has to observe how people speak when they are not being observed (Labov 1972). To overcome this methodological paradox, several techniques have been proposed throughout the development of sociolinguistic research (Argente 1991).
Experimental speech research has traditionally focussed on factorial experiments,
that is, experiments in which a
number of factors are defined that are hypothesised to influence some aspects
of the speech behaviour, in production or in perception. The amount of speech
in these experiments has typically been small, if only because it
was
practically impossible to record large amounts of speech in production
experiments or to generate large amounts for perception experiments. The major
causes of the limitations were the tight control of the speech needed for well-designed factorial experiments and the time required from the subjects. Tight control is necessary to prevent the outcome of factorial experiments from being meaningless: this type of experiment requires that all
conceivable factors
different from the small number under study be kept constant, whereas the
experimental factors are being varied over a limited range. It is not our
intention to criticise factorial experiments, if only because
they have contributed to virtually all the knowledge we have
about speech and because until recently there was hardly an alternative. Yet,
it must be acknowledged that, exactly due to the tight control, the speech
used in the older experiments may not
have been exactly `communicative'. In
the majority of the cases the subjects performed in situations which are quite
remote from normal communicative behaviour; therefore, some caution should be
exercised in generalising the results of
controlled
experiments to `normal communicative' speech.
Another reason to be careful in interpreting results of factorial
experiments is the possibility that the
experimenter did not completely
succeed in keeping all non-experimental factors constant: it may be the case that non-experimental factors did co-vary
with experimental ones, thereby being responsible for at least part of the
effects attributed to the experimental factor(s). One case in point is intonation research, which has focussed largely on pitch and duration effects. There is, however,
increasing evidence that other factors, like spectral structure, spectral slope, spectral dynamics, etc., also play a role, and perhaps one that
is quite important. In short: there is a danger that factorial
experiments lead to overestimating the impact of
the factors under investigation, at the cost of factors that were supposed to
be constant, but that actually co-varied so as to enforce the effects of the
experimental factors.
Now that very large corpora are becoming available, it is possible to set up another
type of experiment, in which the behaviour of one or more specific factors is
investigated in a very large, perhaps comprehensive number of different
contexts. Instead of trying to
neutralise the effect of concomitant factors by
trying to keep them constant (which will normally mean that one of the many different levels of such factors is selected, e.g. a voiceless stop as the right neighbour of the phonemes under study, or only syllables which have a prominence-lending High-Low pitch contour), one may try instead to sample many
different contexts. Of course, in
order to make this type of research
feasible, one has to assume that subject effects can be treated in exactly the
same way as context effects, because it will still be extremely difficult to
have subjects perform for very long periods of time. In
designing corpus-based
experiments one must be aware of the extreme skewing of many frequency
distributions observed in spoken language. For
instance, in all languages for which data on phoneme frequencies are available, it appears that within a system some phonemes occur much more often than others. Random sampling would leave
one with a very high likelihood of missing
infrequent phonemes
and of missing possible contexts, unless the total corpus is made excessively
large. Greedy algorithms (cf. Van Santen 1992) can be used to find the minimum amount of linguistic material that covers a maximum number of phenomena, but even with the use of greedy algorithms it cannot be guaranteed that all possibly relevant conditions are indeed covered: conditions which are not formulated as targets for the search will only be present by chance. Since complete coverage is not practically attainable, corpus research must deal with missing data in one way or another. Attempts have been made to handle missing data by means of knowledge-based arithmetic models, including all relevant parameters; alternatively, `blind' statistical modelling techniques like CART can be used. There seems to be some preference for arithmetic models, unless one can guarantee that the missing data are not concentrated in a few subspaces (cf. Van Santen 1994).
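The force of this skewing argument can be checked with a one-line calculation. Under the simplifying assumption that phoneme tokens are sampled independently at their corpus frequency, a phoneme (or context) with relative frequency p is absent from a random sample of n tokens with probability (1-p)^n, which falls off very slowly for rare events:

```python
def p_missing(freq, n_tokens):
    """Probability that an event with relative frequency `freq` is absent
    from a random sample of `n_tokens` tokens (independence assumed,
    which is of course a simplification for real speech)."""
    return (1.0 - freq) ** n_tokens

# A context occurring once per 100,000 tokens is still missing from a
# random 100,000-token sample about 37% of the time (~ 1/e).
print(round(p_missing(1e-5, 100_000), 2))
```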