
Specification of number and type of speakers

Besides the linguistic content of a corpus, the number and type of speakers is the second major factor in specifying and classifying corpora. Because of their idiosyncratic characteristics, speakers add substantially to the amount of variation present in a corpus.

Corpus size in terms of speakers

The number of speakers represented is one of the most important characteristics of a spoken language corpus. Based on the number of speakers, speech corpora can be roughly divided into the following three classes:

Speech corpora with few speakers

Such corpora are often used in the development of speech synthesis systems. In most cases the speech of one or two persons (typically one man and one woman) is recorded. The corpus is used to prepare dictionaries of phonetic elements (allophones, diphones, etc.) and to design prosodic models. The speech material may consist of nonsense words in which sequences of phonetic elements are systematically varied, and a series of sentences from which prosodic rules are extracted. For developing synthesis systems it is recommended to use experienced speakers. Especially when recording the material from which the segment inventory is built, it is extremely important that the speakers can keep pitch, loudness, voice quality and tempo constant.
Corpora comprising very few speakers are also common in basic speech research, especially where invasive measurements must be made. Corpora in this domain typically contain several additional signals recorded simultaneously with the acoustic speech signal; see e.g. the ESPRIT Basic Research Project on Articulatory Phonetics. The additional signals can range from the electroglottogram (which was also recorded in part of the EUROM.1 corpus and the Transnational English Corpus) to subglottal pressure recorded via tracheal puncture and EMG activity of intrinsic laryngeal muscles. It should be emphasised that very few speakers does not necessarily imply a small corpus. For instance, when one needs to record one speaker producing all three-consonant clusters in languages like Dutch, English or German, in all possible phonetic contexts, within syllables, across syllable boundaries, across word boundaries, in stressed and unstressed syllables, and at several positions in a sentence or in a prosodic contour, the amount of speech required is formidable, even when greedy search algorithms (cf. Van Santen 1992) are used to find the smallest possible number of sentences which comprise all contexts.
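
The greedy selection idea mentioned above can be illustrated with a minimal sketch. It is not the procedure of Van Santen (1992); it is the textbook greedy set-cover heuristic, applied to a made-up inventory in which each candidate sentence is described by the set of contexts it contains. The function name `greedy_cover`, the sentence ids and the (cluster, position) pairs are all hypothetical.

```python
def greedy_cover(sentences):
    """Pick sentences until every context that occurs somewhere is covered.

    `sentences` maps a sentence id to the set of contexts it contains.
    Greedy selection gives a small cover, not necessarily the smallest one.
    """
    uncovered = set().union(*sentences.values())
    chosen = []
    while uncovered:
        # Take the sentence that covers the most still-uncovered contexts;
        # sorting the ids makes ties resolve in a reproducible way.
        best = max(sorted(sentences), key=lambda s: len(sentences[s] & uncovered))
        chosen.append(best)
        uncovered -= sentences[best]
    return chosen


# Hypothetical toy inventory: contexts are (cluster, position) pairs.
toy = {
    "s1": {("str", "onset"), ("spl", "onset")},
    "s2": {("str", "onset"), ("rts", "coda"), ("lps", "coda")},
    "s3": {("spl", "onset"), ("lps", "coda")},
    "s4": {("skr", "onset")},
}
selection = greedy_cover(toy)
print(selection)  # ['s2', 's1', 's4']
```

In this toy case three of the four sentences suffice. With realistic inventories the number of contexts grows multiplicatively with every factor that is crossed (cluster, position, stress, boundary type), which is why the resulting corpus remains large even after such a selection.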

Similar remarks apply to intonation and prosody in general. If a text-to-speech system is developed that must be employed in many different applications (reading factual information in e.g. a train time table information system, or reading popular daily newspapers to blind subscribers), enormous amounts of speech are needed to capture all relevant prosodic phenomena.

Speech corpora with about 5 to 50 speakers

Speech corpora of this size are often used in experimental factorial research. The speech material can range from isolated nonsense words to a complete discourse, depending on the specific application. Studies on prosody, for instance, would require linguistic units that exceed the word level. Speakers can be men, women, or children. The speech can be recorded under high quality laboratory conditions, but also `in the field'. In general, the number of speakers and the number of repetitions of the speech phenomena that are investigated should be large enough for meaningful statistical processing if factorial experimental designs are planned. The power of a statistical test depends on the number of independent observations. If a corpus is developed for a factorial experiment, standard procedures are available and should be adhered to for determining the minimum number of speakers and/or the minimum number of utterances per speaker to allow planned statistical tests to reach a pre-specified power. These standard procedures can be found in most textbooks on statistics, such as Hayes (1963), Ferguson (1976), or Marascuilo and Serlin (1988). In designing very large vocabulary speech recognition systems, on the other hand, one will strive for a maximally broad coverage of relevant phenomena, probably at the expense of high numbers of exact replications of specific (relatively rare) phenomena.
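
As an illustration of the kind of standard procedure referred to above, the following sketch estimates the number of speakers needed per group when two groups are compared on a single measure. It is a simplification under stated assumptions: a two-sided test, the normal approximation n = 2((z_(1-alpha/2) + z_power) / d)^2, and one independent observation per speaker. The function name `speakers_per_group` and the default values of `alpha` and `power` are choices made for this example, not recommendations from the text; a full factorial design requires the procedures in the textbooks cited.

```python
from math import ceil
from statistics import NormalDist


def speakers_per_group(effect_size, alpha=0.05, power=0.80):
    """Approximate n per group for a two-sided comparison of two group means.

    effect_size is the standardised difference d between the group means
    (difference divided by the common standard deviation).
    """
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)   # critical value of the test
    z_power = z.inv_cdf(power)           # quantile for the desired power
    return ceil(2 * ((z_alpha + z_power) / effect_size) ** 2)


print(speakers_per_group(0.5))  # 63
print(speakers_per_group(0.8))  # 25
```

The required number of speakers rises steeply as the expected difference shrinks: halving the effect size roughly quadruples the number of speakers per group.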

Speech corpora with more than 50 speakers

Speech corpora of this composition are necessary to adequately train and test speaker-independent recognition systems. Speakers can be men, women, or children, depending on the application. The speech material can be limited to a list of isolated words or numbers, but it can also contain read-aloud sentences and paragraphs or even spontaneous speech in the case of interactive dialogues. Speech may be recorded under laboratory conditions or in (quiet) offices, but if a telephone recognition system is involved, the speech corpus should, of course, consist of telephone speech both for the training phase and the testing phase.

General remarks

Of course, possible applications of corpora may be quite different from the typical ones listed above; some fundamental research may, for instance, require a very large speech corpus, whereas a simple recognition system may be developed with a rather small speech corpus. Furthermore, the list of applications of speech corpora given above is not meant to be exhaustive, but it should help to illustrate the large differences between speech corpora, depending on their research goal.
Speaker recognition is a branch of speech (technology) research which has received little attention in the past decade. This is reflected in the lack of publicly available corpora to support speaker recognition research. However, it is not completely true to say that there are no corpora which are suitable for speaker recognition research; notable exceptions are the King corpus and the Switchboard corpus, both available through the LDC (cf. also the Proceedings of the ESCA workshop on Speaker Recognition in Martigny, April 1994). For a corpus to be suitable for speaker recognition research it is essential that speakers are recorded more than once, and that recordings are made on different days, in different realistic acoustic environments and with different microphones.

Speaker characteristics

How are the speakers for a speech corpus selected? Again, this strongly depends on the application one has in mind. For the development of a speech synthesis system, experienced speakers, such as news readers or actors, are most appropriate. For the training and testing of recognition systems, on the other hand, the population of interest must be suitably sampled. There is no general agreement on the exact meaning of `suitable' in this context. One definition would amount to random sampling of the population of interest. This operationalisation usually results in different numbers of samples from the subpopulations in the population of interest. For example, when the total population of army personnel is sampled, the subpopulation of women is likely to be poorly represented. In the case of the training and testing of a recognition system for the army, this female under-representation might seem acceptable, because the recogniser would have to deal mainly with male speakers. However, it may turn out that some of the influential heavy-duty users are women, in which case the recogniser should be designed to handle the few but important women with the same performance as the men. In general, random sampling has the potential drawback that extremely large numbers of samples are needed to ensure that rare but nevertheless important phenomena are included. When, where, and why rare phenomena may still be important depends on the application for which the corpus is collected. In the case of fundamental research, on the other hand, the aim is often to compare subpopulations in some respect, and then it would be more appropriate to draw an equal number of samples from all subpopulations of interest. Uniform sampling of all subpopulations of interest guarantees that all relevant variation is included in the corpus with the smallest possible number of speakers.
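
The cost of random sampling for a rare subpopulation can be made concrete with a small sketch. It assumes that speakers are drawn independently from a large population in which the subpopulation has a fixed share `p`, so that the number of its members in a sample of size n is binomially distributed. The function names and the 95% default are assumptions made for this illustration.

```python
from math import comb


def prob_at_least(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return 1.0 - sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k))


def random_sample_size(k, p, confidence=0.95):
    """Smallest random sample that contains at least k members of a
    subpopulation with share p, with the given probability."""
    n = k
    while prob_at_least(k, n, p) < confidence:
        n += 1
    return n


# One speaker from a subpopulation that makes up 10% of the population:
print(random_sample_size(1, 0.10))  # 29
```

Under these assumptions 29 randomly sampled speakers are needed before a single speaker from a 10% subpopulation is included with 95% probability, whereas uniform sampling would simply recruit that one speaker directly. The gap widens quickly when more speakers per subpopulation are required or when the subpopulation is rarer.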
The application for which the speech corpus is collected not only determines the best sampling strategy, but it also influences the choice of speakers. For example, speech processing often involves spectral analysis of the recorded speech. Several analysis techniques, such as pitch extraction or formant extraction, are less accurate for high-pitched voices (women and children) than for low-pitched voices (men). If such analysis techniques are used and the sex of the speakers is of no concern for the research goal, it would thus be sensible to select only men for the speech corpus. In general, however, it is recommended to include all possible types of speakers in a speech corpus, unless there are imperative arguments to exclude specific speaker groups. Specifically, it is strongly recommended to include equal numbers of females and males in each corpus. Speaker characteristics which are potentially important, and which should therefore be considered when selecting the speaker population, are described and discussed below.

Stable/Transient speaker characteristics

The many speaker characteristics that may influence the speech signal can be divided into two main classes: relatively stable characteristics, and transient (temporary) characteristics. Stable speaker characteristics comprise on the one hand physiological and anatomical factors such as sex, age, weight, height, smoking/drinking habits, and possible pathologies, and on the other hand geographical and sociolinguistic factors. Transient (temporary) speaker characteristics cover factors such as a cold or other mild afflictions of the speech organs, general physical condition (dependent on, for instance, the number of hours of sleep during the previous night), stress, and emotional state. Whereas transient speaker characteristics are very difficult to control, stable speaker characteristics are easier to take into account in the design of the speech corpus. For an overview of several important stable speaker characteristics, we refer to Scherer and Giles (1979). The most important stable speaker characteristics will be mentioned below.

Demographic coverage

Demographic factors form a very important set of relatively stable speaker characteristics which must be considered when designing sampling procedures for a corpus collection project. Each corpus should have sufficient demographic coverage. However, it is not always possible to determine all potentially relevant demographic factors a priori. Nor is the distribution of all factors in the total population always known. It is likely that the availability of detailed and reliable demographic data differs between the European countries. The availability of the data in less developed countries is even more questionable. In selecting speakers for inclusion in a corpus, whether certain characteristics can be assessed depends on the recording protocol. If randomly selected speakers are recorded over the telephone, many personal characteristics cannot reliably be collected: self-report from the speaker is the only means of gathering the data.

Male/Female

Sex is known to have an enormous impact on speech quality. It is not well known at what age sex-related speech characteristics become prevalent. There is some evidence that sex-related speech characteristics are only partly due to physiological and anatomical differences between the sexes; cultural factors and sex role stereotypes also play an important role. Therefore, it is possible that the age at which sex-related differences become apparent differs between cultures and therefore between languages. For general information on sex-related speech characteristics, see Smith (1979), Coates (1986), Philips Stelle Tanz (1987), and Brouwer Haan (1987). For the time being, no definitive recommendations can be given with respect to the age above which sexes should be distinguished and sampled individually. Unless the contrary can be motivated from the specific application the corpus is collected for, each corpus should comprise approximately equal numbers of speakers of both sexes. For some applications, recordings of young children may also be required. Children should be considered as a `third sex', independent from adolescent or adult females and males. Speaker sex is known or suspected to affect at least four aspects of speech behaviour.

Pitch and intensity

Women are known to have higher average pitch than men. There are also indications that average intensity in female voices is somewhat lower than in male voices. Especially the higher pitch may affect spectral analysis techniques: pitch and formant extraction may be less accurate for high-pitched female voices than for low-pitched male voices. When a corpus is recorded to develop and test parameter extraction techniques, a realistic proportion of high-pitched female voices should be present. It should be realised that there is an interaction between sampling rate and the accuracy with which pitch frequency can be determined. In female and child speech even 20 kHz sampling frequency may not be high enough to obtain sufficient accuracy, as pitch frequencies may be as high as 500 to 750 Hz. Fortunately, sampling frequency can be increased using straightforward signal processing procedures whenever the need arises.
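
The interaction between sampling rate and pitch accuracy can be illustrated numerically. The sketch below assumes a simple time-domain pitch detector that measures the pitch period in whole samples without interpolation; real pitch extractors may do better. The function names are hypothetical and the 1 Hz tolerance in the usage example is an arbitrary choice for the illustration.

```python
def f0_step_hz(f0_hz, fs_hz):
    """Change in estimated F0 when the period estimate is off by one sample."""
    period = round(fs_hz / f0_hz)  # pitch period in whole samples
    return fs_hz / period - fs_hz / (period + 1)


def upsampling_factor(f0_hz, fs_hz, max_step_hz):
    """Smallest integer factor by which the sampling rate must be raised so
    that the one-sample F0 step does not exceed max_step_hz."""
    factor = 1
    while f0_step_hz(f0_hz, fs_hz * factor) > max_step_hz:
        factor += 1
    return factor


print(round(f0_step_hz(100, 20000), 2))    # 0.5
print(round(f0_step_hz(500, 20000), 2))    # 12.2
print(upsampling_factor(500, 20000, 1.0))  # 13
```

At 20 kHz a one-sample error changes the estimate by about 0.5 Hz for a 100 Hz voice but by about 12 Hz for a 500 Hz voice, which is the effect described above. Under the stated assumptions, the rate would have to be raised by a factor of 13 to bring the step for a 500 Hz voice below 1 Hz.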

Overall spectral slope

Women are reported to tend more towards a breathy voice quality than men. It is not known whether this tendency is related to physiological and anatomical factors or whether it is mainly due to culturally determined sex role stereotypes. The steeper overall spectral slope that accompanies a breathy voice causes problems for some parametric signal processing techniques (e.g. formant extraction).

Accuracy of pronunciation

Women are reported to adhere more to standard pronunciation than men. It is not known whether this finding, made for (American) English and Dutch, generalises to other languages. It remains to be seen whether sex-related pronunciation variation is best modelled and described on the level of phonemic representations of words or on the level of the phonetic implementation of what is essentially the same phonemic form. Pending the results of further research in additional languages and cultures, this factor probably does not warrant special attention. Moreover, this aspect is very difficult, if not impossible, to separate from other sex-related factors, and will therefore be duly represented as long as the sexes are adequately represented in the corpus. Variation in pronunciation accuracy may also be caused by factors related to age and social status.

Vocabulary and syntax

Sex-related differences in vocabulary and syntax are certainly culturally determined. Here, the factor sex interacts with factors like age and social status. Differences on the level of vocabulary and syntax are only relevant when spontaneous speech is being recorded. If all speech material consists of read utterances, vocabulary and syntax are completely determined by the prompting material.

Age

Although the impact of speaker age on speech behaviour has not received much attention in previous research, there are indications that age influences at least two aspects of speech behaviour (Helfrich 1979).

Voice quality

There has been some research on the relation between age and voice quality. Most studies were concerned with the question whether speaker age can reliably be estimated from the speech signal alone. It seems that people are moderately good at guessing age from speech signal characteristics, although reported correlation coefficients may be mainly determined by the ability to discriminate between very young, very old and adult but non-senior groups. The exact signal characteristics which enable people to guess the speaker's age are not well understood; neither is it possible to estimate their impact on the performance of automatic speech and speaker recognition. Until the questions about the importance and the exact nature of the impact of age on speech signals have been answered, it is recommended that attempts be made to sample the relevant age groups. In doing so, a distinction should be made between the group under 20, the group between 20 and 60 and the group over 60. If relevant, the group under 20 should, of course, be subdivided into toddlers, children, adolescents and young adults. However, the exact ages separating these subgroups are a subject of discussion. Moreover, in many respects mental and physiological maturation may be more important than calendar age.
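
A corpus designer may want to check that all three age groups are actually represented among the recruited speakers. The sketch below shows one way to do so. The text does not say to which group speakers aged exactly 20 or 60 belong; assigning both to the middle group is an assumption of this example, as are the function names and the sample ages.

```python
from collections import Counter

GROUPS = ("under 20", "20 to 60", "over 60")


def age_group(age_years):
    """Assign a speaker to one of the three coarse age groups."""
    if age_years < 20:
        return "under 20"
    if age_years <= 60:  # assumption: ages 20 and 60 fall in the middle group
        return "20 to 60"
    return "over 60"


def group_counts(ages):
    """Count speakers per age group, reporting empty groups as 0."""
    counts = Counter(age_group(a) for a in ages)
    return {g: counts.get(g, 0) for g in GROUPS}


print(group_counts([8, 19, 20, 34, 47, 60, 61]))
# {'under 20': 2, '20 to 60': 4, 'over 60': 1}
```

A group with a count of zero, or with far fewer speakers than the others, signals that further recruitment in that age range is needed.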

Vocabulary and syntax

Here the considerations described above in the paragraph on the impact of sex on speech behaviour apply in exactly the same way. There is some literature suggesting that the vocabulary and syntax of the older generation differ from those of younger speakers. However, apart from the obvious observation that the subjects spontaneously discussed by senior citizens tend to differ, there is little hard data to support the claim that age is a more important factor than, for instance, social group and education level.

Weight and height

As with speaker age, most research in the past has concentrated on the question whether people can estimate speaker weight or speaker height from speech recordings alone (Van Dommelen 1993). It appears that people are moderately successful in this task. It will be clear that weight and height of speakers are highly correlated. The exact signal characteristics that enable people to guess the speaker's weight and height are not known. In a sufficiently large sample of speakers, most weight/height groups will probably be represented.

Smoking and drinking habits

Several investigations have shown that voice quality can change under the influence of smoking or the use of alcohol (Gilbert and Weismer 1974). One of the most common consequences of smoking and drinking is premature ageing of the mucous membrane covering the vocalis muscle, resulting in a hoarse voice quality. Excessive drinking may eventually result in brain damage, which may in turn lead to severe speech disorders. The use of drugs can have a similar effect. In those cases it would be more appropriate to speak of pathological speech.

Pathological speech

The boundary that divides pathological speech from non-pathological speech is very difficult to draw. Hoarseness due to smoking can be regarded as a very mild speech disorder, whereas more severe speech disorders include, for instance, paralysis of the vocal cords and aphasia. Speech disorders can be divided into two main classes: those where there is a clear organic (anatomical, physiological, neurological) cause, and those where there is not. The latter category is usually referred to as functional disorder. However, in many cases there is no clear-cut distinction between organic and functional speech disorders; often both types are involved, or it is unclear which of the two types is involved. Speech disorders can be described at five different levels:

Articulation disorders
This involves the distortion, deletion, or substitution of sounds or sound combinations. Usually such disorders are functional, but they may also result from lesions of the lips (e.g., a cleft lip), the palate (a cleft palate), the teeth, the tongue, the jaw, or the nose. Another possible cause of articulatory disorders is dysarthria, which results from damage to the central or peripheral nervous system and is manifested by neuromuscular disability.
Resonance disorders
This involves lesions of the oral, nasal, or laryngeal cavities. Apart from functional causes, resonance disorders can result from, for instance, surgical removal of the tonsils, a cleft palate, or nose polyps.
Voice disorders
This involves lesions of the vocal cords, referred to as dysphonia. The voice may emerge as a whisper (no vocal-cord vibration), for instance due to paralysis; or vocal-cord vibration may be present to some degree, but accompanied by excessive air flow (a `breathy' voice); or there may be irregular and therefore aperiodic vocal-cord vibration, for instance due to the growth of abnormal tissue (nodules) on the vocal cords, resulting in a `hoarse' voice quality. Dysphonia may be caused by psychological and emotional factors, such as a severe shock, or by organic factors. A serious voice disorder is cancer of the vocal cords, which may lead to the surgical removal of the larynx (laryngectomy). Although the patients can learn alternative voicing mechanisms, their speech is usually severely degraded.
Language disorders
This involves disorders that do not affect the production of the speech message, but rather its content. These disorders are usually classified under the name aphasia. Patients suffering from aphasia may, for instance, use a reduced and incomplete sentence structure, have difficulty in word finding, use an inappropriate intonation, or make erratic pauses. The cause of aphasia is brain damage due to, for instance, a stroke, thrombosis, a tumour, or excessive drinking.
Rhythm disorders
The usual terms to describe the main rhythm disorders are stuttering (or stammering) and cluttering. Stuttering is a very complex phenomenon that is characterised by, for instance, repetition of speech segments, abnormal prolongation of sound segments, words being left unfinished, or circumlocutions to avoid types of sound that cause problems. Stuttering varies enormously from person to person and from situation to situation. It is, for instance, well known that stutterers almost never stutter when they are singing. Both organic (genetic) causes and functional (environmental) causes are assumed to underlie the stuttering phenomenon. Another major category of nonfluency is cluttering. The primary characteristic here is that the patient tries to talk too quickly, and as a result introduces distortions into rhythm and articulation. The description and theoretical study of cluttering is less advanced than that of stuttering. In addition, there is a considerable overlap between the categories of stuttering and cluttering.

For many purposes it is most appropriate to build speech corpora with a large variety of speakers. However, the speaker variability should be kept within reasonable bounds. Severely pathological speech will, in general, deviate substantially from `normal' speech and thus it is usually not desirable to include this type of speech in a normal speech corpus. On the other hand, speakers with mild pathological disorders, such as hoarseness, can be included in, for instance, speech corpora designed for recognition.
Of course, research might focus specifically on pathological speech, for instance when a recogniser is developed for use as an environmental control device for handicapped persons. In that case pathological speech should of course be amply represented in the speech corpus. Pathological speech should also be present in a corpus designed to cover as much speaker variation as possible (a kind of `all-purpose' speech corpus). A more elaborate discussion of pathological speech can be found in Perkins (1977) and Crystal (1980).

Professional/Untrained speakers

Professional speakers should be selected when recording very large corpora with very few speakers, for instance to develop text-to-speech systems. The major reason to prefer professional speakers for this purpose is their ability to keep pitch, intensity and speech rate constant, not only during one recording session, but also over several sessions, which may have to be scheduled on different days, perhaps spread over several weeks or even months. One possibly important drawback of using professional speakers must be emphasised: more often than not, professional speakers are not really representative of the `normal' speech behaviour in the community. If the corpus is collected for the development of a text-to-speech system this may not be a problem. However, linguistic and phonetic findings based on a corpus comprising only speech of a small number of highly trained professional speakers should not be generalised without extreme caution.

Geographical and sociolinguistic factors

It is well known that both the regional and the sociolinguistic background of speakers can have a large effect on their speech. People speak differently dependent on the specific region(s) in which they were brought up, and dependent on factors such as the linguistic background of the parents, social status, and education level. It is widely assumed that the high-school period is most decisive for the regional or dialectal colouring in one's speech. Therefore it is strongly recommended to obtain information about the high-school period when collecting data about the speaker's background.
Dialectal speech or regional/dialectal colouring of the prestige variant of a language (like Received Pronunciation (RP) in British English or Hochdeutsch in Germany) is known to be perhaps the most important source of speaker-related variation. Not all languages have a widely accepted and well documented pronunciation standard, like RP in English. Given the enormous amount of literature on dialectology one would assume that the impact of dialects on standard speech is well understood. Unfortunately, this does not appear to be the case. Linguists and dialectologists appear to disagree about the number of major dialects in a language area, and about the boundaries between the areas where a specific dialect is spoken. Moreover, the majority of the dialect studies were based on written questionnaires. Although there are large amounts of recorded dialectal speech stored in the national dialectology institutes, these recordings do not qualify as corpora, because they exist only on analogue tapes, with little or no detailed annotation.
In collecting new corpora the factor regional/dialectal colouring should be properly accounted for. However, since the basic data to determine number of dialects and dialect boundaries are difficult to obtain and probably not always reliable, it is recommended that dialect is operationalised by geographic region. If necessary, processing of the corpus data can yield post hoc data on dialect differences. However, experience has shown that post hoc determination of the dialect background of a speaker as part of the transliteration/transcription process poses considerable difficulties. There is one additional factor which complicates the procedures for sampling dialectal influence, viz. the increasing mobility of the population. It is acknowledged that the impact of mobility is different between language areas and between countries. However, in sampling for a number of large telephone speech corpora in the U.S.A. (POLYPHONE; Voice Across America) a special `dialect' called Army Brat was defined, for those speakers who had lived for short periods of time in many different parts of the country. It should be noted that the factor dialect affects more than pronunciation. More often than not, its impact on vocabulary and perhaps also on syntax is at least as important. Of course, the impact on vocabulary etc. can only come to light in corpus collection paradigms which allow the speakers to select their own words. In corpora comprising only read speech this factor should have no effect.
Sociolects can be regarded as dialects spoken by a particular social class. A clear distinction between different social classes exists, for instance, in India, where each member of the society belongs to a specific caste. However, in most cultures it is very difficult to distinguish between social classes. The division into three categories (lower class, middle class, upper class) seems to be most widely accepted for Western cultures. Elaborate schemes have been designed to determine a person's social class using factors such as education, profession, and income. In addition to social class membership a person's sociolect is, of course, also influenced by the linguistic background of the parents and the dialect regions in which the person grew up. As is the case for dialects, sociolects may influence not only pronunciation, but also syntax and vocabulary. It is recommended that sociolects should be properly accounted for when collecting new corpora. It has been found that the impact of sociolects on speech behaviour strongly interacts with speaking style. Thus, the speech of a pipe-fitter who speaks in a formal way may resemble the speech of a salesman who speaks in a casual way (cf. Labov 1972). This phenomenon probably also applies to regional dialects.
There is considerable uncertainty on how to treat dialects, regiolects and sociolects in corpora collected to develop speech technology, e.g. to develop connected speech recognition systems for use in telephone information systems. There may be large differences between countries and cultures in what is most appropriate in this respect. Of course, each operational recognition system should be able to handle the range of dialectal and regiolectal influences present in the speech of upper and middle class speakers produced in somewhat formal situations. The extent to which dialectal influences occurring in less formal speech, or in formal speech of lower class speakers, must also be covered will depend very much on the application for which the recogniser is being developed. Another extremely important factor is the social acceptability of using strongly dialectal speech in a given situation. Acceptance is likely to differ strongly between regions in a given country.
If telephone applications are designed in such a way that all calls originating from a specific part of the country are handled in a local centre, one may envisage recognition systems which are adapted to the local regiolect/dialect, provided that suitable training corpora can be collected. When collecting speech corpora over the telephone by soliciting input from randomly selected subjects one should specify strict guidelines for deciding whether or not a specific speaker deviates too much from the `standard' language to be included in the corpus.
The speech of non-native speakers can be regarded as a special `sociolect'. Some non-native speakers may speak the standard language of the country they reside in with only a slight accent, whereas others may speak the standard language with a very marked accent and/or a poor control over vocabulary and grammar. There seems to be no reason to exclude the former group of non-native speakers from a common speech corpus, whereas the latter group of non-native speakers would preferably be excluded, unless the research is specifically aimed at non-native speech or one wants to build an `all-purpose' speech corpus.



WWW Administrator
Fri May 19 11:53:36 MET DST 1995