Besides the linguistic content of a corpus, the number and type of speakers is the second major factor in specifying and classifying corpora. Because of their idiosyncratic characteristics, speakers add substantially to the amount of variation present in a corpus.
The number of speakers who are represented is one of the most important characteristics of a spoken language corpus. Based on the number of speakers in the corpus, speech corpora can be roughly divided into the following three classes:
Such corpora are often used in the development of speech
synthesis systems. In most cases the speech of one or
two persons (typically one man and one
woman) is recorded. The corpus is used
to prepare dictionaries of phonetic elements (allophones,
diphones, etc.), and to design prosodic models.
The speech material may consist of
nonsense words in
which sequences of phonetic elements are systematically varied, and a series
of sentences to extract prosodic rules. For developing
synthesis systems it is recommended to use experienced
speakers. Especially
when recording the material that serves for building the
segment inventory it is extremely important that the speakers
can keep pitch, loudness, voice quality and tempo constant.
Corpora comprising very few speakers are also common in basic speech research,
especially where invasive measurements must be made. Corpora in this
domain typically contain several additional signals recorded
simultaneously with
the acoustic speech signal, see e.g. the ESPRIT Basic
Research Project on Articulatory Phonetics. The
additional signals can range from the
Electroglottogram (which was also recorded in part of
the EUROM.1
corpus and the Transnational English Corpus) to subglottal
pressure recorded via tracheal puncture and EMG activity of intrinsic laryngeal muscles. It should be
emphasised that having very few speakers does not necessarily
imply a
small corpus. For instance, when one needs to record one speaker producing
all three-consonant clusters in languages like Dutch,
English or German, in all possible phonetic contexts, within
syllables, across syllable
boundaries, across word boundaries, in stressed and
unstressed syllables, at several positions in a
sentence or in a
prosodic contour, the amount of speech
required is formidable, even when greedy
search algorithms (cf. Van Santen 1992) are used to keep the number of sentences that together cover all contexts as small as possible.
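The greedy selection mentioned above can be sketched as a set-cover heuristic. This is only a minimal illustration, not the procedure of Van Santen (1992): the function name, the sentence identifiers and the context labels are all hypothetical, and each candidate sentence is assumed to have been annotated beforehand with the set of contexts it contains. A greedy choice usually gives a small set, but it is not guaranteed to give the smallest possible one.

```python
from typing import Dict, List, Set


def greedy_sentence_selection(sentence_contexts: Dict[str, Set[str]]) -> List[str]:
    """Pick sentences one by one until every context that occurs is covered.

    sentence_contexts maps a (hypothetical) sentence id to the set of
    phonetic contexts found in that sentence.
    """
    uncovered: Set[str] = set()
    for contexts in sentence_contexts.values():
        uncovered |= contexts

    chosen: List[str] = []
    while uncovered:
        # Take the sentence that covers the most still-uncovered contexts.
        # Iterating over sorted ids makes tie-breaking deterministic.
        best = max(
            sorted(sentence_contexts),
            key=lambda s: len(sentence_contexts[s] & uncovered),
        )
        chosen.append(best)
        uncovered -= sentence_contexts[best]
    return chosen


# Hypothetical candidate sentences with invented context labels.
candidates = {
    "s1": {"str_stressed", "spl_stressed"},
    "s2": {"str_stressed", "str_unstressed", "skr_stressed"},
    "s3": {"spl_stressed"},
    "s4": {"skr_stressed", "spl_stressed"},
}
selection = greedy_sentence_selection(candidates)
```

In this toy example two of the four candidate sentences suffice to cover all four contexts. With realistic context inventories the selected set remains large, which is the point made above.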
Similar remarks apply to intonation and prosody in general. If a text-to-speech system is developed that must be employed in many different applications (reading factual information in e.g. a train time table information system, or reading popular daily newspapers to blind subscribers), enormous amounts of speech are needed to capture all relevant prosodic phenomena.
Speech corpora of this size are often used in experimental factorial research. The speech material can range from isolated nonsense words to a complete discourse, depending on the specific application. Studies on prosody, for instance, would require linguistic units that exceed the word level. Speakers can be men, women, or children. The speech can be recorded under high quality laboratory conditions, but also `in the field'. In general, if factorial experimental designs are planned, the number of speakers and the number of repetitions of the speech phenomena under investigation should be large enough to allow meaningful statistical processing. The power of a statistical test depends on the number of independent observations. If a corpus is developed for a factorial experiment, standard procedures are available and should be adhered to for determining the minimum number of speakers and/or the minimum number of utterances per speaker that allows the planned statistical tests to reach a pre-specified power. These standard procedures can be found in most textbooks on statistics, such as Hayes (1963), Ferguson (1976), or Marascuilo and Serlin (1988). In designing very large vocabulary speech recognition systems, on the other hand, one will strive for a maximally broad coverage of relevant phenomena, probably at the expense of high numbers of exact replications of specific (relatively rare) phenomena.
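As a rough illustration of such a power calculation, the sketch below estimates the number of speakers per group needed to compare two group means. It rests on several assumptions that are not part of the text above: a two-sided test, equal group sizes, one independent observation per speaker, a standardised effect size (difference in means divided by the common standard deviation), and the normal approximation to the test statistic. The function name is hypothetical. The approximation tends to give slightly smaller numbers than the exact t-based procedures in the textbooks, so it is a first estimate only.

```python
import math
from statistics import NormalDist


def speakers_per_group(effect_size: float, alpha: float = 0.05, power: float = 0.80) -> int:
    """Approximate speakers per group for a two-sided two-group comparison.

    Uses the normal approximation n = 2 * ((z_(1-alpha/2) + z_power) / d)^2,
    where d is the standardised effect size (an assumption of this sketch).
    """
    if effect_size <= 0:
        raise ValueError("effect_size must be positive")
    if not (0 < alpha < 1 and 0 < power < 1):
        raise ValueError("alpha and power must lie strictly between 0 and 1")
    standard_normal = NormalDist()
    z_alpha = standard_normal.inv_cdf(1 - alpha / 2)
    z_power = standard_normal.inv_cdf(power)
    return math.ceil(2 * ((z_alpha + z_power) / effect_size) ** 2)


n_medium = speakers_per_group(0.5)  # medium-sized effect, power 0.80
n_large = speakers_per_group(0.8)   # large effect, power 0.80
```

Smaller expected effects or a higher required power increase the number of speakers quickly, which is why the calculation should be made before recording starts.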
Speech corpora of this composition are necessary to adequately train and test speaker-independent recognition systems. Speakers can be men, women, or children, depending on the application. The speech material can be limited to a list of isolated words or numbers, but it can also contain sentences and paragraphs read aloud, or even spontaneous speech in the case of interactive dialogues. Speech may be recorded under laboratory conditions or in (quiet) offices, but if a telephone recognition system is involved, the speech corpus should, of course, consist of telephone speech both for the training phase and the testing phase.
Of course, possible applications of corpora may be quite different from the typical ones listed above; some fundamental research may, for instance, require a very large speech corpus, whereas a simple recognition system may be developed with a rather small speech corpus. Furthermore, the list of applications of speech corpora given above is not meant to be exhaustive, but it should help to illustrate the large differences between speech corpora, depending on their research goal.
Speaker recognition is a branch of speech (technology) research which has received little attention in the past decade. This is reflected in the lack of publicly available corpora to support speaker recognition research. However, it is not completely true to say that there are no corpora suitable for speaker recognition research; notable exceptions are the King corpus and the Switchboard corpus, both available through the LDC (cf. also the Proceedings of the ESCA workshop on Speaker Recognition in Martigny, April 1994). For a corpus to be suitable for speaker recognition research it is essential that speakers are recorded more than once, and that recordings are made on different days, in different realistic acoustic environments and with different microphones.
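The requirements for a speaker recognition corpus can be checked mechanically once the recording sessions are documented. The sketch below is one possible check under stated assumptions: the `Session` record and its field names are hypothetical, and the thresholds (at least two distinct days, environments and microphones per speaker) are the minimal reading of the requirement, not a recommended value.

```python
from dataclasses import dataclass
from datetime import date
from typing import Dict, Iterable, List


@dataclass(frozen=True)
class Session:
    """One recording session (hypothetical record layout)."""
    speaker_id: str
    recorded_on: date
    environment: str
    microphone: str


def speakers_lacking_variation(sessions: Iterable[Session]) -> List[str]:
    """Return ids of speakers not recorded on at least two different days,
    in two different environments and with two different microphones."""
    by_speaker: Dict[str, List[Session]] = {}
    for session in sessions:
        by_speaker.setdefault(session.speaker_id, []).append(session)

    lacking = []
    for speaker_id in sorted(by_speaker):
        recordings = by_speaker[speaker_id]
        days = {r.recorded_on for r in recordings}
        environments = {r.environment for r in recordings}
        microphones = {r.microphone for r in recordings}
        if len(days) < 2 or len(environments) < 2 or len(microphones) < 2:
            lacking.append(speaker_id)
    return lacking


# Invented example data: speaker A meets the requirements, speaker B does not.
example_sessions = [
    Session("A", date(1994, 3, 1), "office", "handset"),
    Session("A", date(1994, 3, 8), "street", "headset"),
    Session("B", date(1994, 3, 1), "office", "handset"),
]
needs_more_recordings = speakers_lacking_variation(example_sessions)
```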
How are the speakers for a speech corpus selected? Again, this strongly depends on the application one has in mind. For the development of a speech synthesis system experienced speakers, such as news readers or actors, are most appropriate. For the training and testing of recognition systems, on the other hand, the population of interest must be suitably sampled. There is no general agreement on the exact meaning of `suitable' in this context. One definition would amount to random sampling of the population of interest. This operationalisation usually results in different numbers of samples from the subpopulations in the population of interest. For example, when the total population of army personnel is sampled, the subpopulation of women is likely to be poorly represented. In the case of the training and testing of a recognition system for the army, this female under-representation might seem to be acceptable, because the recogniser would have to deal mainly with male speakers. However, it may turn out that some of the influential heavy-duty users are women, in which case the recogniser had better be designed to handle the few but important women with the same performance as men. In general, random sampling has the potential drawback that extremely large numbers of samples are needed to ensure that rare, but nevertheless important phenomena are included. When, where, and why rare phenomena may still be important depends on the application for which the corpus is collected. In the case of fundamental research, on the other hand, the aim is often to compare subpopulations in some respect, and then it would be more appropriate to draw an equal number of samples from all subpopulations of interest. Uniform sampling of all subpopulations of interest guarantees that all relevant variation is included in the corpus with the smallest possible number of speakers.
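The difference between the two sampling strategies can be made concrete with a small sketch. Everything in it is an assumption made for illustration: the function names, the invented population of 950 men and 50 women, and the sample sizes. Random sampling draws from the whole population, so the composition of the sample follows the population proportions on average; uniform sampling draws the same number of speakers from every subpopulation.

```python
import random
from collections import Counter, defaultdict


def random_sample(population, n, rng):
    """Draw n speakers at random from the whole population."""
    return rng.sample(population, n)


def uniform_sample(population, key, per_group, rng):
    """Draw the same number of speakers from every subpopulation.

    key maps a speaker to the label of the subpopulation it belongs to.
    """
    groups = defaultdict(list)
    for speaker in population:
        groups[key(speaker)].append(speaker)

    sample = []
    for label in sorted(groups):
        members = groups[label]
        if len(members) < per_group:
            raise ValueError(f"subpopulation {label!r} has fewer than {per_group} speakers")
        sample.extend(rng.sample(members, per_group))
    return sample


# Invented population: 950 men and 50 women, stored as (id, sex) pairs.
population = [(i, "male") for i in range(950)] + [(i, "female") for i in range(950, 1000)]
rng = random.Random(0)  # fixed seed so the sketch is repeatable

random_counts = Counter(sex for _, sex in random_sample(population, 40, rng))
uniform_counts = Counter(
    sex for _, sex in uniform_sample(population, key=lambda s: s[1], per_group=20, rng=rng)
)
```

With the uniform strategy both sexes contribute 20 speakers each, whereas the random sample of 40 is expected to contain only about two women.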
The application for which the speech corpus is collected not only determines the best sampling strategy, but it also influences the choice of speakers. For example, speech processing often involves spectral analysis of the recorded speech. Several analysis techniques, such as pitch extraction or formant extraction, are less accurate for high-pitched voices (women and children) than for low-pitched voices (men). If such analysis techniques are used and the sex of the speakers is of no concern for the research goal, it would thus be sensible to select only men for the speech corpus. In general, however, it is recommended to include all possible types of speakers in a speech corpus, unless there are imperative arguments to exclude specific speaker groups. Specifically, it is strongly recommended to include equal numbers of females and males in each corpus. Speaker characteristics that are potentially important, and should therefore be considered when selecting the speaker population, are described and discussed below.
The many speaker characteristics that may influence the speech signal can be divided into two main classes: relatively stable characteristics, and transient (temporary) characteristics. Stable speaker characteristics comprise on the one hand physiological and anatomical factors such as sex, age, weight, height, smoking/drinking habits, and possible pathologies, and on the other hand geographical and sociolinguistic factors. Transient (temporary) speaker characteristics cover factors such as a cold, or other mild afflictions of the speech organs, general physical condition (dependent on, for instance, the number of hours of sleep during the previous night), stress, and emotional state. Whereas transient speaker characteristics are very difficult to control, stable speaker characteristics are easier to take into account in the design of the speech corpus. For an overview of several important stable speaker characteristics, we refer to Scherer and Giles (1979). The most important stable speaker characteristics will be mentioned below.
Demographic factors form a very important set of relatively stable speaker characteristics which must be considered when designing sampling procedures for a corpus collection project. Each corpus should have sufficient demographic coverage. However, it is not always possible to determine all potentially relevant demographic factors a priori. Nor is the distribution of all factors in the total population always known. It is likely that the availability of detailed and reliable demographic data differs between the European countries. The availability of the data in less developed countries is even more questionable. In selecting speakers for inclusion in a corpus, the ability to assess certain characteristics depends on the recording protocol. If randomly selected speakers are recorded over the telephone, many personal characteristics cannot reliably be collected: self-report from the speaker is the only means of gathering the data.
Sex is known to have an enormous impact on speech quality. It is not well known at what age sex-related speech characteristics become prevalent. There is some evidence that sex-related speech characteristics are only partly due to physiological and anatomical differences between the sexes; cultural factors and sex role stereotypes also play an important role. Therefore, it is possible that the age at which sex-related differences become apparent differs between cultures and therefore between languages. For general information on sex-related speech characteristics, see Smith (1979), Coates (1986), Philips, Stelle and Tanz (1987), and Brouwer and Haan (1987). For the time being, no definitive recommendations can be given with respect to the age above which sexes should be distinguished and sampled individually. Unless the contrary can be motivated from the specific application the corpus is collected for, each corpus should comprise approximately equal numbers of speakers of both sexes. For some applications, recordings of young children may also be required. Children should be considered as a `third sex', independent of adolescent or adult females and males. Speaker sex is known or suspected to affect at least four aspects of speech behaviour.
Women are reported to tend more towards a breathy voice quality than men. It is not known whether this tendency is related to physiological and anatomical factors or whether it is mainly due to culturally determined sex role stereotypes. The overall steeper spectral slope causes problems for some parametric signal processing techniques (e.g. formant extraction).
Women are reported to adhere more to standard pronunciation than men. It is not known whether this finding, made for (American) English and Dutch, generalises to other languages. It remains to be seen whether sex-related pronunciation variation is best modelled and described on the level of phonemic representations of words or on the level of the phonetic implementation of what is essentially the same phonemic form. Pending the results of further research in additional languages and cultures, this factor probably does not deserve great weight in corpus design. Moreover, this aspect is very difficult, if not impossible, to separate from other sex-related factors, and will therefore be duly represented as long as the sexes are adequately represented in the corpus. Variation in pronunciation accuracy may also be caused by factors related to age and social status.
Sex-related differences in vocabulary and syntax are certainly culturally determined. Here, the factor sex interacts with factors like age and social status. Differences on the level of vocabulary and syntax are only relevant when spontaneous speech is being recorded. If all speech material consists of read utterances, vocabulary and syntax are completely determined by the prompting material.
Although the impact of speaker age on speech behaviour has not received much attention in previous research, there are indications that age influences at least two aspects of speech behaviour (Helfrich 1979).
Here the considerations described above in the paragraph on the impact of sex on speech behaviour apply in exactly the same way. There is some literature suggesting that the vocabulary and syntax of the older generation differ from those of younger speakers, but apart from the obvious observation that the subjects spontaneously discussed by senior citizens tend to differ, there is little hard data to support the claim that age is a more important factor than, for instance, social group and education level.
As with speaker age, most research in the past has concentrated on the question whether people can estimate speaker weight or speaker height from speech recordings alone (Van Dommelen 1993). It appears that people are moderately successful in this task. It will be clear that weight and height of speakers are highly correlated. The exact signal characteristics that enable people to guess the speaker's weight and height are not known. In a sufficiently large sample of speakers, most weight/height groups will probably be represented.
Several investigations have shown that voice quality can change under the influence of smoking or the use of alcohol (Gilbert and Weismer 1974). One of the most common consequences of smoking and drinking is premature ageing of the mucous membrane covering the vocalis muscle, resulting in a hoarse voice quality. Excessive drinking may eventually result in brain damage, which may in turn lead to severe speech disorders. The use of drugs can have a similar effect. In those cases it would be more appropriate to speak of pathological speech.
For many purposes it is most appropriate to build speech corpora
with a large variety of speakers. However, the speaker variability should be
kept within reasonable
bounds. Severely
pathological speech will, in general, deviate
substantially from `normal' speech and thus it is usually not desirable to
include this type of speech in a normal speech corpus. On the other hand,
speakers with mild
pathological disorders, such as
hoarseness, can be included in, for
instance, speech corpora designed for recognition.
Of course, research might focus specifically on pathological
speech, for instance when a recogniser is developed for use as an
environmental control device for handicapped persons.
In that case
pathological speech should of
course be amply represented in the speech
corpus.
Pathological
speech should also be present in a corpus designed to cover as much speaker variation
as possible (a kind of `all-purpose' speech corpus). A more elaborate discussion of
pathological
speech can be found in Perkins (1977) and Crystal (1980).
Professional speakers should be selected when recording very large corpora with very few speakers, for instance to develop text-to-speech systems. The major reason to prefer professional speakers for this purpose is their ability to keep pitch, intensity and speech rate constant, not only during one recording session, but also over several sessions, which may have to be scheduled on different days, perhaps spread over several weeks or even months. One possibly important drawback of using professional speakers must be emphasised: more often than not, professional speakers are not really representative of the `normal' speech behaviour in the community. If the corpus is collected for the development of a text-to-speech system this may not be a problem. However, linguistic and phonetic findings based on a corpus comprising only speech of a small number of highly trained professional speakers should not be generalised without extreme caution.
It is well known that both the regional and the sociolinguistic background of speakers can have a
large effect on their speech. People speak
differently dependent on the specific region(s) in which
they were brought up, and dependent on factors such as the linguistic background of the parents,
social status, and
education level. It is widely assumed that the high-school period
is most
decisive for the regional or dialectal colouring in one's
speech. Therefore it is strongly recommended to obtain
information about the high-school period when collecting data
about the speaker's background.
Dialectal speech or regional/dialectal colouring of the prestige
variant of a
language (like Received Pronunciation (RP) in British English or Hochdeutsch
in Germany) are known to be perhaps the most
important source of speaker-related variation. Not
all languages have a widely accepted and
well
documented pronunciation standard, like RP in English.
Given the enormous amount of literature on Dialectology one would
assume
that the impact of dialects on standard speech is well understood.
Unfortunately, this does not appear to be the case. Linguists and
dialectologists appear to disagree about the number of
major
dialects in a
language area, and about the boundaries between the areas where a specific
dialect is spoken. Moreover, the majority of the dialect studies were
based on
written
questionnaires. Although there are large amounts of recorded dialectal
speech stored in the national Dialectology institutes,
these recordings do not
qualify as corpora, because they exist only on analogue
tapes, with little
or no
detailed annotation.
In collecting new corpora the factor
regional/dialectal colouring should
be properly accounted for. However, since the basic data to determine
number
of dialects and dialect boundaries are difficult to obtain and probably
not
always reliable, it is recommended that dialect is operationalised by
geographic
region. If necessary, processing of the corpus data can yield post hoc data on
dialect differences. However, it has turned out that post hoc
determination of the dialect background of a speaker as part of the
transliteration/transcription process poses great difficulties.
There is one additional factor which complicates the procedures for sampling
dialectal influence, viz. the increasing mobility of the population. It is
acknowledged that the impact of mobility is different between
language
areas
and between countries. However, in sampling for a number of large telephone
speech corpora in the U.S.A. (POLYPHONE; Voice Across America) a special
`dialect' called Army Brat was defined, for
those speakers who had lived for short periods of time in many
different parts of the country.
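Operationalising dialect by geographic region, with a separate label for highly mobile speakers, could look like the sketch below. It is only an illustration under assumptions: the function name, the label strings and the small place-to-region table are hypothetical, and the decision to base the label on the places where the speaker lived during the high-school period follows the recommendation given earlier rather than any existing corpus protocol.

```python
from typing import Dict, Sequence


def region_label(
    high_school_places: Sequence[str],
    place_to_region: Dict[str, str],
    mobile_label: str = "mobile",
    unknown_label: str = "unknown",
) -> str:
    """Assign one region label to a speaker.

    high_school_places lists the places where the speaker lived during the
    high-school period; place_to_region is a lookup table that the corpus
    designer must supply (hypothetical here).
    """
    regions = {place_to_region[p] for p in high_school_places if p in place_to_region}
    if not regions:
        return unknown_label
    if len(regions) == 1:
        return next(iter(regions))
    # Speakers who lived in several regions get a category of their own,
    # comparable in spirit to the `Army Brat' label mentioned above.
    return mobile_label


# Invented lookup table for illustration only.
example_regions = {"Groningen": "north", "Leeuwarden": "north", "Maastricht": "south"}

label_stable = region_label(["Groningen", "Leeuwarden"], example_regions)
label_mobile = region_label(["Groningen", "Maastricht"], example_regions)
```

Keeping the raw place names in the speaker database, and deriving the region label from them, leaves room to redraw the region boundaries later without contacting the speakers again.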
It should be noted that the factor dialect not only affects pronunciation.
More often than not, its impact on vocabulary and perhaps also on syntax is
at least as
important. Of course, the impact on vocabulary etc. can only come
to light in corpus collection paradigms which allow the speaker to select
his own words. In corpora comprising only read speech this factor should
have no effect.
Sociolects can be regarded as dialects spoken by a particular
social class.
A clear distinction between different social classes exists, for instance, in India, where each member of
the society belongs to a
specific caste. However, in most cultures it is very difficult to distinguish
between social classes.
The division into three categories (lower class, middle class and upper class) seems to be most widely
accepted
for Western cultures. Elaborate
schemes have been designed to
determine a
person's social class using factors
such as education, profession, and income. In addition to social class membership a person's
sociolect is, of
course, also influenced by the
linguistic background of the parents and the dialect regions
in which he grew
up. As is the case for dialects,
sociolects may influence not only pronunciation, but also
syntax and vocabulary.
It is recommended that sociolects should be properly accounted for when
collecting new corpora.
It has been found that the impact of sociolects on speech
behaviour
strongly interacts with speaking style.
Thus, the speech of a pipe-fitter who speaks in a formal way,
may resemble the speech of a salesman who speaks in a casual way (cf.
Labov 1972). This phenomenon
probably also applies to regional dialects.
There is considerable uncertainty on how to treat dialects,
regiolects and sociolects in corpora collected to develop
speech
technology, e.g. to develop connected speech recognition systems
for use in telephone information systems. There may be large differences between countries
and cultures in what is most appropriate in this respect. Of
course, each operational
recognition system should be able to handle the range of dialectal and
regiolectal influences present in the speech
of
upper and middle class speakers produced in somewhat formal
situations. The extent
to which dialectal influences occurring in
less formal speech, or in formal speech of lower class speakers must also be covered will
depend very much on the application for which the recogniser is being developed. Another
extremely important factor is the social acceptability of using strongly dialectal
speech in a given situation. Acceptance is likely
to
differ strongly between regions in a given country.
If telephone applications are
designed in such a way that all calls originating from a specific
part of the country are handled in a local centre, one may envisage recognition systems which
are
adapted to the local regiolect/dialect, provided that
suitable
training corpora can be collected. When collecting speech corpora over the telephone by
soliciting input from randomly selected subjects one should specify strict guidelines for deciding
whether or not a specific speaker deviates too much from
the `standard' language for him
to be included in the corpus.
The speech of non-native speakers can be regarded as a special `sociolect'.
Some non-native speakers may speak the standard
language of the country they reside in
with only a slight accent, whereas others may speak the
standard language with a very
marked accent and/or a poor control over vocabulary and grammar. There
seems to be no reason to exclude the former
group of non-native
speakers from a common speech corpus,
whereas the latter group of non-native speakers would preferably be excluded, unless the
research is specifically aimed at non-native speech or one wants to build an `all-purpose'
speech corpus.