A spoken language corpus is defined as any collection of speech recordings which is
accessible in computer-readable form and which comes
with annotation and documentation
sufficient to allow re-use of the data
in-house, or by scientists in other organisations.
This tentative definition
excludes a large number of speech recordings on analogue tapes
(sometimes even on disks) and recordings
without the annotation and documentation that are
necessary to use the
recordings effectively. For
instance, it is well known that virtually all public
broadcasting corporations in Europe maintain an archive of recordings of
programmes, including newscasts and reports of events ranging from football
matches to royal weddings and funerals. However, in
most cases these recordings can only be accessed
by the date of the original broadcast, and
perhaps also by the type of programme. Only in very rare cases are transcripts of
the speech material in the recordings available. This makes it extremely
difficult and time-consuming to use these data for any type of research. Of
course, this does not diminish the value of these recordings for cultural and
scientific purposes, but because of the inordinate amount of pre-processing
that any type of research would require,
they do not qualify as a spoken language corpus under our definition.
In many other respects our definition is very wide and liberal. For instance, a
set of computer files containing speech signals, EMG signals, and sub- and supraglottal pressure signals measured in two subjects who sustained vowels
at different pitch and intensity levels would qualify as a spoken
language corpus,
provided that the files come with suitable annotation and
documentation.
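The definition can be read as a conjunction of three criteria. The following is only a minimal sketch of that reading; the class name, the field names, and the two example collections are hypothetical and introduced purely for illustration. In particular, the broadcast archive is assumed here to be digitised, so that it fails on annotation alone.

```python
from dataclasses import dataclass


@dataclass
class Collection:
    """Hypothetical description of a set of speech recordings."""
    name: str
    computer_readable: bool   # stored as computer files, not on analogue tape
    has_annotation: bool      # e.g. transcripts or labels accompany the recordings
    has_documentation: bool   # enough description to allow re-use by others


def is_spoken_language_corpus(c: Collection) -> bool:
    """Apply the tentative definition: all three criteria must hold."""
    return c.computer_readable and c.has_annotation and c.has_documentation


# Assumption: the archive is digitised but indexed only by broadcast date.
broadcast_archive = Collection(
    name="broadcast archive indexed by date",
    computer_readable=True,
    has_annotation=False,
    has_documentation=False,
)

# The physiological data set from the text, assumed to be properly
# annotated and documented.
vowel_study = Collection(
    name="speech, EMG and pressure signals from two subjects",
    computer_readable=True,
    has_annotation=True,
    has_documentation=True,
)

for c in (broadcast_archive, vowel_study):
    print(c.name, "->", is_spoken_language_corpus(c))
```

Under these assumptions the archive does not qualify, while the small physiological data set does, which mirrors the two examples given above.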
To demarcate the topic of spoken language corpora more precisely, we will first discuss the main differences between written
language corpora and spoken language corpora.
In this chapter only the design of spoken language corpora and the use of these corpora are covered. It is expressly not the intention of this chapter to give a comprehensive overview of corpora existing worldwide; we do not even intend to give a comprehensive list of all corpora existing in Europe. For an attempt to survey existing speech corpora the reader is referred to Fourcin et al. (1989)
and to Appendix E, which contains a
list of existing public domain spoken language corpora. The most important
reason for not repeating Fourcin's exercise is our conviction that
many corpora exist which cannot be identified, and whose identification
would serve practically no purpose, because they are company proprietary. Corpora,
tools, and resources in general are not ends in themselves, but means to
an independently specified purpose. Thus, the eventual specification of a corpus
depends in an essential way on the purpose it is intended to serve. Yet, if
that purpose is not too limited and provided the corpus is properly documented
and annotated, it is quite likely that it will be useful for other, perhaps
unrelated, research. For the moment there are few, if any, official standards
for corpus development. Given the dependence on the research goal, this is not
surprising. For that reason, the
recommendations in this chapter concern
general aspects and factors that should be considered in designing a corpus,
and guidelines for making decisions on these issues.
In the development of a speech corpus, three phases can be distinguished.
In the pre-recording phase one has to define the application of the corpus.
Furthermore, specifications of experiment design, of linguistic content, of
number and type of speakers, and of the physical situation must be
established. These topics will be covered in this chapter. In the
recording phase speaker instruction and prompting, experiment and recording
control, as well as storage are involved. These topics will be covered in
chapter 4. In the post-recording phase transcription, corpus lexicon
construction, labelling, and database management take place. These topics
will be discussed in chapter 5.
In the remainder of this chapter we will focus on
the pre-recording phase. More specifically, we want to discuss the
following steps in the gathering of a speech
corpus:
Before we embark on these discussions, however, it is necessary to elaborate on the differences between spoken language corpora and written language corpora.