
Introduction

A spoken language corpus is defined as any collection of speech recordings which is accessible in computer-readable form and which comes with annotation and documentation sufficient to allow re-use of the data in-house, or by scientists in other organisations. This tentative definition excludes a large number of speech recordings on analogue tapes (sometimes even on disks), as well as recordings that lack the annotation and documentation necessary to use them effectively. For instance, it is well known that virtually all public broadcasting corporations in Europe maintain an archive of recordings of programmes, including newscasts and reports of events ranging from football matches to royal weddings and funerals. However, in most cases these recordings can only be accessed by the date of the original broadcast, and perhaps also by the type of programme. Only in very rare cases are transcripts of the speech material in the recordings available. This makes it extremely difficult and time-consuming to use these data for any type of research. Of course, this does not diminish the value of these recordings for cultural and scientific purposes, but because of the inordinate amount of pre-processing they would need before any research could be done, they do not qualify as a spoken language corpus under our definition.
In many other respects our definition is very wide and liberal. For instance, a set of computer files containing speech signals, EMG signals, and sub- and supraglottal pressure signals measured in two subjects who sustained vowels at different pitch and intensity levels would qualify as a spoken language corpus, provided that the files come with suitable annotation and documentation.
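The definition above can be read as three conditions that must all hold. The following is a minimal sketch of that reading, not part of the original text; the class name, field names, and the two example collections are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class RecordingCollection:
    """Hypothetical description of a collection of speech recordings."""
    name: str
    computer_readable: bool   # False for analogue tapes, for example
    has_annotation: bool      # e.g. transcripts of the speech material
    has_documentation: bool   # sufficient to allow re-use by others


def qualifies_as_corpus(collection: RecordingCollection) -> bool:
    """Return True only if all three conditions of the definition hold."""
    return (
        collection.computer_readable
        and collection.has_annotation
        and collection.has_documentation
    )


# Assumed example values: an archive indexed only by broadcast date,
# and an annotated, documented set of physiological measurement files.
broadcast_archive = RecordingCollection(
    "broadcast archive", computer_readable=False,
    has_annotation=False, has_documentation=False,
)
vowel_study = RecordingCollection(
    "sustained vowel study", computer_readable=True,
    has_annotation=True, has_documentation=True,
)

print(qualifies_as_corpus(broadcast_archive))  # False
print(qualifies_as_corpus(vowel_study))        # True
```

Note that the size of a collection and the kind of signals it contains play no role in the check, which reflects how liberal the definition is in those respects.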
To demarcate the topic of spoken language corpora more precisely, we will first discuss the main differences between written language corpora and spoken language corpora.

About this chapter

This chapter covers only the design of spoken language corpora and the use of these corpora. It is explicitly not the intention of this chapter to give a comprehensive overview of corpora existing worldwide; we do not even intend to give a comprehensive list of all corpora existing in Europe. For an attempt to survey existing speech corpora the reader is referred to Fourcin et al. (1989) and to Appendix E, which contains a list of existing public domain spoken language corpora. The most important reason to refrain from repeating Fourcin's exercise is our conviction that many corpora exist which cannot be identified, and which it would make practically no sense to identify, because they are company proprietary.
Corpora, tools, and resources in general are not ends in themselves, but means to an independently specified purpose. Thus, the eventual specification of a corpus depends in an essential way on the purpose it is intended to serve. Yet, if that purpose is not too limited, and provided the corpus is properly documented and annotated, it is quite likely that it will be useful for other, perhaps unrelated research. For the moment there are few, if any, official standards for corpus development. Given the dependence on the research goal, this is not surprising. For that reason, the recommendations in this chapter concern general aspects and factors that should be considered in designing a corpus, and guidelines for making decisions on these issues.
In the development of a speech corpus, three phases can be distinguished. In the pre-recording phase one has to define the application of the corpus. Furthermore, specifications of experiment design, of linguistic content, of number and type of speakers, and of the physical situation must be established. These topics will be covered in this chapter. In the recording phase speaker instruction and prompting, experiment and recording control, as well as storage are involved. These topics will be covered in chapter 4. In the post-recording phase transcription, corpus lexicon construction, labelling, and database management take place. These topics will be discussed in chapter 5.
In the remainder of this chapter we will focus on the pre-recording phase. More specifically, we discuss the following steps in gathering a speech corpus:

Defining the application of the corpus.
Specifying the linguistic content of the corpus.
Specifying the number and type of speakers.
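The three steps listed above can be thought of as fields in a design specification that must be filled in before recording starts. The sketch below is one possible way to track this, under the assumption that each step yields a simple record; all class names, field names, and example values are hypothetical and do not come from this chapter.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class SpeakerSpec:
    """Hypothetical record of the number and type of speakers."""
    number: int
    types: List[str]  # free-form labels, e.g. "adult female"


@dataclass
class CorpusDesign:
    """Hypothetical pre-recording specification; None marks an open decision."""
    application: Optional[str] = None
    linguistic_content: Optional[str] = None
    speakers: Optional[SpeakerSpec] = None

    def open_decisions(self) -> List[str]:
        """List the pre-recording steps that have not yet been specified."""
        steps = [
            ("application", self.application),
            ("linguistic_content", self.linguistic_content),
            ("speakers", self.speakers),
        ]
        return [name for name, value in steps if value is None]


# Only the application has been defined so far (illustrative value).
design = CorpusDesign(application="training a speech recogniser")
print(design.open_decisions())  # ['linguistic_content', 'speakers']
```

Placing the application first mirrors the point made earlier that the specification of a corpus depends on the purpose it is intended to serve.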

Before we embark on these discussions, however, it is necessary to elaborate on the differences between spoken language corpora and written language corpora.


WWW Administrator
Fri May 19 11:53:36 MET DST 1995