
Data collection dimensions

Data collection can be described along the following dimensions, each characterised by its two extreme positions:

Visibility: open -- secret
Environment: studio -- on location
Communication situation: face to face -- isolated
Communication mode: dialogue -- prompted
Verification: online -- offline
Data: single channel -- multi-channel

These dimensions are not independent. For example, a secret (or candid microphone) recording is not possible in a studio environment, nor is it possible to make electromagnetic articulography measurements outside a specially equipped studio.

Visibility: Open vs. Secret

It has long been a concern that people who know that their speech is being recorded change their speech behaviour. For that reason some scientists have advocated recording paradigms using `candid microphones'. Another concern, at least as important, is that speech behaviour is strongly influenced by the social situation in which the speaker performs. Speech produced in a court room, for instance, tends to be more formal in vocabulary, syntax and phonetics than speech produced in one's living room. Also, read speech, especially when the linguistic material consists of isolated words, elicits a more formal and standard pronunciation than free conversation (cf. Labov 1972).

To limit the effect of the speaker's awareness of being recorded, one can hide the microphone and the recorder. Provided that nothing else in the environment alerts the speaker to something unusual, and provided that nothing else in the situation is unusual, `candid microphone' speech should be maximally natural. On the other hand, many scientists have found that almost all speakers forget fairly quickly that they are being recorded (say, within five minutes) if they can be engaged in a task that requires cognitive activity different from reading aloud. Therefore, except for very short recording sessions, hidden microphones might not yield a big advantage. The question thus arises whether the possible advantage of slightly better naturalness outweighs the disadvantages, the most important of which is probably the risk of lower recording quality, because on-line monitoring of the recording level is extremely difficult. Another possible drawback of hidden microphones is that much effort may be spent on recording speakers who, after debriefing, refuse to consent to the use of the recordings, even for purely scientific research.

Nevertheless, a substantial part of the British National Corpus was recorded in situations where at least some participants in a conversation were not aware that they were being recorded (cf. Crowdy 1993). As far as we know, only a very small number of persons demanded that the recordings in which they participated be erased. It is not known whether the material in the BNC has been analysed to investigate possible differences in recording quality and vocabulary between secret and open recordings.

One obvious way to make sure that the speaker cannot know that (s)he is being recorded is to tape telephone speech. But here again, publishing these recordings as part of a corpus requires that the speaker be debriefed and asked for consent to keep the recordings.

Environment: Studio vs. On Location

Recording in a studio

Most of the older speech corpora have been recorded in sound studios. Studio recordings have the drawback that most subjects do not feel at home in that environment, with all possible impacts on their speech behaviour. However, as long as the speech to be elicited consists of lists of words, words embedded in carrier phrases, and the like, the `abnormality' due to the unusual reading tasks may outweigh the contribution of the unusual situation.
Studio recordings have the advantage of a superior signal-to-noise ratio, thanks to the excellent acoustic environment and, at least as importantly, the possibility to closely monitor recording levels and the distance of the speaker to the microphone, and to use superior but volatile condenser microphones, etc.

One must be aware, however, that `studio' is not a well-defined concept. Not all rooms called studios have good acoustic properties. It is not at all unusual to find rooms which do have relatively low ambient sound levels, but at the cost of very long or extremely short reverberation times. If studios are used to record large corpora, room acoustics calibration data should be provided with the speech recordings. To that end the procedures proposed in the guidelines for recording the EUROM-1 corpus can be used (see Appendix C): record standard pulsatile noises, e.g. those caused by bursting balloons inflated to standard dimensions. From these calibration recordings the reverberation time and the overall frequency response of the room are easy to determine. It is important that the noise source is located at approximately the same point in the room where the speaker's head is during the recordings.
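As an illustration of how a reverberation time might be derived from such a calibration recording, the following minimal sketch applies Schroeder backward integration to a recorded burst and extrapolates the decay slope to 60 dB. This method is a common choice and is assumed here; it is not taken from the EUROM-1 guidelines. The function name, the fit range and the requirement that the recording starts at the burst are likewise assumptions made for the example.

```python
import numpy as np

def estimate_rt60(burst, fs, fit_range_db=(-5.0, -25.0)):
    """Estimate reverberation time (s) from a recorded burst sampled at fs Hz.

    Assumes the array starts at the burst onset and that background noise
    is well below the part of the decay used for the fit.
    """
    energy = np.asarray(burst, dtype=float) ** 2
    # Schroeder backward integration: energy remaining after each instant.
    edc = np.cumsum(energy[::-1])[::-1]
    edc = np.maximum(edc, np.finfo(float).tiny)  # avoid log(0) in silent tails
    edc_db = 10.0 * np.log10(edc / edc[0])
    upper, lower = fit_range_db
    idx = np.where((edc_db <= upper) & (edc_db >= lower))[0]
    # Straight-line fit to the decay curve; slope is in dB per second.
    slope, _ = np.polyfit(idx / fs, edc_db[idx], 1)
    return -60.0 / slope

def synthetic_decay(rt60, fs, seconds=1.0, seed=0):
    """Exponentially decaying noise, used here only as stand-in test data."""
    t = np.arange(int(seconds * fs)) / fs
    rng = np.random.default_rng(seed)
    return rng.standard_normal(t.size) * 10.0 ** (-3.0 * t / rt60)
```

With a real balloon recording one would first trim the signal to the burst onset; the synthetic generator above only serves to check that the estimator recovers a known decay.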

Many (small) corpora designed for basic speech research have been recorded in rooms which were not designed as audio studios. This is the case with most simultaneous recordings of speech and EMG signals, which are typically made in research labs of hospitals. More often than not these rooms have had only limited treatment to provide acceptable room acoustics.

For many applications high-quality speech recordings from one speaker at a time are required. These recordings should be free of background noise, including noises made by the speakers themselves. The following guidelines apply specifically to the common situation that speakers are recorded one at a time in a sound studio:

Recording on location

Corpora recorded in the field have the advantage that the speaker is acting in an ecologically realistic environment. In most cases the price to be paid for this advantage is a substantial loss in signal-to-noise ratio, because of high ambient noise levels, limited possibilities to monitor recording levels, distance to the microphone, etc., or both. If ecological realism dictates recordings in the field, one should nevertheless plan for conditions which allow an audio engineer to monitor the procedure.

Complete ecological validity may not be feasible. For instance, recording a speech corpus in a running car cannot safely be accomplished if the speaker is in the driver's seat.

Two important classes of recordings on location are recordings made in actual applications and recordings made over the telephone.

Recording speech in actual applications (which are based on speech input) is one obvious way to obtain `realistic' speech data. At least in some countries it is not legally required to advise the user of the service that her or his speech is being recorded, as long as the recordings are only used for research purposes internal to the company which runs the service. This procedure is probably most often used in pilot versions of an application, where the number of users is limited, so that one may realistically hope to be able to process all recorded speech. Such recordings are necessary to systematically evaluate the success or failure of the speech input parts of an application by relating the speech recordings to the log of the use of the application.

Recording speech on the telephone (preferably digital, i.e. ISDN) is suitable for the gathering of limited amounts of speech material from a large number of speakers (POLYPHONE; Damhuis et al. 1994).

A possible drawback of telephone recordings is the limited bandwidth of the speech signal, typically between 300 Hz and 3000 Hz, which may pose problems for some kinds of basic speech research. For example, the absence of low-frequency components prevents a proper pitch analysis of the recorded speech with methods which rely on the presence of the fundamental frequency itself, although frequency-domain methods such as the `harmonic sieve' may still yield satisfactory results. The absence of high-frequency components prevents, for instance, the proper spectral analysis of consonants, especially fricatives. Apart from the limited bandwidth of the speech signal, telephone channels can also cause a substantial loss in signal-to-noise ratio, especially in the non-western countries where digital telephone systems are not yet commonplace. Even in modern digital telephone networks the signal-to-noise ratio suffers from the limited dynamic range that can be accommodated with 8 bit A-law (or mu-law) coded samples.
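The effect of companded 8 bit coding on signal-to-noise ratio can be illustrated with a small sketch. It uses the continuous mu-law formula with mu = 255 rather than the piecewise-linear tables of actual telephone codecs, and compares it with plain linear 8 bit quantisation. The sampling rate, the test tone and all function names are assumptions chosen for the illustration.

```python
import numpy as np

MU = 255.0  # companding constant assumed for this sketch

def _quantise(y, bits):
    """Uniform mid-rise quantiser on the interval [-1, 1]."""
    step = 2.0 / (2 ** bits)
    q = (np.floor(y / step) + 0.5) * step
    return np.clip(q, -1.0 + step / 2, 1.0 - step / 2)

def mu_law_roundtrip(x, bits=8):
    """Compress with continuous mu-law, quantise, then expand again."""
    y = np.sign(x) * np.log1p(MU * np.abs(x)) / np.log1p(MU)
    q = _quantise(y, bits)
    return np.sign(q) * np.expm1(np.abs(q) * np.log1p(MU)) / MU

def linear_roundtrip(x, bits=8):
    """Plain linear quantisation with the same number of bits."""
    return _quantise(x, bits)

def snr_db(clean, coded):
    """Signal-to-quantisation-noise ratio in dB."""
    noise = clean - coded
    return 10.0 * np.log10(np.sum(clean ** 2) / np.sum(noise ** 2))

fs = 8000                                   # assumed telephone sampling rate
t = np.arange(fs) / fs
loud = 0.9 * np.sin(2 * np.pi * 437.0 * t)  # near full scale
quiet = 0.01 * loud                         # 40 dB lower
for name, sig in (("loud", loud), ("quiet", quiet)):
    print(name,
          "mu-law: %.1f dB" % snr_db(sig, mu_law_roundtrip(sig)),
          "linear: %.1f dB" % snr_db(sig, linear_roundtrip(sig)))
```

Companding keeps the quantisation noise roughly proportional to the signal level, so quiet passages fare much better than with linear 8 bit coding, but the ratio stays far below what 16 bit studio recordings offer.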

If these drawbacks are not important for the research goal, recording of telephone speech appears to be a simple way to collect a large amount of speech data in a very short time. For some applications, such as the training and testing of a telephone speech recogniser, a speech corpus with telephone recordings is of course indispensable. It should be emphasised that telephone speech is suitable for many linguistic research projects, including research into most aspects of dialects and regional language variants as well as all aspects related to spoken language syntax and vocabulary.

Communication Situation

In this section I want to compare face-to-face and isolated communication situations with each other. Is this a sensible thing to do?

Input requested here!

Communication Mode

In this section I want to compare dialogue and prompted interview with each other. The following text is the one by Els den Os from the chapter previously known as 2.5. I am currently modifying it.

Text den Os

When making studio (or office) recordings, the text to be recorded can in principle be presented both in written and in auditory form. Most researchers seem to have preferred written prompts, mainly because it was thought that spoken prompts would lead the speakers to adopt the same pronunciation, loudness, speech rate, etc. as the speaker who prepared the prompting material. There is substantial literature showing that speakers adapt their speech behaviour to the person(s) they are speaking to. However, the impact of this adaptation effect is not well known and only partly understood. It must be expected that adaptation depends on many factors that are difficult to control and that all have to do with the social relation between the partners in the conversation. For this reason the Edinburgh MAP corpus (H. Thompson), collected by the HCRC of Edinburgh University, included dialogues with two familiar and two unfamiliar speakers.

The simplest written prompts are printed texts containing instructions and/or stimulus material for the speakers. When the recordings are made in a studio, written prompts can also be controlled by a computer. A program presents all the necessary instructions to the speaker, such as `Pronounce the following sentence...', on a terminal screen. The advantage of displaying prompts on a terminal screen is that the entire recording procedure is organised in a neat way. Some kind of verification procedure can be included to limit the number of errors. The timing of the prompts can also be controlled by the computer. This may to some extent enhance the uniformity in speech rate between different speakers. On the other hand, more natural speech may be obtained if speakers are allowed to establish their own pace.

When recording telephone speech, written prompts cannot always be used, because the addresses of the speakers (needed to send them a prompt sheet) may be unknown.

We know of no reports showing that isolated word speech recognition systems trained with speech elicited by asking callers to repeat the prompts perform worse than devices trained with words prompted in other ways.

It has been the experience of many corpus projects that it is very difficult to design unobtrusive questions which have a very high likelihood of prompting specific answers; in a way, the two aspects are contradictory: the higher the likelihood that there is no other answer than the intended one, the more `stupid' the question seems to be. On the other hand, asking questions, even stupid ones, comes much closer to the situation a caller will find himself in when using a Voice Response System than asking a caller to repeat words or phrases verbatim.

Verification

On-line verification

The following text is that of Els den Os. I am currently rewriting it, but I will need further input. Input requested here!

Text den Os

Many of the older speech corpora recorded in a studio environment and consisting of read speech only have employed a recording paradigm where a trained experimenter monitored all productions for exact conformance with the prompting text. If a deviation from that text was noted, the speaker was requested to repeat the item until it was correctly produced. Within this on-line monitoring paradigm two sub-paradigms can be distinguished: one in which each error is signalled to the computer by the experimenter without interrupting the speaker (so that the erroneous items can be presented a second time after the completion of the item list); and one in which the speaker is immediately alerted that an error has occurred so that the last item can be repeated. With direct recording to disk file (especially when individual files are planned for each item) both techniques can save an enormous amount of post-processing time, compared with a protocol in which the speech is recorded on a sequential medium (irrespective of the analogue or digital recording technique).
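The two sub-paradigms differ only in where a rejected item re-enters the prompt list, which the following minimal sketch tries to make explicit. The function `run_session`, its parameters and the `record` callback (which stands for prompting, recording and the experimenter's accept/reject decision) are hypothetical names introduced for illustration; they do not describe any existing recording workstation.

```python
from collections import deque

def run_session(items, record, immediate=True, max_attempts=3):
    """Present each item until accepted or until max_attempts is reached.

    record(item) must prompt the speaker, make the recording and return
    True if the experimenter accepts the production.
    immediate=True  : a rejected item is repeated right away.
    immediate=False : a rejected item is presented again after the list.
    """
    accepted, failed = [], []
    queue = deque((item, 1) for item in items)
    while queue:
        item, attempt = queue.popleft()
        if record(item):
            accepted.append(item)
        elif attempt >= max_attempts:
            failed.append(item)                     # give up on this item
        elif immediate:
            queue.appendleft((item, attempt + 1))   # repeat at once
        else:
            queue.append((item, attempt + 1))       # defer to end of list
    return accepted, failed
```

For a list `["one", "two", "three"]` in which the first attempt at "two" is rejected, the presentation order is one, two, two, three with `immediate=True` and one, two, three, two with `immediate=False`.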

Therefore, if on-line verification is planned, use of a computer-controlled workstation that records directly to computer files is strongly recommended. One such workstation has been built in the ESPRIT SAM Project.

On-line verification is only possible with read speech. If the material to be recorded includes extemporaneous, free speech, it cannot be predicted beforehand what the speaker will say. In these cases there is no way around the time-consuming post hoc transliteration.

Off-line (or Post-hoc) verification

Post hoc transliteration was employed in collecting some (very) large corpora like Voice Across America and POLYPHONE. Characteristic for the recording protocol used in these corpus collection projects was that the recording workstation performed completely automatically and unattended. It has been suggested that the read items in these corpora might have been verified on-line by using an automatic speech recogniser running in verification mode, but except perhaps for an Italian corpus consisting of isolated utterances of a small vocabulary we know of no corpus collection projects that have indeed employed automatic verification and re-prompting in a fully automatic recording workstation.

On-line verification is the only practical paradigm that guarantees that the corpus will indeed contain exactly the items and the number of repetitions planned for during corpus design. The procedure has one very important limitation that should not be underestimated: it will yield a corpus which is (virtually) completely devoid of disfluencies, out-of-vocabulary words, coughs, sneezes, etc. Cleaned-up corpora of this type have misled engineers into thinking that speech recognisers had reached performance levels sufficient for actual applications. What they failed to realise, due to the absence of these phenomena from the training materials, was that in real life disfluencies etc. abound, and that these phenomena may be more important in determining the real-life performance of speech recognisers than the recognition error rate on a clean corpus. For this reason it is strongly recommended to use post hoc transliteration whenever that is possible. In making this recommendation it is acknowledged that recording disfluencies etc. makes no sense when recording speech material for carefully designed perception experiments.
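When post hoc transliteration is used, the match between the corpus design and what was actually recorded has to be checked afterwards. A minimal sketch of such a check is given below; the function name and the data layout (a dictionary of planned repetitions per item and a flat list of transliterated item labels) are assumptions made for the example.

```python
from collections import Counter

def coverage_report(planned, transliterated):
    """Compare transliterated items with the corpus design.

    planned        : dict mapping item label -> planned number of repetitions
    transliterated : iterable of item labels found in the recordings
    Returns (missing, surplus): repetitions still lacking per planned item,
    and repetitions beyond the plan (including unplanned items).
    """
    found = Counter(transliterated)
    missing = {item: n - found[item]
               for item, n in planned.items() if found[item] < n}
    surplus = {item: found[item] - planned.get(item, 0)
               for item in found if found[item] > planned.get(item, 0)}
    return missing, surplus
```

Unplanned labels, such as hesitations or out-of-vocabulary words, show up in the surplus part of the report rather than being discarded.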






WWW Administrator
Fri May 19 11:53:36 MET DST 1995