
Procedures

The following sections describe procedures which have been used in practical speech recordings.

Equipment

The basic equipment needed for speech recordings consists of a microphone, an amplifier or signal processor, and a recording device; these components are discussed in the following subsections.

The choice of equipment depends on the choices made along the dimensions of visibility, environment, and communication mode, and on the data to be recorded.

The quality of the recording channel itself (microphone and recording medium) is determined by three characteristics: signal-to-noise ratio, bandwidth, and dynamic range.
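
As an illustration only, the following sketch shows how two of these characteristics can be quantified from sampled data; the 6 dB-per-bit rule of thumb for the dynamic range of an ideal quantizer and the use of NumPy are assumptions of this example, not prescriptions of this handbook.

    import numpy as np

    def snr_db(signal, noise):
        """Signal-to-noise ratio in dB, estimated from a speech segment and a
        noise-only segment recorded over the same channel."""
        p_signal = np.mean(np.asarray(signal, dtype=float) ** 2)
        p_noise = np.mean(np.asarray(noise, dtype=float) ** 2)
        return 10.0 * np.log10(p_signal / p_noise)

    def dynamic_range_db(bits):
        """Theoretical dynamic range of an ideal linear quantizer
        (roughly 6 dB per bit; 16 bit -> approx. 96 dB)."""
        return 20.0 * np.log10(2.0 ** bits)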

For every speech recording a log or journal should be kept. It contains the essential administrative information about recording setup, personnel involved, speaker data, and recording time and date.

It is necessary to record at least the following data for a recording session: recording time and date, the recording engineer, and, for each recorded channel, the environment, the speaker, and the equipment used.

Recording time and date and the recording engineer are independent of the number of speakers or channels recorded. The environment, speakers, and equipment may differ from channel to channel and should therefore be written down separately for each channel. It is thus advisable to separate recording-dependent from channel-dependent data, and this separation should be made explicit in the layout of a form (or a database structure), as sketched below:
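
As an illustration only, such a two-level log could be structured as follows; the field names are assumptions, not a prescribed format.

    from dataclasses import dataclass, field
    from typing import List

    @dataclass
    class ChannelRecord:
        # Channel-dependent data: may differ for every recorded channel.
        channel_id: int
        speaker_id: str
        environment: str      # e.g. "studio", "office", "car"
        microphone: str       # make, type, and position relative to the speaker
        equipment_notes: str = ""

    @dataclass
    class SessionRecord:
        # Recording-dependent data: identical for all channels of one session.
        date: str             # e.g. "1995-05-19"
        time: str             # e.g. "11:53"
        engineer: str
        channels: List[ChannelRecord] = field(default_factory=list)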

Microphone

Microphones can be classified by their directional characteristics, their electrical transducer principle, and their position relative to the speaker's mouth.

A well-documented speech corpus should contain data about the microphone, such as make and type (condenser, dynamic, etc.), position of the microphone relative to the speaker or speaker's mouth, possible calibration  procedures, etc.

Unidirectional microphones are sensitive to the direction from which the sound arrives. They are preferred when a single speaker is recorded in a laboratory environment.

Omnidirectional microphones are not sensitive to the direction from which the sound arrives. They are suitable for recordings on location, or when the speaker is moving, e.g. working, walking, or driving a car. Omnidirectional microphones may be used to record several speakers if it is guaranteed that their turns will not overlap; otherwise multiple unidirectional microphones should be used (see ...).

Microphone arrays are a possible means of sound source location. Computer controlled microphone arrays can focus on a single sound source, thereby improving the effective signal-to-noise ratio dramatically (cf. Flanagan et al. 1991).
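
As an illustration of how an array can be focused on a single source, the following sketch implements simple delay-and-sum beamforming, one common array-processing technique; the handbook does not prescribe a particular method, and the integer-sample alignment used here is a simplification.

    import numpy as np

    def delay_and_sum(channels, mic_positions, source_position, fs, c=343.0):
        """Align each microphone signal on its propagation delay from an assumed
        source position and average the aligned signals.
        channels:        (n_mics, n_samples) synchronously sampled signals
        mic_positions:   (n_mics, 3) microphone coordinates in metres
        source_position: (3,) assumed source coordinates in metres
        fs: sample rate in Hz; c: speed of sound in m/s
        """
        channels = np.asarray(channels, dtype=float)
        dists = np.linalg.norm(np.asarray(mic_positions) - np.asarray(source_position), axis=1)
        delays = (dists - dists.min()) / c            # relative delays in seconds
        shifts = np.round(delays * fs).astype(int)    # integer-sample approximation
        aligned = [np.roll(ch, -s) for ch, s in zip(channels, shifts)]
        return np.mean(aligned, axis=0)               # coherent sum favours the focused source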

The electrical transducer principle is a second dimension along which microphones can be distinguished. For most purposes in speech research the differences between the transducer principles are not very important, with the exception of carbon button microphones. Carbon button microphones are used in older telephone handsets. They may distort the frequency response of the signal quite considerably. Moreover, their transmission properties may change significantly over time. Electret microphones are more stable than carbon button microphones. They can have an almost flat frequency response in the telephone bandwidth, at least in principle. The actual frequency response of a microphone depends primarily on the acoustic properties of the case in which it is encapsulated. Badly designed handsets can therefore have poor frequency response characteristics, regardless of the use of electret transducers. For basic research into the characteristics of the glottal sound source the phase distortion of the microphone is as important as its amplitude response. For virtually all other research and development purposes in speech, the phase response is immaterial.

Finally, the microphone position relative to the speaker's mouth can be used to distinguish types of microphones.

Headset microphones are usually attached to headphones via an arm. The position of the microphone relative to the articulatory tract is fixed, and the speaker is free to move the head. However, the microphone has to be positioned very carefully to avoid breath noise, and speakers often feel uncomfortable with a headset.

Close-up microphones are attached to the speaker's clothes, usually on the chest. The microphone does not disturb the speaker and it is quite close to the articulatory tract. However, the distance of the microphone from the mouth varies greatly with body movements, and new noise sources, e.g. the rustling of clothes, are introduced.

Table-top microphones are usually unidirectional microphones placed approx. half a meter away from a speaker. The microphone does not disturb the speaker, and its distance from the mouth varies only very little with body movements. However, with more than one speaker in a room there is little channel separation, and new noise sources, e.g. room echo, tapping on the table, or the movement of prompt sheets, are introduced.

Room microphones are omni-directional microphones placed at specified positions in a room, independent of the position of the speaker, and they can be hidden completely. However, there is little (if any) channel separation, and surrounding noises interfere with the speech signal.

Recommendations

Amplifier/Processor

The signal coming from a microphone must be amplified to be recorded. In many cases, some processing is also needed, e.g. analog to digital conversion, transformations for different encoding schemes, filtering to reduce noise, etc.

Some processing steps have to be performed only once, e.g. analog to digital conversion. Others will be performed repeatedly, e.g. the transformations for different encoding schemes.

Recommendations

Recording Device

Basically, there exist two types of recording devices: tape drives, and computers with hard disks. Recordings to tape are either analog (audio tapes, compact cassette, video tapes) or digital (DAT), whereas recordings to hard disk are always digital.

Ongoing development in the field of audio technology has shifted the emphasis away from analog recording media to digital recording media. The traditional recording medium has been the reel-to-reel magnetic tape. Apart from a relatively poor signal-to-noise ratio of typically 60--70 dB, this medium suffers from mechanical problems such as flutter and wow. Moreover, the quality of an analog speech recording degrades severely after it has been copied repeatedly. Because of these drawbacks, it is strongly recommended to use digital media for the recording of speech. The most widespread digital medium for recording of speech signals is the DAT (Digital Audio Tape), and this medium is strongly recommended. Recordings are made on two channels with a standard sampling frequency of 48 kHz and 16-bit resolution. Another option, which can only be used in a laboratory environment, is to record the speech directly on a high capacity computer disk. Two other digital audio media, the CD-ROM and the WORM (Write Once Read Many), are less suitable for speech recording because they cannot be erased: data (for instance, speech recordings) can be written to a CD-ROM or WORM only once; afterwards, the stored data can be read as many times as one likes (compare a gramophone disc). The CD-ROM and WORM are therefore especially useful for the permanent storage of selected recordings in a database.

The recording devices can be characterized according to the following criteria: portability, capacity, and ease of use.

The portability of a recording device is determined primarily by its size and weight, and secondarily by its operating requirements, e.g. power supply, environmental conditions, etc.

Tape drives, analog or digital, come in all sizes, including Walkman-sized DAT recorders. Usually, tape drives are optimized to record or play back signals, i.e. they do not produce much noise themselves. Portable tape drives usually have only a reduced set of features; they can operate on batteries and are quite immune to adverse environments (some are even water-resistant). Non-portable tape drives offer more features (e.g. remote control, manual setting of recording parameters, computer interfaces), require a permanent power supply, and operate in the usual office environments.

Computers too can be divided into portable and desktop computers. In general, they produce significant noise during operation (hard disk spinning, keyboard clicks, system alerts, etc.) and must thus be shielded from the signal to be recorded. Furthermore, sound cards in computers are subject to interference from other devices inside the computer, e.g. noise from the bus, the processor, etc. Portable computers are about the size of an A4 book and weigh approx. 2 kg. At present, only high-end portable computers are equipped with the signal processing facilities (e.g. signal processor, 16 bit quantization, sample rate > 8 kHz) required for speech recordings.

The capacity of tape drives is almost unlimited because full tapes can be replaced by empty ones quickly and at low cost. Typically, an analog compact cassette holds about 90 minutes of stereo signals, a video cassette up to four hours, a digital DAT tape up to two hours.

The capacity of computers for speech recording is mainly limited by the capacity of the hard disk. A 1 Gigabyte disk can store approx. 8 hours of mono signals (16 bit quantization, 16 kHz sample rate). Such disks are becoming common on many desktop computers and even in portable computers, so hard disks are already suitable recording devices for a great many speech recordings. The major limitation of recording to hard disk is that the disk cannot simply be exchanged for another one. This means that the data on a hard disk has to be saved to some backup medium, e.g. magnetic tape or CD-ROM.
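
These capacity figures follow directly from the raw data rate of uncompressed PCM audio; the following sketch reproduces the calculation (the figures are illustrative).

    def recording_hours(disk_bytes, sample_rate_hz, bits_per_sample, channels=1):
        """Hours of uncompressed PCM audio that fit on a disk of the given size."""
        bytes_per_second = sample_rate_hz * (bits_per_sample // 8) * channels
        return disk_bytes / bytes_per_second / 3600.0

    # The 1 Gigabyte / 8 hours figure from the text: 16 kHz, 16 bit, mono.
    print(recording_hours(10**9, 16000, 16))        # approx. 8.7 hours
    # A DAT-style stereo recording (48 kHz, 16 bit) fills the same disk six times faster.
    print(recording_hours(10**9, 48000, 16, 2))     # approx. 1.4 hours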

Ease of use must be seen under two aspects: first, the ease with which the device can be used to perform the recording; second, the ease with which the recorded data can be accessed for further processing.

Tape drives are easy to set up and speakers are used to them. However, especially for analog recordings, it is quite cumbersome to access recordings for further processing: the appropriate tapes have to be located and the tape drive has to be attached to a computer.

Computers as recording devices are still uncommon. They require the expertise of an engineer to be set up correctly, and speakers are easily distracted by the presence of a computer. However, computers offer significant advantages over tapes: recordings can be fully automated, administrative data is collected together with the recordings, and data is available immediately, either for control purposes or further processing.

Recommendations

Speaker Recruitment

Speaker recruitment can be characterized along the following dimensions: the number of speakers required, the recruitment strategy, and the incentives offered.

Speakers should be thought of as a valuable resource in speech recordings. It is therefore advisable to build a speaker database which contains the relevant data for each speaker. Preferably, such a database is implemented using a database management system on a computer; this way, data can be entered easily during the preparation of a speech recording. Such a database can also be kept on paper forms in a folder, but then the extraction of speakers according to criteria other than the primary ordering criterion is difficult and error-prone. A minimal sketch of such a database is given below.
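
A minimal sketch of such a speaker database, here using SQLite, is shown below; the columns and values are illustrative assumptions, not a prescribed set of speaker attributes.

    import sqlite3

    con = sqlite3.connect("speakers.db")
    con.execute("""
        CREATE TABLE IF NOT EXISTS speaker (
            speaker_id      INTEGER PRIMARY KEY,
            name            TEXT,
            sex             TEXT,
            birth_year      INTEGER,
            dialect_region  TEXT,
            phone           TEXT,
            recordings_done INTEGER DEFAULT 0
        )
    """)
    con.execute(
        "INSERT INTO speaker (name, sex, birth_year, dialect_region, phone) VALUES (?, ?, ?, ?, ?)",
        ("A. Speaker", "f", 1970, "Bavaria", "089 / ..."),
    )
    con.commit()

    # Selection by arbitrary criteria -- exactly what is hard to do with paper forms.
    rows = con.execute(
        "SELECT speaker_id, name FROM speaker WHERE sex = ? AND birth_year < ?",
        ("f", 1975),
    ).fetchall()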

Recruiting a small (i.e. 1 to 5) or medium (5 to 50) number of speakers is no problem. Depending on the requirements, colleagues, friends, and relatives can be asked to participate. However, one cannot expect any demographic balance in small sets of five or fewer speakers. The advantage of using friends and relatives is that they may be available for a long period of time and that they can in general be used for more than one recording.

The recruitment of a large to very large number of speakers is completely different from that of a small number of speakers. Accessing the speakers, scheduling their recordings, evaluating the recordings, and storing the data become such large tasks that they cannot easily be performed by a single person. Accessing a large number of speakers requires either contact addresses, a public call for participation, or hierarchical recruitment; these strategies are discussed in turn below.

Contact addresses, i.e. telephone numbers or postal addresses, are expensive: address brokers charge for each address bought, with the risk of the address being useless (the address is wrong, the person is not willing to co-operate, etc.) resting entirely upon the buyer. Market research institutes have large address databases from which they can select subsets according to specific criteria, but in general they do not give away these addresses. Although addresses allow persons to be contacted directly, e.g. through mail, telephone, or interviewer visits, the rate of return is rather low, typically less than 5 % for mail, about 25 % for telephone, and about 50 % for interviewer visits.

Public calls for participation, e.g. a newspaper advertisement or article, an Internet posting, or a radio or TV announcement, may reach a very large audience. In many cases a public call can be arranged at little expense: newspapers, especially the science editors, are willing to co-operate, Internet postings are virtually free, and radio or TV announcements are affordable. The rate of return is usually very low (less than 1 %), but this is compensated for by the sheer size of the audience reached. However, the means to control the response to a call for participation are limited. Also, the number of callers will not be evenly distributed over time (most people will call immediately after having received the call), which may cause capacity problems. People responding to a public call for participation are highly motivated; however, this does not hold for the population as a whole and thus introduces a bias.

In hierarchical recruitment the task of recruiting m speakers is divided into n tasks of recruiting m/n speakers each. Hierarchical recruitment works well if the burden of recruiting speakers can be mapped onto some real-world hierarchy, e.g. the employee hierarchy of a company. The rate of return depends strongly on how successfully each person persuades others to participate.

In all three recruitment strategies, incentives may help to increase the motivation to participate and thus the rate of return. Incentives can either be gifts (e.g. telephone cards) or the participation in a lottery with a grand prize. However, such incentives clearly make the recruitment of speakers even more expensive.

Recommendations

Scheduling speakers

Scheduling speakers is important to make optimal use of the recording capacities within a given period of time. Proper scheduling avoids speaker frustration (caused e.g. by having to wait, ever-busy telephone lines, etc.) and allows a maximum number of recordings within the given recording capacity.

If speakers are recorded in a studio, a time slot is reserved for each speaker. This time slot must be sufficient for welcoming and instructing the speaker, for the recording itself, and for the subsequent administration and cleanup.

In general, five minutes for each of the side-tasks should be sufficient.

If speakers have to travel far, then it is almost inevitable that some of them come late or do not appear at all. In such cases it is advisable to have some speakers available at short notice. In any case there must be a person responsible for the scheduling, and this person must be reachable directly by telephone.

For telephone recordings, the number of speakers calling at any one time must be matched to the capacity of the telephone equipment. If potential speakers do not get through because of busy lines, they are likely not to retry. Furthermore, telephone recordings should be possible 24 hours a day, or it must be clear to callers that the service is operational only for a specific period during the day. Note that recording 24 hours a day requires that the recordings be performed automatically because only in rare cases will human operators be available for 24 hours. Again, speakers must be able to reach an operator via telephone, e.g. to report problems or make suggestions.
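
One standard way to match the number of telephone lines to the expected call load is the Erlang B formula from traffic engineering; it is not prescribed by this handbook, and the call rate and call duration below are assumed values for illustration.

    def erlang_b(traffic_erlangs, lines):
        """Probability that a caller finds all lines busy, for the given offered
        traffic (Erlang B formula, computed with the standard recursion)."""
        b = 1.0
        for k in range(1, lines + 1):
            b = traffic_erlangs * b / (k + traffic_erlangs * b)
        return b

    # Example: 30 calls per hour of 6 minutes each -> 3 Erlangs of offered traffic.
    traffic = 30 * (6 / 60)
    for lines in range(1, 9):
        print(lines, round(erlang_b(traffic, lines), 3))
    # With 7 lines roughly 2 % of callers would hit a busy signal.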

Recommendations

Speaker Prompting

Speaker prompting is used to elicit directly from a speaker a certain type of speech data, e.g. numbers, dates, times, etc. Such data is much more difficult to obtain in dialogues or role-play. The major problems with speaker prompting are a decrease in the spontaneous quality of the speech, ambiguous prompts that lead to unexpected responses, and the rigid structure of a prompting script, i.e. a sequence of prompts. However, in many applications prompting a user for input is a natural situation, and thus the decrease in spontaneity in the speech is highly welcome.

Four types of prompts can be distinguished:

- information or feedback prompts, e.g. "You have now completed half the questionnaire."
- instruction prompts, e.g. "Please read the number under topic 1)"
- questions, e.g. "What day of the week is it today?"
- repetition prompts, in which the speaker is asked to repeat what has been prompted, e.g. "three four five".

The possible responses to prompts may vary greatly. It is thus advisable to instruct the speaker which responses are expected, e.g. "Please answer the following questions with yes or no". However, restricting the set of allowed responses too strongly will lead to unnatural speech.

In face to face communication situations there is an influence of the interviewer on the speaker even when the catalogue of prompts is fixed, e.g. in a prompt sheet. Visual communication and deictic references (like pointing a finger at an item to be read) play a significant role. The interviewer guides the speaker through the script and may immediately correct any errors.

In telephone recordings prompts may be output by a computer or a human interviewer. The advantage of computer prompting is that all speakers hear identical prompts. One disadvantage is that a computer based prompting system strictly follows a predetermined prompting script and may not notice that the speaker is not responding correctly. Computer prompting scripts should thus not take longer than 15 minutes, and the script should be divided into several small units. Between the units, feedback should be given to the user to inform him or her of the status of the recording.

Human interviewers immediately realize whether a response from a speaker is correct, and they are able to correct wrong responses immediately. However, each prompt is an individual utterance so that variations among responses may also result from the prompts.

Each prompting script should be thoroughly tested before the actual recording of data. The test participants should be candidate speakers, and the test conditions must be as similar as possible to those of the actual recording. In the case of computer prompting, it is useful to have a prompting simulator which can easily be adapted to new prompting scripts.
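
As an illustration, a prompting script and a very simple console-based simulator might look as follows; the structure, the prompt texts (taken from the examples above), and the function names are assumptions rather than a prescribed format.

    # The four prompt types distinguished above, in a fixed script order.
    script = [
        ("instruction", "Please read the number under topic 1)"),
        ("question",    "What day of the week is it today?"),
        ("repetition",  "Please repeat: three four five"),
        ("information", "You have now completed half the questionnaire."),
    ]

    def run_prompting_script(script, get_response=input):
        """Play a script unit by unit and collect the responses for later checking."""
        responses = []
        for kind, text in script:
            print("[%s] %s" % (kind, text))
            if kind != "information":       # information/feedback prompts expect no answer
                responses.append((text, get_response("> ")))
        return responses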

Recommendations

Cost

The cost of a speech recording is determined by the costs for personnel and equipment and by the duration of the project. The total cost estimate is laid down in a budget, and at given times the actual expenditures are compared to the budget plan.

A speech recording project is usually defined by scientifically trained experts, e.g. speech engineers, phoneticians, etc. Once the project has been defined, the data recording can begin.

The minimum personnel requirements for speech recordings (of a large number of speakers) are a project administrator and supervisor, and a system operator; both should be available for the whole recording period. Depending on the speech recording setup and the processing of the signal data, interviewers, scientific personnel, and auxiliary staff are necessary.

The administrator is responsible for the budget and the supervision of the project as a whole, the recruitment of speakers, the scheduling of recordings, and the organisation of the data evaluation. The system operator is responsible for the technical and data processing aspects of the speech recording, i.e. the setup of equipment, storage and backup of data, etc. Interviewers are needed for speech recordings in face to face communication situations. A first evaluation of the technical quality of recordings can be performed by auxiliary staff, whereas further processing, e.g. the transliteration or a phonetic segmentation and labelling of utterances, requires trained experts or scientific personnel.

The cost of personnel is the sum of salaries, the related infrastructure (room, desk, computer, telephone), and working materials. In many cases existing resources can be reused, but it should be clear that they have to be accounted for in the budget.

The cost for equipment consists of the acquisition and maintenance costs. Again, in many cases existing equipment can be reused and it must be accounted for in the budget. Maintenance costs are significant cost factors which often exceed the original acquisition costs. For time-critical projects, maintenance contracts with a guaranteed repair time should be considered.

A speech recording can be divided into the following phases: initialization and test, preparation, recording, evaluation, and cleanup.

All phases are strictly sequential except for recording and evaluation which can be executed in parallel.

Initialization and test, preparation and cleanup take roughly constant time. The initialization and test phase must be considered very important because wrong decisions here will affect the rest of the project. Preparation can be short if the initialization and test phase results in a good procedural setup.

The duration of the recording and evaluation phases depends directly on the number of recordings. As a rough estimate, double the speaking time (prompts and responses) to obtain the time needed to perform an individual recording (speaker instruction, cleanup, etc.). Depending on the depth of the evaluation, the time needed for evaluation may range from double (technical evaluation) to ten times (phonetic evaluation and transliteration) the speaking time.
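
The following sketch turns these rules of thumb into a rough overall time estimate; the example figures are assumptions.

    def project_time_estimate(n_recordings, speaking_minutes, evaluation_factor=2):
        """Rough estimate: a recording takes about twice its speaking time; evaluation
        takes between 2x (technical check) and 10x (phonetic evaluation,
        transliteration) the speaking time."""
        recording_hours  = n_recordings * speaking_minutes * 2 / 60
        evaluation_hours = n_recordings * speaking_minutes * evaluation_factor / 60
        return recording_hours, evaluation_hours

    # e.g. 500 recordings of 10 minutes each, with transliteration (factor 10):
    print(project_time_estimate(500, 10, evaluation_factor=10))   # approx. (167 h, 833 h)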

Recommendations

Multi-Channel Recording

For some applications one must record a number of physiological signals besides the acoustic signal, such as a laryngograph signal, an electromyograph (EMG) signal, air pressure or flow in the vocal tract, articulatory parameters, X-ray data, etc. The major drawback of recording such additional signals is that speakers have to be bothered with measuring equipment, such as a strap with electrodes around the neck in the case of laryngographic recordings. One should be aware that the measuring equipment may interfere with natural speech production. It is therefore recommended to use additional signal recordings only for basic speech research and for specialised purposes, such as the examination of voice pathology, and otherwise to confine oneself to the basic acoustic signal. For some applications (audio-visual analyses) it may also be useful to make video recordings.


Most of these multi-channel recordings require considerable technical effort. The placement of sensors may disturb the speaker (X-rays are even dangerous), and due to this effort only a few corpora of these kinds of measurements exist.

Laryngography

In a laryngograph recording a pair of electrodes is attached to the throat of the speaker, one on each side of the thyroid cartilage (Adam's apple). This sensor produces a signal proportional to the amount of contact between the vocal folds, e.g. during phonation.

Laryngography recordings were taken at the Eurospeech 93 conference in Berlin, where speakers were recorded during their presentations. The data is available in the TED corpus from the BAS (bas@sun1.phonetik.uni-muenchen.de).

Requirements: Laryngograph sensors, DAT tape recorder or computer interface; 8 bit quantization, approx. 10 kHz sample rate.

Electropalatography

Electropalatography registers the contact of the tongue with the hard palate during articulation. The speaker places a customized thin artificial palate in his or her mouth. This artificial palate contains an array of electrodes which record contact with the tongue.

The data recorded by the individual electrodes are combined into a two-dimensional representation of the tongue-palate contact pattern at any given point in time.

<picture here>

Requirements: Artificial palate individually tailored to a speaker, multi-channel recording device, e.g. computer with a suitable interface; 64 bits per sample (i.e. typically an 8 x 8 electrode array), sample rate 200 Hz.

Electromagnetic Articulography

Electromagnetic articulography (EMA) measures the movement of the tongue and other articulators by means of tiny induction coils attached to the tongue. The head of the speaker is enclosed by a helmet which holds two (or more) transmitter coils that create an electromagnetic field; the signal induced in the coils on the tongue is proportional to the distance from the transmitter coils on the helmet.

<picture>

The EMA provides essentially the same kind of data as the microbeam X-ray (see section ...) but uses a different technology.

Requirements: Articulograph, multi-channel recording device, e.g. a computer with a suitable interface; data rate depends on the quantization, the number of sensors, the number of transmitters (typically 10 sensors and 3 transmitters), and the sample rate (typically 250 Hz).
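
The data rates quoted in these requirements follow from the sample rate, the quantization, and the number of channels; the sketch below reproduces the calculation. The 16-bit resolution and two coordinates per EMA sensor are assumptions for the example.

    def data_rate_bytes_per_s(sample_rate_hz, bits_per_sample, n_channels=1):
        """Raw data rate of an uncompressed multi-channel acquisition."""
        return sample_rate_hz * bits_per_sample * n_channels / 8

    # Electropalatography: one 64-bit contact frame (8 x 8 array) at 200 Hz.
    print(data_rate_bytes_per_s(200, 64))                  # 1600 bytes/s
    # EMA: e.g. 10 sensors x 2 coordinates at 16 bit, sampled at 250 Hz.
    print(data_rate_bytes_per_s(250, 16, n_channels=20))   # 10000 bytes/s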

Cineradiography

X-ray measurements are rarely performed today because of the health hazards they pose to the speaker. However, early recordings are still available on film or, in digital format, on laser disk (Bateson at ATR, Japan).

X-ray measurements show the modification of the articulatory tract during articulation. The movement of the jaw can be seen clearly; tongue and lip movement are often less clear because soft tissue does not show up well on X-ray images. The movement of the vocal folds is too fast to be recorded at the slow frame rate of film recordings.

Requirements: Seldom performed.

Air-flow measurements

In air-flow measurements the speaker wears a mask (usually designed to separate oral and nasal airflow). Flow is usually derived from the pressure drop across a wire mesh located in a flow head mounted in the mask. <picture>

The measurements yield data on the speed, direction, and volume of air flow. Depending on the type of sensor and attachment, the measurement requires that the speaker does not move during articulation.

Requirements: Air flow sensors, data acquisition hardware. The data rate depends on whether phonatory components of airflow need to be captured.

X-ray microbeam

X-ray microbeam provides two-dimensional movement data (usually in the mid-sagittal plane) of selected fleshpoints on the tongue and other articulators. It uses a point-tracking technique to reduce the radiation exposure to the subject to acceptable levels.

Requirements: The equipment is only available at the dedicated microbeam facility in Madison, Wisconsin. Data rate: each fleshpoint is tracked at about 100--200 Hz; typically about 10 fleshpoints are tracked simultaneously.

Nuclear magnetic resonance imaging

Nuclear Magnetic Resonance Imaging is a (so far) static imaging technique with very good resolution of the soft tissues in the vocal tract. Slices can be freely chosen, e.g. sagittal, coronal, etc.

Requirements: a friendly hospital; sample rate < 1 Hz, image resolution typically 256 x 256 pixels with 8 bits per pixel.

Ultrasound imaging

Ultrasound imaging can be used to obtain sagittal and coronal images of the tongue (for those locations on the tongue where no air intervenes between transducer and tongue; the transducer is usually held under, and moves with, the jaw).

Requirements: Ultrasound machine. The data is usually stored as standard video data. A frame-grabber is needed if data is to be digitized.

Wizard of Oz

section still missing --- can anyone provide a practical guide on how to set up a WOz experiment?

Legal Aspects

There are differences between countries in the legal objections against collecting, storing, and disseminating demographic and personal characteristics of speakers in a corpus. In all countries name and address data may be freely stored and published, as long as it is guaranteed that no other data about the person can be linked to the name and address. Limitations in this respect must be carefully checked with legal advisers in the country or countries of interest.

Another legal issue that must be considered is the consent to record speech and subsequently to make the recordings available to other parties. In legal terms, making recordings available is probably equivalent to publishing them. There are likely to be differences between countries in the legislation about recording speech. In some countries recording is legal as long as one of the parties involved in a dialogue gives explicit consent. Probably, the law only states that recording for one's own use is legal; publication or any other form of dissemination may be illegal, or bound by much stricter regulation. To avoid problems in this area, it is recommended to have speakers sign a statement to the effect that they know that the recordings are made for later dissemination and publication. If that is not possible, e.g. when collecting large corpora of telephone speech, it is recommended that each speaker be explicitly advised to abort the call if he or she has any objections against recording and subsequent publishing of the speech to be produced. It is very important to consult legal advisers about the correct formulation of these statements and advisories.

Special care must be taken when recording and publishing corpora with pathological speech or with speech of very young children who are not yet able to give conscious consent for publishing. Especially with rare pathologies it may be very easy to trace the speech back to the patient, even if name and address data are not coupled to the data about the pathology. With pathological speech of mentally healthy adult patients a carefully formulated written and signed consent form may be sufficient. With speech of very young children carefully formulated consent forms signed by the parents are necessary.

Legal issues (and ethical ones as well) are especially relevant in the case of surreptitious recording of speech. In order to circumvent the observer's paradox (see section 3.4.1.7), researchers might want to resort to surreptitious recording of speech. This could, for instance, be done with a concealed microphone 'in the field' or by tapping telephone conversations. In a strict sense, invading the privacy of someone's personal speech might be regarded as illegal by the authorities. And even if it were not illegal, it might still be regarded as unethical by some people. As far as we know, no linguist has ever been tried for recording speech data surreptitiously. However, to minimize the risk of breaking the law in any way, and to conform to ethical norms as much as possible, the following guidelines can be taken into consideration:

A more elaborate discussion of legal and ethical aspects of surreptitious recording of speech can be found in Larmouth (1985) and Murray & Murray (1985).





