
SAM recording protocols

Definition of terms

SAMPLE: A digital representation of the analogue waveform at any one instant in time.
SAMPLING RATE: The (normally constant) rate at which SAMPLEs are collected and transferred to a storage medium.
SAMPLING INTERVAL: The time between successive SAMPLES - fixed for any one SAMPLING RATE while the sampling process is not stopped.
DISCONTINUITY: The point in a stream or stored set of SAMPLES where the sampling process has been stopped for longer than one SAMPLING INTERVAL.
(Explanation - it is equivalent to the splice edit point on the tape of an analogue tape recorder.)
SEGMENT: A set of sequential SAMPLES held on the storage medium which does not contain a DISCONTINUITY. A SEGMENT will be bounded by two DISCONTINUITIES.
(Explanation - this means that during the period over which a SEGMENT has been collected the process of sampling and storing has not been stopped.)
SPEECH FILE: A set of sequential SAMPLEs held on the storage medium. The SPEECH FILE will contain one or more SEGMENTS.
(Explanation - the sampling process may have been stopped, then started again during the period over which a SPEECH FILE has been collected.)
TAKE: The time over which a subject is invited to complete the production of a specific set of utterances. The TAKE will be recorded in one or more SEGMENTS. A SPEECH FILE will often (but not always) contain one TAKE.
(Explanation - the start of a single TAKE will often be identified by the issuing of a set of instructions to the subject and/or the inputting into a prompting system of the utterance set ID. It will normally be terminated by the prompting system finishing an utterance list and further operator-subject dialogue. There will normally be no operator-subject dialogue within a TAKE.)
SESSION: The period of time from when a specific subject arrives at the recording location to when they leave. During this time one or more TAKEs will be recorded.
SPEECH CORPUS: A set of SPEECH FILES.
SPEECH DATABASE: One or more SPEECH CORPORA complete with the information to allow the CORPORA to be usefully accessed. This information will usually be in the form of time-aligned annotation, speaker descriptions, recording condition files, etc.
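
(Illustration - the relationship between SAMPLING INTERVAL, DISCONTINUITY and SEGMENT can be sketched in a few lines of Python. This is a minimal sketch only: it assumes that a capture time in seconds is available for every SAMPLE, which a stored SPEECH FILE would not normally provide, and the names `Segment`, `split_into_segments` and `tolerance` are hypothetical and not part of any SAM tool.)

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Segment:
    start_index: int  # index of the first SAMPLE of the SEGMENT
    length: int       # number of SAMPLES in the SEGMENT


def split_into_segments(sample_times: List[float],
                        sampling_rate: float,
                        tolerance: float = 0.5) -> List[Segment]:
    """Split a list of SAMPLE capture times into SEGMENTS.

    A DISCONTINUITY is assumed wherever the gap between successive
    SAMPLES exceeds one SAMPLING INTERVAL by more than `tolerance`
    (a fraction of the interval, to absorb timing jitter).
    """
    if not sample_times:
        return []
    interval = 1.0 / sampling_rate
    segments = []
    start = 0
    for i in range(1, len(sample_times)):
        gap = sample_times[i] - sample_times[i - 1]
        if gap > interval * (1.0 + tolerance):
            # Sampling was stopped here: close the current SEGMENT.
            segments.append(Segment(start, i - start))
            start = i
    segments.append(Segment(start, len(sample_times) - start))
    return segments
```

A SPEECH FILE for which this returns a single `Segment` contains no internal DISCONTINUITY.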

Classification of general strategies for recording and prompting

When designing a SPEECH DATABASE a number of decisions have to be made concerning the size and content of utterance lists, size of TAKES, length of SESSIONS, methods of prompting and recording, and dealing with errors. It is assumed in the following that a digital storage method is to be used, although many of the points would apply in a slightly modified form to an analogue recording system.

There follows a classification and discussion of some of the design areas.

Recording mode

The RECORDING MODE defines the exact method of controlling transfer of speech samples to the storage medium. It is defined in terms of the TAKE, SEGMENTS and SPEECH FILES. It is concerned with the stopping and starting of the sampling process and the way the recording system treats any possible speaking errors.

MODE 1: A TAKE is recorded in one complete SEGMENT, errors are not excised.
(Explanation - this means that the sampling and transfer process is started at the beginning of the TAKE, all the acoustic signal, including breath noise and speaking errors, is recorded, and the sampling and transfer process is only stopped at the end of the TAKE.)
MODE 2: A TAKE is recorded in one complete SEGMENT, except when a speaking error is detected during the TAKE, and then that error is excised from the storage medium. If there have been no speaking errors there will be no DISCONTINUITIES in the SPEECH FILE. For each section of speech removed due to a speaking error there will be one DISCONTINUITY.
MODE 3: A TAKE is recorded in a number of SEGMENTS. There will be one SEGMENT per utterance, and speaking errors will be included in the SPEECH FILE.
(Explanation - in general this means that an endpoint detector is used to start and stop the sampling and transfer process. Apart from the `back-off' time no inter-utterance acoustic signal will be stored in the SPEECH FILE.)
MODE 4: A TAKE is recorded in a number of SEGMENTS. Speaking errors will be excised from the storage medium, and so there will be one SEGMENT per required utterance. This MODE is the same as MODE 3 except that the resulting file will contain no errors.
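
(Illustration - the four MODES differ in how many SEGMENTS a TAKE produces. The sketch below is one reading of the definitions above and rests on two assumptions that the definitions do not state: in MODE 2 every excised error lies inside the TAKE and no two excised sections are adjacent, and in MODE 3 every speaking error is captured as one extra endpointed utterance. The function name `expected_segments` is hypothetical.)

```python
def expected_segments(mode: int, required_utterances: int, errors: int) -> int:
    """Number of SEGMENTS expected for one TAKE under each RECORDING MODE."""
    if mode == 1:
        # One complete SEGMENT, errors left in.
        return 1
    if mode == 2:
        # One DISCONTINUITY per excised error splits the TAKE once more.
        return 1 + errors
    if mode == 3:
        # One SEGMENT per utterance produced, erroneous ones included.
        return required_utterances + errors
    if mode == 4:
        # One SEGMENT per required utterance, errors excised.
        return required_utterances
    raise ValueError("mode must be 1, 2, 3 or 4")
```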

Prompting style

The PROMPTING STYLE primarily defines how a prompting system reacts when speaking errors have been detected during a TAKE and how the subject is instructed to behave after the production of speaking errors. The following classifications assume an automatic prompting system, but non-automatic systems can be classified in the same way. For each style there are three different timing strategies.

STYLE 1: ABORT TAKE and re-record. The subject is instructed that if they realise that they have made a speaking error then they should indicate the fact to the operator. When the operator is aware of a speaking error the prompting and recording systems are stopped by a suitable `escape' mechanism. This situation is indicated to the subject and the prompting system is restarted to re-record that TAKE.
STYLE 2: BACKUP-ON-FLY n utterances. If and when a speaking error is detected during a TAKE, the prompting system is backed up to a point before the error and continues from there. What happens to the recording will depend on the MODE being employed.
STYLE 3: TACK-ON-END correction. If and when a speaking error is detected during a TAKE the prompting system carries on without stopping until the last required utterance has been produced and then the subject is prompted for one or more utterances to correct for the speaking errors. The recording action depends on the MODE employed.
STYLE 4: NO PROMPTING REACTION. The prompting system carries on producing prompts in the pre-determined order. The subject may or may not be asked to react by correcting the error in some way.

Timing strategy

REGULAR: The timing of the prompt is independent of the rate at which the subject is actually speaking. In its simplest form each prompt is displayed at regular predefined intervals which will depend on the type of utterance that is to be produced. Alternatively the interval may vary from prompt to prompt depending on the expected (not the actual) time to speak the prompted text, i.e. dependent on the number of words/phonemes in a sentence.
ENDPOINT: The timing of the prompt is totally controlled by the production of utterances. The display of each new prompt is controlled by the detection of the endpoint of the last utterance.
MIXED: The timing of the prompt is controlled by a logical combination of a predetermined interval and the endpoint of an utterance. The display of each new prompt is triggered by whichever is later of the predetermined interval or the endpoint. This means that if a subject is slow the prompting system will slow down, but there is a maximum rate of prompting even for the fastest speaker.
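
(Illustration - the three timing strategies reduce to a simple rule for when the next prompt is displayed. The following is a minimal sketch, with times in seconds measured from the start of the TAKE; the names `Timing` and `next_prompt_time` are hypothetical and do not describe any particular prompting system.)

```python
from enum import Enum


class Timing(Enum):
    REGULAR = "regular"
    ENDPOINT = "endpoint"
    MIXED = "mixed"


def next_prompt_time(strategy: Timing,
                     last_prompt_time: float,
                     interval: float,
                     endpoint_time: float) -> float:
    """Time at which the next prompt is displayed.

    last_prompt_time: when the previous prompt was displayed
    interval:         predetermined interval for that prompt
    endpoint_time:    detected endpoint of the last utterance
    """
    scheduled = last_prompt_time + interval
    if strategy is Timing.REGULAR:
        return scheduled            # ignores the speaker entirely
    if strategy is Timing.ENDPOINT:
        return endpoint_time        # driven only by the speaker
    if strategy is Timing.MIXED:
        # Whichever is later: a slow speaker delays the prompt,
        # a fast speaker cannot advance it.
        return max(scheduled, endpoint_time)
    raise ValueError("unknown timing strategy")
```

With a 4 second interval, a subject who finishes after 2.5 seconds is prompted at 4 seconds under MIXED timing, while one who finishes after 5.2 seconds is prompted at 5.2 seconds.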

Recording protocol

Microphone

A single B&K half inch pressure microphone number 4155 will be used, connected to a 2230 level meter used as a microphone amplifier. If a recording site proposes to use another microphone or preamplifier, the alternative choice of equipment should be ratified by other members of the project. IES has suggested a 4165 microphone, 2660 pre-amp and 2636 amplifier. Bochum University recommend the AKG C-414 microphone in omnidirectional mode connected to the John Hardy M-1 mic preamp.

The audio output of the level meter will be connected to both the line input of the OROS AU21 board and the left channel of the digital audio safety backup recorder. The signal will be split in such a way as to ensure no degradation of the audio signal, particularly in terms of loading, balanced/unbalanced connections and hum loops. Bochum University recommend the Brooke Siren Systems microphone splitter.

The microphone will be positioned for 90 degree incidence, 50 cm from the lips and 15 degrees off axis (TNO configuration).

Tests by NPL have shown that this distance is acceptable in terms of speech signal to ambient/monitor noise.

The microphone position relative to the head/lips will be set and maintained by a headrest (or the not so consistent method of requesting the subject to maintain head position).

Other sensors

A Laryngograph will be used for some speaker/material combinations as specified in material choice section. The neck sensors will be positioned in accordance with UCL instructions, namely on either side of the point of the thyroid cartilage. The Lx output of the Laryngograph will be connected to the right-hand channel of the digital audio (safety backup) recorder.

Speech data capture

Single channel, digital, direct recording to disc will be used for the microphone signal. A digital audio tape recorder safety backup of the microphone channel will be made at all times. A second channel recording direct to disc and/or a digital audio tape recording of the Laryngograph signal will be made as required.

The mechanism for direct digital recording to disc will be the OROS AU21 board for single channel recordings, or an OROS AU22 for two-channel recordings, on the SESAM workstation with EUROPEC software. The sampling rate will be 20 kHz and the standard OROS digital filtering will be used with an oversampling factor of 4. The nominal resolution of this system is 16 bits. Care should be taken to ensure that the OROS board is placed in a slot in SESAM which minimises PC electrical interference.

The `standard' analogue filter on the OROS board is a 20 kHz low-pass filter with 0.3 dB ripple in the passband, 160 dB/octave slope and 80 dB rejection. This is followed by a digital filter implemented on the TMS320C25 with its -3 dB point at 8 kHz. Any DC offset is automatically removed by the OROS/EUROPEC software and the gain of the line input will be set within EUROPEC on a `per-speaker' basis.
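
The figures quoted for the filter chain can be cross-checked with simple arithmetic. The sketch below assumes that an `oversampling factor of 4' means the converter runs at four times the 20 kHz output rate, which is the usual meaning but is not stated explicitly above; the function name `filter_chain_check` is hypothetical.

```python
def filter_chain_check(output_rate_hz: int = 20_000,
                       oversampling: int = 4,
                       analogue_cutoff_hz: int = 20_000,
                       digital_cutoff_hz: int = 8_000) -> dict:
    """Check both filter cut-offs against the relevant Nyquist limits."""
    converter_rate_hz = output_rate_hz * oversampling
    return {
        "converter_rate_hz": converter_rate_hz,
        # The analogue filter must act below half the converter rate.
        "analogue_below_converter_nyquist":
            analogue_cutoff_hz < converter_rate_hz / 2,
        # The digital filter must act below half the output rate.
        "digital_below_output_nyquist":
            digital_cutoff_hz < output_rate_hz / 2,
    }
```

Under that assumption the converter rate is 80 kHz, so the 20 kHz analogue filter sits below the 40 kHz converter Nyquist limit and the 8 kHz digital -3 dB point sits below the 10 kHz Nyquist limit of the stored signal.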

The safety back-up (and second sensor recordings if an OROS AU22 is not used) will be made on a digital audio tape system such as the SONY Video PCM or DAT systems, having a specification with regard to bandwidth, signal-to-noise ratio and wow and flutter in excess of that of the OROS AU21 board. The recorder shall be left recording for the WHOLE of a session and not be stopped or rewound at all during the session. This removes the chance of over-recording wanted portions, reduces operator effort, and will allow later analysis of a number of features not captured on the direct to disc recordings, such as number of retakes, motivation level of subject, coughs etc.

Recording environment

The recordings will take place in an anechoic room.

The subject will not be provided with an electrical sidetone path for their own voice.

The lighting and temperature levels will be such as to cause no stress to the subject and care must be taken to ensure that there is no distracting light reflection off the prompting screen. Care must also be taken to ensure that the anechoic nature of the recording is not compromised by allowing significant acoustic reflections to be picked up on the microphone. The prompting monitor should be slightly angled and there should be no tables or other sound reflecting surfaces between the subject and the microphone (alternatively, tables should be covered with sound absorbing material).

Recording mode and prompting style

It has been agreed that the IASM-A CORPUS should contain no speaking errors. Further, the inter-utterance pauses should be captured in their entirety and there should be no DISCONTINUITIES within a TAKE.

PROMPTING STYLE 1 WILL BE USED TOGETHER WITH RECORDING MODE 1. THERE SHALL BE ONLY ONE TAKE PER FILE.

The recording system will be integrated with the prompting system and this is to be as automatic as possible (see EUROPEC Section III below). It is clearly possible for the prompting system to start the recording system before the first prompt. In the general case of RECORDING MODE 1 as discussed in the earlier part of this document, the end of the final utterance is not defined. However, it has been agreed that a base level annotation will be automatically produced during the TAKE. This will consist of the prompted orthographic text along with the endpoints of each utterance. As each endpoint is known to the prompting/recording system, then this information can be used to stop the recording process at the end of the final utterance.

A TIME ALIGNED ORTHOGRAPHIC ANNOTATION WILL BE PRODUCED AUTOMATICALLY DURING THE TAKE --- THE RECORDING PROCESS WILL BE STOPPED AUTOMATICALLY AFTER THE FINAL UTTERANCE.

However, if in the light of experience, it is found that significant SESSION time is wasted in re-recording TAKES or subjects are being excessively stressed by the `ABORT TAKE on error' style then the next most appropriate technique would be to `BACKUP-ON-FLY' and excise any errors from the TAKE. This technique will require an addition to the previously agreed standard of annotation file, namely a label to indicate the position of the DISCONTINUITY so caused.

PROMPTING STYLE 2 AND RECORDING MODE 2 WILL BE USED ONLY EXCEPTIONALLY AND WITH THE AGREEMENT OF OTHER MEMBERS OF THE CONSORTIUM --- A DISCONTINUITY MARKER TO BE ADDED TO THE ANNOTATION FILE. ONE TAKE PER FILE.

It has been suggested that a regular timing strategy puts too much stress on some speakers and that there is evidence of excessive speaking rate when endpoint timing (i.e. speaker driven prompting rate) is used.

MIXED TIMING STRATEGY WILL BE USED

The exact value for tP (the minimum prompting interval in this strategy) depends on the display time (DLA: [time in seconds] in the prompt file) plus the stop back-off time.

Recording control

The subject will be prompted by an automatic system --- EUROPEC running on the SESAM workstation positioned outside the anechoic room. The EUROPEC system will be controlled by a recording manager (operator), viewing the standard SESAM monitor. The subject will use a second monitor placed in the anechoic chamber, taking its video input from a `T' connection from the SESAM video output. The size and positioning of the monitor will allow the subject to see the prompted text and level meter without stress. The positioning of the monitor with respect to the microphone will ensure that there is no detectable noise pickup, either acoustically or electrically (particularly at line and frame frequencies). Bochum University suggest that by using EGA or VGA graphics mode the line frequency is above 20 kHz and is therefore unlikely to be a source of interference.

The speaker to microphone distance, azimuth and elevation will be controlled by ensuring minimal head movement commensurate with minimum subject stress. A headrest is highly recommended.

The subject will have, for the material/country combinations specified in the recording material section, an acoustical prompt generated by a second SESAM fed by the RS232 output of EUROPEC.

The operator will listen to the subject at all times via the recording chain. This will allow subject-operator communication and enable the operator to do a 100% check on the speech material content. Bochum University suggest that 2 operators should be employed simultaneously to reduce fatigue. The headphone output of the PCM/DAT digital audio safety backup machine would be a suitable source for audio monitoring.

The operator will speak to the subject in the anechoic room via a one way intercom/loudspeaker system, which will be switched off during takes. The gain of the intercom system shall be set to a level at the ear representing speech 1 metre from the lips of an average talker. The setting of this level is important as it will tend to affect the speaking level of the subject --- and should be consistent across recording sites.

Speaking effort will be controlled for some speaker/material combinations as specified in the reference material section. This will be done by the use of a speech level meter displayed on prompting monitor by EUROPEC. There will be three modes of operation:

The level meter gain is set to nominal. After the speaker has stabilised in level, the recording gain will be adjusted so that the normal peak level of the speech reaches a reference point 12 dB below peak. The speaker is controlled for consistent speaking level during the subsequent take by being asked to keep to the reference point as much as possible.
The level meter gain is set 6 dB lower than nominal; the speaker keeps to the reference point and so will speak 6 dB louder.
The level meter gain is set 6 dB higher than nominal; the speaker keeps to the reference point and so will speak 6 dB more quietly.
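
The three modes rely on one relationship: the meter displays the speech level plus the meter gain offset, and the speaker adjusts until the display reaches the reference point. A minimal sketch, with all levels in dB relative to the nominal condition and a hypothetical function name:

```python
def required_speech_level_db(reference_db: float,
                             meter_gain_offset_db: float) -> float:
    """Speech level needed for the meter to read `reference_db`.

    displayed level = speech level + meter gain offset, so the
    speaker must produce reference minus offset.
    """
    return reference_db - meter_gain_offset_db
```

With the reference taken as 0 dB, an offset of -6 dB requires the subject to speak at +6 dB, and an offset of +6 dB requires -6 dB.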

Speaking style will be controlled by:

For some specified material the prompting system will show sets of sentences as a single item, and for other material they will be shown as separate items.
For some country/material combinations the speaker will be additionally prompted by an audio stimulus. The audio will be generated as a result of an RS232 serial output from EUROPEC/SESAM.

Speaking rate will not be controlled except that the maximum rate of presentation of prompted items will be at a rate set within EUROPEC dependent on the material. This rate will be defined within the choice of material section.

There will be no attempt to produce stress in the subject. The total amount of material recorded in each session will be restricted to 8 minutes, representing a session of approximately 30 minutes elapsed time. This limit will minimise stress on the subject and represents the maximum time that can be recorded on a 20 Megabyte disc.
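
The 8 minute figure follows from the sampling parameters. The sketch below assumes single channel recording with each 16 bit sample stored in 2 bytes and ignores any file header or annotation overhead; the function name `recording_bytes` is hypothetical.

```python
def recording_bytes(seconds: int,
                    sampling_rate_hz: int = 20_000,
                    bits_per_sample: int = 16,
                    channels: int = 1) -> int:
    """Raw storage needed for `seconds` of sampled speech."""
    bytes_per_sample = bits_per_sample // 8
    return seconds * sampling_rate_hz * bytes_per_sample * channels


# 8 minutes at 20 kHz, 16 bits, one channel
eight_minutes = recording_bytes(8 * 60)  # 19,200,000 bytes
```

This is 40,000 bytes per second, so 8 minutes occupies 19.2 million bytes, just under the 20 Megabyte disc. A two-channel recording would halve the available time.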

The recording manager will ensure that the subject is in a relaxed condition.

There will be no attempt to control any speaking effect due to the time of day of the session.

Recording procedure

At each session the speaker code of the subject will be confirmed. (Initially this will be done by checking a printed list of speaker details --- an automatic system will subsequently be used.)

Session specific information will be entered via the EUROPEC system --- some manually and some from the conditions file.

At each session the recording manager will provide an outline of the procedure for that session. This briefing should be in written or recorded form so as to achieve consistency between subjects.

At each session in the anechoic room the subject will be asked to practise speaking typical material for that session, with the level meter switched off. A note will be made of the peak level during the practice and the gain of the OROS AU21 will be set such that this peak level is 12--15 dB below the maximum possible recording level. A large number of experienced recording sites have rejected the original suggestion of 6 dB headroom. They feel that 12 to 15 dB would be safer and would reduce the possibility of takes having to be re-recorded due to overload; this in turn will reduce the stress on inexperienced speakers. The gain setting will be automatically recorded for inclusion in the session dependent recording conditions file.
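
The gain adjustment implied by this procedure is a subtraction in dB. The following sketch assumes that the practice peak is expressed in dB relative to the maximum possible recording level (0 dB at overload); how EUROPEC actually reports or sets the gain is not described here, and the function names are hypothetical.

```python
def gain_change_db(practice_peak_db: float,
                   target_headroom_db: float = 12.0) -> float:
    """Gain change that places the practice peak at the target headroom.

    practice_peak_db: observed peak relative to maximum level (<= 0)
    """
    target_peak_db = -target_headroom_db
    return target_peak_db - practice_peak_db


def db_to_amplitude_ratio(db: float) -> float:
    """Convert a gain in dB to a linear amplitude ratio."""
    return 10.0 ** (db / 20.0)
```

For example, a practice peak 4 dB below maximum with a 12 dB headroom target calls for the gain to be reduced by 8 dB.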

If the session is to involve the use of control by level meter the subject should practise controlling their level with respect to the meter, for all (3) level meter gain settings.

The recording conditions/material etc. for each take will be entered into EUROPEC --- this will cause the correct file names to be generated.

At each take the subject will, where appropriate, be given the opportunity to preview the prompt material. This is particularly needed for continuous speech material. In some cases, e.g. for particular forms of numbers, orthographically ambiguous nonsense material etc., additional instructions need to be given, and the subject given time for familiarisation with the task before the session.

The recording manager will monitor the speech production with reference to the prompted text. If there are any errors the take will be stopped and the subject asked to start the relevant take again. The disc file will be discarded and the same serial number reused for the re-recording. The digital audio tape recorder safety backup will not be stopped due to re-recording of errors --- only at the end of a session. If after the integrity check a take has to be re-recorded then the digital audio safety backup recorder should be restarted.

Integrity checks

At the end of each session the quality of the recordings should be assessed by a 10% review, and the integrity of the item end-point labelling should be 100% checked by using the VERIPEC module within the EUROPEC software.
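
One way to draw the 10% review sample is shown below. This is a sketch only: it assumes that the 10% is taken over the speech files of the session, rounds up so that at least one file is always reviewed, and uses hypothetical file names. It does not reproduce anything done by VERIPEC.

```python
import random
from typing import List, Optional, Sequence


def review_sample(file_names: Sequence[str],
                  percent: int = 10,
                  seed: Optional[int] = None) -> List[str]:
    """Pick `percent` of the files (rounded up) at random for review."""
    if not file_names:
        return []
    # Integer ceiling division avoids floating point rounding surprises.
    k = (len(file_names) * percent + 99) // 100
    rng = random.Random(seed)
    return sorted(rng.sample(list(file_names), k))


# Hypothetical file names, for illustration only.
session_files = ["take%02d.dat" % n for n in range(1, 26)]
to_review = review_sample(session_files, seed=1)
```

Passing a fixed `seed` makes the selection repeatable, which may be useful if the review has to be audited later.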

Backup procedures

At the end of each session all files produced should be transferred to an Exabyte 8mm tape system, or via Ethernet to a local mainframe, for backup. The available disc space on the SESAM workstation is approximately 20Mbytes, representing 8 minutes of speech --- this therefore sets a convenient and reasonable maximum for the recording session and represents approximately 30--40 minutes of elapsed time per session. All files produced during the session should also be backed up at the same time --- the data is a valuable resource and files are easily lost. The digital audio safety backup tapes produced for each session shall be kept indefinitely as an archive of the session. It is expected that 4 to 6 sessions can be accommodated per tape, and care must be exercised not to over-record a previous session.

Retrieval procedures

If a speech file is lost AND its computer backup is also lost (OR the backup was never made!!!) then the digital tape recording safety backup will have to be used. This raises a number of questions. If the recordings are transferred to computer form in the digital domain, then an OROS AI-PCM type interface will be required --- but this does not allow a resampling rate of 20 kHz; 19.8 kHz is the closest possible. This retrieval method is only possible if the recording was made on a PCM or DAT machine with the pre-emphasis switched off. The alternative is to copy in the analogue domain; the problems here are:

There is normally a degradation of frequency and phase responses and the signal-to-noise ratio is reduced (the latter effect is possibly masked by the reported poor signal-to-noise ratio of the OROS A/D board).
The gain of the original signal is relatively poorly defined because the gain on the PCM unit is infinitely variable and there is no automatic method of noting the gain set for the microphone channel for each speaker. The calibration signal should be used for this purpose.
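
The size of the sampling rate mismatch in the digital transfer route can be quantified as follows. This is an illustrative calculation only, not a description of the OROS interface: it shows the rational conversion ratio that a later software resampling step from 19.8 kHz to 20 kHz would need, and the duration error that results if a 19.8 kHz file is mistakenly treated as 20 kHz. The names used are hypothetical.

```python
from fractions import Fraction

TARGET_RATE_HZ = 20_000
CLOSEST_RATE_HZ = 19_800

# Exact conversion ratio from the retrieved rate to the corpus rate.
conversion_ratio = Fraction(TARGET_RATE_HZ, CLOSEST_RATE_HZ)


def duration_error_percent(true_rate_hz: float, assumed_rate_hz: float) -> float:
    """Percentage change in apparent duration when a file sampled at
    `true_rate_hz` is interpreted as `assumed_rate_hz`."""
    return (true_rate_hz / assumed_rate_hz - 1.0) * 100.0
```

The ratio reduces to 100/99, and a retrieved file labelled as 20 kHz without conversion would appear about 1% shorter (and correspondingly higher in pitch), so the true rate should be recorded with any file recovered this way.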

Calibration

The recording chain will be calibrated in two ways:

A B&K 4230 1 kHz calibrator is placed over the microphone.
A 20 Hz rectangular wave with a 4:7 mark-space ratio and an amplitude of 100 millivolts pk-pk is injected as an electrical signal into the microphone body. This is achieved by the use of:
1. a signal generator (the circuit of which can be obtained from UCL);
2. a B&K Input Adaptor, JJ 2614, which replaces the microphone capsule;
3. a capacitor of a value equal to that of the capsule element replacing the 1nF capacitor in the B&K adaptor.
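
For reference, the ideal shape of the injected signal as it would appear after sampling at 20 kHz can be sketched as below. This is not the UCL generator circuit. It assumes that `mark' is the high portion of the cycle, that the waveform is symmetrical about zero volts, and that the mark length is rounded to a whole number of samples; the function name is hypothetical.

```python
from typing import List


def calibration_period(sampling_rate_hz: int = 20_000,
                       frequency_hz: int = 20,
                       mark: int = 4,
                       space: int = 7,
                       pk_pk_volts: float = 0.1) -> List[float]:
    """One period of a rectangular wave with the given mark-space ratio."""
    samples_per_period = sampling_rate_hz // frequency_hz
    # Mark occupies mark / (mark + space) of the period.
    mark_samples = round(samples_per_period * mark / (mark + space))
    high = pk_pk_volts / 2.0
    low = -pk_pk_volts / 2.0
    return [high] * mark_samples + [low] * (samples_per_period - mark_samples)
```

At 20 kHz one 20 Hz period is 1000 samples, of which 364 are mark and 636 are space, a duty cycle of about 36%.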

A recording of both calibrator signals shall be made every 20 sessions, providing confirmation of gain calibration, and recording path frequency response, phase response and noise figure. The length of recording of each calibrator shall be 10 seconds plus 10 seconds of no signal condition. It would be sensible to check the signal-to-noise ratio using these files at the site to confirm the correct working of the system.
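
The on-site signal-to-noise check suggested above could be carried out along the following lines. The sketch assumes that the calibration file has already been read into a list of sample values, with the 10 seconds of calibrator signal first and the 10 seconds of no signal immediately after it; the file reading itself and the function names are outside anything specified here. Strictly, the result is a (signal plus noise) to noise ratio.

```python
import math
from typing import Sequence


def rms(values: Sequence[float]) -> float:
    """Root mean square of a non-empty sequence of samples."""
    return math.sqrt(sum(v * v for v in values) / len(values))


def snr_db(samples: Sequence[float],
           sampling_rate_hz: int = 20_000,
           signal_seconds: float = 10.0,
           silence_seconds: float = 10.0) -> float:
    """Signal-to-noise ratio in dB from a calibration recording."""
    n_signal = int(round(signal_seconds * sampling_rate_hz))
    n_silence = int(round(silence_seconds * sampling_rate_hz))
    signal = samples[:n_signal]
    noise = samples[n_signal:n_signal + n_silence]
    if len(signal) < n_signal or len(noise) < n_silence:
        raise ValueError("recording is shorter than expected")
    noise_rms = rms(noise)
    if noise_rms == 0.0:
        return math.inf  # no measurable noise at all
    return 20.0 * math.log10(rms(signal) / noise_rms)
```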

The Laryngograph channel shall also be calibrated, using the UCL 4:7 mark-space ratio rectangular wave calibrator.

Inter site consistency and recording procedure verification

A pilot set of recordings will be made at each site consisting of 1 speaker speaking 10% of all the sentence and number material and all CVCs. This pilot material from each site will be transferred using floppy discs, one each for calibration, sentences, CVCs and numbers. The material will consist of the speech files together with corpus description files, labelling files, speaker file, text files and conditions file.

The floppy discs will be sent to NPL where the quality and consistency will be assessed by a panel of `experts'. Modifications and clarifications to this set of protocols may then be made.

Collation of recordings

All the recordings at one site will be collated and checked. Checks will be made for:

Quality of speech material and labelling on a 1% basis.
Consistency of file names.
Number of files of each type.

The central collation point will configure the material in the correct directory structure for the production of one or more sets of CD-ROM.



WWW Administrator
Fri May 19 11:53:36 MET DST 1995