When designing a SPEECH DATABASE a number of decisions have to be made concerning the size and content of utterance lists, size of TAKES, length of SESSIONS, methods of prompting and recording, and dealing with errors. The following assumes that a digital storage method is to be used, although many of the points would apply, in slightly modified form, to an analogue recording system.
There follows a classification and discussion of some of the design areas.
The RECORDING MODE defines the exact method of controlling transfer of speech samples to the storage medium. It is defined in terms of the TAKE, SEGMENTS and SPEECH FILES. It is concerned with the stopping and starting of the sampling process and the way the recording system treats any possible speaking errors.
The PROMPTING STYLE primarily defines how a prompting system reacts when speaking errors have been detected during a TAKE and how the subject is instructed to behave after the production of speaking errors. The following classifications assume an automatic prompting system, but non-automatic systems can be classified in the same way. For each style there are three different timing strategies.
A single B&K type 4155 half-inch pressure microphone will be used, connected to a type 2230 level meter acting as the microphone amplifier. If a recording site proposes to use another microphone or preamplifier, the alternative choice of equipment should be ratified by other members of the project. IES has suggested a 4165 microphone, 2660 pre-amp and 2636 amplifier. Bochum University recommend the AKG C-414 microphone in omnidirectional mode connected to the John Hardy M-1 mic preamp.
The audio output of the level meter will be connected to both the line input of the OROS AU21 board and the left channel of the digital audio safety backup recorder. The signal will be split in such a way as to ensure no degradation of the audio signal, particularly in terms of loading, balanced/unbalanced connections and hum loops. Bochum University recommend the Brooke Siren Systems microphone splitter.
The microphone will be positioned for 90 degree incidence, 50 cm from the lips and 15 degrees off axis (TNO configuration).
Tests by NPL have shown that this distance is acceptable in terms of speech signal to ambient/monitor noise.
The microphone position relative to the head/lips will be set and maintained by a headrest (or the not so consistent method of requesting the subject to maintain head position).
A Laryngograph will be used for some speaker/material combinations as specified in material choice section. The neck sensors will be positioned in accordance with UCL instructions, namely on either side of the point of the thyroid cartilage. The Lx output of the Laryngograph will be connected to the right-hand channel of the digital audio (safety backup) recorder.
Single-channel, digital, direct recording to disc will be used for the microphone signal. A digital audio tape safety backup of the microphone channel will be made at all times. A second-channel recording direct to disc and/or a digital audio tape recording of the Laryngograph signal will be made as required.
The mechanism for direct digital recording to disc will be the OROS AU21 board for single-channel recordings, or an OROS AU22 for two-channel recordings, on the SESAM workstation with EUROPEC software. The sampling rate will be 20 kHz, and the standard OROS digital filtering will be used with an oversampling factor of 4. The nominal resolution of this system is 16 bits. Care should be taken to ensure that the OROS board is placed in a slot in the SESAM which minimises PC electrical interference.
The `standard' analogue filter on the OROS board is a 20 kHz low-pass filter with 0.3 dB ripple in the passband, 160 dB/octave slope and 80 dB rejection. This is followed by a digital filter implemented on the TMS320C25 having the -3 dB point at 8 kHz. Any DC offset is automatically removed by the OROS/EUROPEC software and the gain of the line input will be set within EUROPEC on a `per-speaker' basis.
The safety back-up (and second sensor recordings if an OROS AU22 is not used) will be made on a digital audio tape system such as the SONY Video PCM or DAT systems, having a specification with regard to bandwidth, signal-to-noise ratio and wow and flutter in excess of that of the OROS AU21 board. The recorder shall be left recording for the WHOLE of a session and not be stopped or rewound at all during the session. This removes the chance of over-recording wanted portions, reduces the operator effort and allows later analysis of a number of features not captured on the direct-to-disc recordings, such as number of retakes, motivation level of subject, coughs etc.
The recordings will take place in an anechoic room.
The subject will not be provided with an electrical sidetone path for their own voice.
The lighting and temperature levels will be such as to cause no stress to the subject and care must be taken to ensure that there is no distracting light reflection off the prompting screen. Care must also be taken to ensure that the anechoic nature of the recording is not compromised by allowing significant acoustic reflections to be picked up on the microphone. The prompting monitor should be slightly angled and there should be no tables or other sound reflecting surfaces between the subject and the microphone (or tables to be covered with sound absorbing material).
It has been agreed that the IASM-A CORPUS should contain no speaking errors. Further, the inter-utterance pauses should be captured in their entirety and there should be no DISCONTINUITIES within a TAKE.
PROMPTING STYLE 1 WILL BE USED TOGETHER WITH RECORDING MODE 1. THERE SHALL BE ONLY ONE TAKE PER FILE.
The recording system will be integrated with the prompting system and this is to be as automatic as possible (see EUROPEC Section III below). It is clearly possible for the prompting system to start the recording system before the first prompt. In the general case of RECORDING MODE 1 as discussed in the earlier part of this document, the end of the final utterance is not defined. However, it has been agreed that a base level annotation will be automatically produced during the TAKE. This will consist of the prompted orthographic text along with the endpoints of each utterance. As each endpoint is known to the prompting/recording system, then this information can be used to stop the recording process at the end of the final utterance.
A TIME ALIGNED ORTHOGRAPHIC ANNOTATION WILL BE PRODUCED AUTOMATICALLY DURING THE TAKE --- THE RECORDING PROCESS WILL BE STOPPED AUTOMATICALLY AFTER THE FINAL UTTERANCE.
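The automatic stop described above can be sketched as follows. This is an illustration only: the `Utterance` record, its field names and the `stop_time` helper are hypothetical and do not reflect the actual EUROPEC annotation format. Times are assumed to be in seconds from the start of the TAKE.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Utterance:
    """One entry of a hypothetical time-aligned orthographic annotation."""
    text: str     # prompted orthographic text
    start: float  # utterance start, seconds from start of TAKE
    end: float    # utterance endpoint, seconds from start of TAKE


def stop_time(annotation: List[Utterance]) -> float:
    """Return the time at which recording can stop: the endpoint of the final utterance."""
    if not annotation:
        raise ValueError("annotation is empty; no endpoint available")
    return annotation[-1].end


# Invented example data for a two-utterance TAKE
take = [
    Utterance("one two three", 0.8, 2.1),
    Utterance("four five six", 3.4, 4.9),
]
print(stop_time(take))
```

Because every endpoint is already known to the prompting/recording system, no separate end-of-speech detection is needed for the final utterance in this sketch.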
However, if it is found in the light of experience that significant SESSION time is wasted in re-recording TAKES, or that subjects are being excessively stressed by the `ABORT TAKE on error' style, then the next most appropriate technique would be to `BACKUP-ON-FLY' and excise any errors from the TAKE. This technique will require an addition to the previously agreed standard of annotation file, namely a label to indicate the position of the DISCONTINUITY so caused.
PROMPTING STYLE 2 AND RECORDING MODE 2 WILL BE USED ONLY EXCEPTIONALLY AND WITH THE AGREEMENT OF OTHER MEMBERS OF THE CONSORTIUM --- A DISCONTINUITY MARKER TO BE ADDED TO THE ANNOTATION FILE. ONE TAKE PER FILE.
It has been suggested that a regular timing strategy puts too much stress on some speakers and that there is evidence of excessive speaking rate when endpoint timing (i.e. speaker driven prompting rate) is used.
MIXED TIMING STRATEGY WILL BE USED
The exact value for tP (the minimum prompting interval in this strategy) depends on the display time (DLA: [time in seconds] in the prompt file) plus the stop back-off time.
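As a rough illustration of the relation above, the sketch below computes tP from the DLA display time and a stop back-off time, and then schedules the next prompt no earlier than tP after the previous one. The function names, the 0.5 s back-off value, and the rule that the next prompt waits for whichever is later (the detected endpoint or the minimum interval) are assumptions made for illustration, not values or behaviour taken from EUROPEC.

```python
def min_prompt_interval(display_time_s: float, stop_backoff_s: float) -> float:
    """tP = display time (DLA in the prompt file) + stop back-off time."""
    return display_time_s + stop_backoff_s


def next_prompt_time(prev_prompt_s: float, endpoint_s: float, t_p: float) -> float:
    """Assumed mixed-timing rule: prompt at the later of the endpoint and prev + tP."""
    return max(endpoint_s, prev_prompt_s + t_p)


t_p = min_prompt_interval(display_time_s=2.0, stop_backoff_s=0.5)  # assumed values
print(t_p)                                # 2.5
print(next_prompt_time(10.0, 11.2, t_p))  # 12.5: speaker finished before tP elapsed
print(next_prompt_time(10.0, 13.7, t_p))  # 13.7: speaker finished after tP elapsed
```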
The subject will be prompted by an automatic system: EUROPEC running on the SESAM workstation positioned outside the anechoic room. The EUROPEC system will be controlled by a recording manager (operator), viewing the standard SESAM monitor. The subject will use a second monitor placed in the anechoic chamber, taking its video input from a `T' connection from the SESAM video output. The size and positioning of the monitor will allow the subject to see the prompted text and level meter without stress. The positioning of the monitor with respect to the microphone will ensure that there is no detectable noise pickup, either acoustically or electrically (particularly at line and frame frequencies). Bochum University suggest that by using EGA or VGA graphics mode the line frequency is above 20 kHz and therefore unlikely to be a source of interference.
The speaker to microphone distance, azimuth and elevation will be controlled by ensuring minimal head movement commensurate with minimum subject stress. A headrest is highly recommended.
The subject will have, for the material/country combinations specified in the recording material section, an acoustical prompt generated by a second SESAM fed by the RS232 output of EUROPEC.
The operator will listen to the subject at all times via the recording chain. This will allow subject-operator communication and enable the operator to do a 100% check on the speech material content. Bochum University suggest that 2 operators should be employed simultaneously to reduce fatigue. The headphone output of the PCM/DAT digital audio safety backup machine would be a suitable source for audio monitoring.
The operator will speak to the subject in the anechoic room via a one-way intercom/loudspeaker system, which will be switched off during takes. The gain of the intercom system shall be set to a level at the ear representing speech 1 metre from the lips of an average talker. The setting of this level is important as it will tend to affect the speaking level of the subject, and it should be consistent across recording sites.
Speaking effort will be controlled for some speaker/material combinations as specified in the reference material section. This will be done by the use of a speech level meter displayed on prompting monitor by EUROPEC. There will be three modes of operation:
Speaking style will be controlled by:
Speaking rate will not be controlled except that the maximum rate of presentation of prompted items will be at a rate set within EUROPEC dependent on the material. This rate will be defined within the choice of material section.
There will be no attempt to produce stress in the subject. The total amount of material recorded in each session will be restricted to 8 minutes representing approximately a 30 minute elapsed time session. This limit will minimise stress on the subject and represents the maximum time that can be recorded on a 20 Megabyte disc.
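The 8 minute figure can be checked with a short calculation. The sketch assumes single-channel recording at the 20 kHz sampling rate, with each sample stored as a 16-bit (2 byte) word and no file header or compression overhead; the function name is illustrative.

```python
def recording_bytes(duration_s: float, sample_rate_hz: int = 20_000,
                    bytes_per_sample: int = 2, channels: int = 1) -> int:
    """Raw storage needed for uncompressed PCM audio, ignoring any file headers."""
    return int(duration_s * sample_rate_hz * bytes_per_sample * channels)


eight_minutes = recording_bytes(8 * 60)
print(eight_minutes)        # 19200000 bytes
print(eight_minutes / 1e6)  # 19.2 (megabytes, taking 1 MB = 10**6 bytes)
```

Under these assumptions 8 minutes of speech occupies about 19.2 Mbytes, which is consistent with the 20 Megabyte limit; a two-channel recording would need twice as much.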
The recording manager will ensure that the subject is in a relaxed condition.
There will be no attempt to control any speaking effect due to the time of day of the session.
At each session the speaker code of the subject will be confirmed. (Initially this will be done by checking a printed list of speaker details --- an automatic system will subsequently be used.)
Session specific information will be entered via the EUROPEC system --- some manually and some from the conditions file.
At each session the recording manager will provide an outline of the procedure for that session. This briefing should be in written or recorded form so as to achieve consistency between subjects.
At each session in the anechoic room the subject will be asked to practise speaking typical material for that session, with the level meter switched off. A note will be made of the peak level during the practice and the gain of the OROS AU21 will be set such that this peak level is 12--15 dB below maximum possible recording level. A large number of experienced recording sites have rejected the original suggestion of 6 dB head room. They feel that 12 to 15 dB would be safer and would reduce the possibility of takes having to be re-recorded due to overload. This in turn will reduce the stress on inexperienced speakers. The gain setting will be automatically recorded for inclusion in the session dependent recording conditions file.
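The headroom rule can be expressed as a small calculation. The sketch below assumes 16-bit signed samples (full scale 32768) and a hypothetical practice peak value; it reports how far the peak lies below full scale and the gain change needed to reach a chosen headroom within the 12 to 15 dB range. The function names and the 13.5 dB mid-range target are illustrative and not part of EUROPEC.

```python
import math

FULL_SCALE = 32768  # magnitude of full scale for 16-bit signed samples


def peak_dbfs(peak_sample: int) -> float:
    """Peak level in dB relative to full scale (0 dBFS = maximum recording level)."""
    if peak_sample <= 0:
        raise ValueError("peak sample magnitude must be positive")
    return 20.0 * math.log10(peak_sample / FULL_SCALE)


def gain_change_db(peak_sample: int, headroom_db: float = 13.5) -> float:
    """Gain change (dB) that places the practice peak headroom_db below full scale."""
    return -headroom_db - peak_dbfs(peak_sample)


practice_peak = 16384  # hypothetical peak magnitude seen during practice
print(round(peak_dbfs(practice_peak), 2))       # -6.02: only about 6 dB of headroom
print(round(gain_change_db(practice_peak), 2))  # -7.48: reduce gain by about 7.5 dB
```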
If the session is to involve the use of control by level meter the subject should practise controlling their level with respect to the meter, for all (3) level meter gain settings.
The recording conditions/material etc. for each take will be entered into EUROPEC --- this will cause the correct file names to be generated.
At each take the subject will, where appropriate, be given the opportunity to preview the prompt material. This is particularly needed for continuous speech material. In some cases, e.g. for particular forms of numbers, orthographically ambiguous nonsense material etc., additional instructions need to be given, and the subject given time for familiarisation with the task before the session.
The recording manager will monitor the speech production with reference to the prompted text. If there are any errors the take will be stopped and the subject asked to start the relevant take again. The disc file will be discarded and the same serial number reused for the re-recording. The digital audio tape recorder safety backup will not be stopped due to re-recording of errors --- only at the end of a session. If after the integrity check a take has to be re-recorded then the digital audio safety backup recorder should be restarted.
At the end of each session the quality of the recordings should be assessed by a 10% review, and the integrity of the item end-point labelling should be 100% checked by using the VERIPEC module within the EUROPEC software.
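One way to pick the files for the 10% quality review is a random sample of the session's takes. The sketch below is only an illustration: the file names are invented, and drawing the sample at random (rather than, say, every tenth take) is an assumption, since the protocol does not say how the 10% should be chosen.

```python
import math
import random
from typing import List, Optional


def review_sample(files: List[str], fraction: float = 0.10,
                  seed: Optional[int] = None) -> List[str]:
    """Pick a random subset of files for listening review (at least one file)."""
    if not files:
        return []
    count = max(1, math.ceil(len(files) * fraction))
    rng = random.Random(seed)
    return sorted(rng.sample(files, count))


session_files = [f"take{n:03d}.dat" for n in range(1, 41)]  # hypothetical names
print(review_sample(session_files, seed=1))
```

Fixing the seed makes the selection reproducible, which may help if the review has to be repeated or audited later.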
At the end of each session all files produced should be transferred to an Exabyte 8mm tape system, or via Ethernet to a local mainframe, for backup. The available disc space on the SESAM workstation is approximately 20 Mbytes, representing 8 minutes of speech. This sets a convenient and reasonable maximum to the recording session and represents approximately 30--40 minutes of elapsed time per session. All files produced during the session should also be backed up at the same time, since the data is a valuable resource and files are easily lost. The digital audio safety backup tapes produced for each session shall be kept indefinitely as an archive of the session. It is expected that 4 to 6 sessions can be accommodated per tape, and care must be exercised not to over-record a previous session.
If a speech file is lost AND its computer backup is also lost (OR the backup was never made!) then the digital tape recording safety backup will have to be used. This raises a number of questions. If the recordings are transferred to computer form in the digital domain, then an OROS AI-PCM type interface will be required, but this does not allow a resampling rate of 20 kHz; 19.8 kHz is the closest possible. This retrieval method is only possible if the recording was made on a PCM or DAT machine where the pre-emphasis was switched off. The alternative is to copy in the analogue domain; the problems here are:
The recording chain will be calibrated in two ways:
A recording of both calibrator signals shall be made every 20 sessions, providing confirmation of gain calibration, and recording path frequency response, phase response and noise figure. The length of recording of each calibrator shall be 10 seconds plus 10 seconds of no signal condition. It would be sensible to check the signal-to-noise ratio using these files at the site to confirm the correct working of the system.
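A site-level signal-to-noise check on a calibration file could look like the sketch below. It assumes the file has been read into a list of samples with the 10 second calibrator signal first and the 10 second no-signal portion second, at the 20 kHz sampling rate. This layout and the function names are assumptions, and the synthetic tone-plus-noise data merely stands in for a real recording.

```python
import math
import random
from typing import Sequence


def rms(samples: Sequence[float]) -> float:
    """Root-mean-square value of a block of samples."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def snr_db(recording: Sequence[float], sample_rate_hz: int = 20_000,
           segment_s: float = 10.0) -> float:
    """SNR in dB: RMS of the calibrator segment over RMS of the no-signal segment."""
    n = int(segment_s * sample_rate_hz)
    signal, noise = recording[:n], recording[n:2 * n]
    return 20.0 * math.log10(rms(signal) / rms(noise))


# Synthetic stand-in: a 1 kHz tone of amplitude 10000, then low-level uniform noise
rng = random.Random(0)
fs, n = 20_000, 200_000
tone = [10000 * math.sin(2 * math.pi * 1000 * i / fs) for i in range(n)]
silence = [rng.uniform(-10, 10) for _ in range(n)]
print(round(snr_db(tone + silence), 1))
```

In a real file the first segment contains signal plus noise, so the figure is strictly a (signal+noise)-to-noise ratio; at high ratios the difference is negligible.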
The Laryngograph channel shall also be calibrated, using the UCL 4-7 mark-to-space ratio rectangular wave calibrator.
A pilot set of recordings will be made at each site consisting of 1 speaker speaking 10% of all the sentence and number material and all CVCs. This pilot material from each site will be transferred using floppy discs, one each for calibration, sentences, CVCs and numbers. The material will consist of the speech files together with corpus description files, labelling files, speaker file, text files and conditions file.
The floppy discs will be sent to NPL where the quality and consistency will be assessed by a panel of `experts'. Modifications and clarifications to this set of protocols may then be made.
All the recordings at one site will be collated and checked. Checks will be made for:
The central collation point will configure the material in the correct directory structure for the production of one or more sets of CD-ROM.