
Speech research

Within the speech research community itself, the formats used for speech databases differ widely, depending on their purpose and the applications they are designed for. The major differences concern the file format itself and the sampling frequency.

Sampling frequency:

There is a variety of sampling frequencies used in existing speech databases. Here are a few examples:

The higher the sampling rate, the more space the corresponding file consumes for the same amount of signal. Ten seconds of speech sampled at 10 kHz correspond to a file length of 200 KB (with 16-bit quantization), but to 800 KB when sampled at 40 kHz. While the usual frequencies tend to rise, because high-quality technology is available and disk storage is becoming cheaper, many fields of speech research remain attached to middle frequencies for various reasons: because they do not need such high quality for their purpose (speech synthesis), because they are tied to technology standards (European telephony standard: PCM A-law (POLYPHONE)), or because their purpose is to deal with low-quality speech (speech recognition) for real applications.
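As a rough check of these figures, here is a minimal sketch (in Python, with durations and rates chosen purely for illustration) of the arithmetic behind the file sizes quoted above:

    # Uncompressed PCM storage: bytes = duration * rate * (bits / 8) * channels
    def pcm_size_bytes(duration_s, rate_hz, bits=16, channels=1):
        # Size in bytes of raw PCM speech (no header, no compression).
        return int(duration_s * rate_hz * (bits // 8) * channels)

    for rate_hz in (8000, 10000, 16000, 40000):      # example rates only
        kb = pcm_size_bytes(10, rate_hz) / 1000
        print("10 s at %5d Hz, 16 bit: %4.0f KB" % (rate_hz, kb))
    # 10 s at 10 kHz -> 200 KB; 10 s at 40 kHz -> 800 KB, as stated above.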

File format:

When one obtains a speech file, it is very unlikely that one can guess its sampling rate, coding or recording conditions, and even less the age of the speaker. It is therefore crucial that information ABOUT the speech signal file be available somehow in order to use it properly. The minimum required information concerns of course the way to access the file (byte order, quantization, sampling rate). But information about the recording conditions, the speaker characteristics, the text of the utterances, and various parameters is more than useful in real speech studies. Two main philosophies are in force: storing the information WITHIN the speech file, or OUTSIDE the speech file, i.e. in an external file. These two approaches both have pros and cons, and are well represented by the NIST/SPHERE format and the SAM format respectively.

* NIST/SPHERE: This format is provided by the National Institute of Standards and Technology (USA), and takes the WITHIN approach, using a SPHERE header. It consists of an ``object-oriented, 1024-byte blocked, ASCII structure which is prepended to the waveform data. The header is composed of a fixed-format portion followed by an object-oriented variable portion.'' The fixed portion is as follows:

    NIST_1A
       1024

The first line specifies the header type and the second line specifies the header length. The remaining object-oriented variable portion is composed of object-type-value ``triple'' lines of the form <object> <type> <value>. The currently defined objects cover database identification and version, utterance identification, channel count, sample count, sampling rate, min and max level, and A/D settings. ``The list may be expanded for future databases, since the grammar does not impose any limit on the number of objects. The file is simply a repository for standard object definitions. The single object end_head marks the end of the active header and the remaining unused header space is undefined'' (but within the 1024-byte limit). (For further information see the ftp site in the Appendices.)
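As an illustration only, here is a minimal sketch of how such a header might be read, assuming the layout described above (a first line NIST_1A, a second line giving the header length, then one object-type-value triple per line until end_head); type codes and object names such as sample_rate are typical examples and may differ from database to database:

    def read_sphere_header(path):
        # Read the fixed-size ASCII header prepended to the waveform data.
        with open(path, "rb") as f:
            if f.readline().strip() != b"NIST_1A":
                raise ValueError("not a SPHERE file")
            header_len = int(f.readline().strip())   # usually 1024
            f.seek(0)
            header = f.read(header_len).decode("ascii", errors="replace")

        fields = {}
        for line in header.splitlines()[2:]:          # skip the fixed portion
            parts = line.split(None, 2)               # object, type, value
            if not parts or parts[0] == "end_head":   # end of active header
                break
            if len(parts) == 3:
                name, typ, value = parts
                if typ == "-i":                       # integer-valued object
                    fields[name] = int(value)
                elif typ == "-r":                     # real-valued object
                    fields[name] = float(value)
                else:                                 # -sN: string of N bytes
                    fields[name] = value
        return header_len, fields

    # e.g. header_len, info = read_sphere_header("utt001.sph")  # hypothetical file name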

The NIST/SPHERE format is widely used in the US and elsewhere, notably for the US and Dutch Polyphone and the French BREF databases. It is supported by NIST, a maintenance path exists, and it comes with a set of tools to handle the header (access, update, remove, replace...). The header approach minimises the risk of losing track of the data identity, and the header can hold both prompt and transliteration texts, BUT it requires the data files to be changed after collection for annotation, and again whenever an upgrade/correction is issued. The header is fixed-length and cannot be expanded or conveniently edited with a plain text editor.

* SAM:
This format is a European 'standard', defined by the SAM consortium (Esprit Project ``SAM'': Speech Assessment & Methodologies). SAM takes the OUTSIDE approach (headerless), using an associated description file. It consists of a SPEECH FILE plus an ASSOCIATED DESCRIPTION FILE: the speech file contains only the speech waveforms, while the associated description file (ASCII) is linked to the speech file.

The files go as a pair, and their names are identical except for the last letter of the extension, according to SAM terminology. The associated description file is a standard label file with a header and a body. It contains all the information usually needed by people working on the files without the database management system. A label file is made of a header and one or several bodies. Each line consists of a specific mnemonic followed by the corresponding value:

<MNEMONIC>: <space><value> (optional <comma><space><value>...)

The label header: from LHD: ... to LBD:
The label body: from LBD: to ELF:, or to a new label body LBD: ...

In a typical annotation file the header contains the database identification, file localisation, file production, A/D settings, sampling rate, start and end samples, number of channels, speaker information, and pointers to the prompt text file, recording conditions and protocol. As the format is designed to allow several items to be stored in one file, the body contains labels for the one or several items recorded in the speech file: sequence beginning (in samples), sequence end, input gain on recording, minimum sample value, maximum sample value and orthographic prompt text are present for each item. Discontinuities between the items are indicated if any. Both the content of the header and of the body can be extended to store new relevant descriptors or labels, provided that adequate mnemonics are created and no contradiction occurs with existing ones.
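As an illustration only, here is a minimal sketch of how such an associated description file could be read, assuming one <MNEMONIC>: value(,value...) entry per line as described above; the mnemonics used in the embedded example (LHD, SAM, NCH, BEG, END, LBD, ELF) are illustrative and the exact set depends on the database:

    EXAMPLE_LABEL_FILE = (
        "LHD: SAM, 6.0\n"      # illustrative label header
        "SAM: 16000\n"         # sampling rate (example mnemonic)
        "NCH: 1\n"             # number of channels
        "LBD:\n"               # start of the label body
        "BEG: 0\n"             # first sample of the item
        "END: 159999\n"        # last sample of the item
        "ELF:\n"               # end of label file
    )

    def parse_sam_labels(text):
        # Split a SAM-style label file into its header and body section(s).
        header, bodies, current = {}, [], None
        for line in text.splitlines():
            if ":" not in line:
                continue
            mnemonic, _, rest = line.partition(":")
            mnemonic = mnemonic.strip()
            values = [v.strip() for v in rest.split(",") if v.strip()]
            if mnemonic == "LBD":              # a new label body starts here
                current = {}
                bodies.append(current)
            elif mnemonic == "ELF":            # end of label file
                break
            elif current is None:
                header[mnemonic] = values      # still in the label header
            else:
                current[mnemonic] = values     # inside a label body
        return header, bodies

    header, bodies = parse_sam_labels(EXAMPLE_LABEL_FILE)
    # header["SAM"] == ["16000"]; bodies[0]["BEG"] == ["0"]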

Further information can be obtained from ?? (UCL?, ELRA?). The SAM format is widely used in Europe for multilingual databases (EUROM1) and for national ones (French BDSONS, English SCRIBE, Italian COLLECT, Spanish ALBAYZIN). The current SpeechDat consortium adopted the SAM format for its telephone recordings. (SAM provided a conversion routine from NIST to SAM format on the DARPA/TIMIT CDROMs.) The associated description file implies that files go together in pairs, which increases the risk of losing files. But the headerless system keeps the data files unchanged after collection during database transcription correction/upgrade. It supports multiple annotation levels. File length is not limited, and the information is available through a simple text editor. The newly formed ELRA (European Language Resources Association) should take care of the maintenance/upgrades of this format.

* Other formats: The VERBMOBIL project in Germany has developed its own format, designed especially to handle dialogue. Examples of databases in Japan (such as JEIDA, ASJ) show no header.

Discussion

So far we have considered that a correct description of a sound data file includes a number of mandatory fields. The first (and minimal) set contains the information needed to use the file:

But researchers need much more information about the speech they study in their work:

The development of speech applications in new domains implies that many other descriptors be available. Descriptions of new data types (multisensor, multimodal, dialogue) are needed, as well as more complex and complete descriptions of the data (dialogue, e.g. in WOZ techniques; multimodal synchronisation; timing notations; additional descriptions such as dialogue flow, emotional state, man-machine situation). Furthermore, the forthcoming development of database distribution and networking will require information about the sources of the data to be available, as well as the way to obtain it and the right to use it.
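As a rough illustration of this distinction between minimal and extended descriptions (the field names below are invented for the example, not taken from any standard), such a description might be organised as:

    from dataclasses import dataclass
    from typing import Optional

    @dataclass
    class SpeechFileDescription:
        # Minimal information needed just to read the samples correctly
        byte_order: str                      # e.g. "little-endian"
        bits_per_sample: int                 # quantization, e.g. 16
        sampling_rate_hz: int                # e.g. 8000
        # Richer information needed in actual speech research
        recording_conditions: Optional[str] = None
        speaker_characteristics: Optional[str] = None
        orthographic_text: Optional[str] = None
        # Descriptors required by database distribution and networking
        data_source: Optional[str] = None
        access_conditions: Optional[str] = None
        usage_rights: Optional[str] = None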

The standardisation carried out in previous large collaborative projects must clearly be enhanced; efforts must be devoted to the representation of more complex information on speech data, with associated description files and pointers to various descriptors (including location of the data, source of the data, transformations applied to the sources, country-of-provenance acknowledgement, restrictions on use, derived information...).


