
Computer hardware and software

Current hardware uses a variety of in-house or more widespread standards. For example, coding is 8 or 16 bits on Apple (Mac) and PCs, U-LAW 8 or 16 bits on Sun (Sparc), NeXT, VAX, DEC, and U-LAW or A-LAW on HP. Available sampling rates are often limited to 8 kHz in the Unix world, but higher rates may be available in the PC (DOS/Windows) and Mac worlds, depending on whether a standard or a professional I/O board is fitted.
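As an illustration of the U-LAW coding mentioned above, here is a minimal Python sketch of the continuous mu-law companding curve (mu = 255) quantized to 8 bits. It is not the bit-exact segment encoding implemented by actual hardware, and all function names are assumptions made for this example.

```python
import math

MU = 255.0  # companding constant used by 8-bit mu-law


def mulaw_compress(x: float) -> float:
    """Map a linear sample in [-1, 1] onto the companded range [-1, 1]."""
    x = max(-1.0, min(1.0, x))
    return math.copysign(math.log1p(MU * abs(x)) / math.log1p(MU), x)


def mulaw_expand(y: float) -> float:
    """Inverse of mulaw_compress."""
    y = max(-1.0, min(1.0, y))
    return math.copysign(math.expm1(abs(y) * math.log1p(MU)) / MU, y)


def encode_8bit(x: float) -> int:
    """Quantize the companded value to one of 256 codes (0..255)."""
    return int(round((mulaw_compress(x) + 1.0) / 2.0 * 255.0))


def decode_8bit(code: int) -> float:
    """Recover an approximate linear sample from an 8-bit code."""
    return mulaw_expand(code / 255.0 * 2.0 - 1.0)
```

The logarithmic curve spends more of the 256 codes on low-amplitude samples, which is why 8-bit U-LAW is usable for speech where 8-bit linear coding would be too coarse.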

File formats are often indicated by the filename extension they bear. Computer manufacturers such as NeXT and Sun deal with .au (AU) or .snd (SND) files, Apple and Silicon Graphics with .aif (AIF); I/O board manufacturers may promote their own format (such as .voc for the SoundBlaster board), and Microsoft, the developer of the Windows operating system, tries of course to impose its .wav (WAVE) format. The situation is complicated by the encoding method (linear, compressed, data and information intermingled...), and even for the same filename extension the implementation may vary a bit between operating systems (WAVE in Windows or Unix environments, SND in NeXT or PC/Mac environments). A standardisation initiative has come through the development of the Internet, which promotes an interchange format called MIME.
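To make the idea of a header-plus-data format concrete, the following Python sketch decodes the fixed header of a Sun/NeXT-style .au file: six big-endian 32-bit fields (magic number, data offset, data size, encoding, sample rate, channels). It covers only that variant and only three encoding codes; the function name and the hand-built example header are assumptions for illustration.

```python
import struct

AU_MAGIC = 0x2E736E64  # the ASCII bytes ".snd"
AU_ENCODINGS = {1: "8-bit mu-law", 2: "8-bit linear", 3: "16-bit linear"}


def parse_au_header(data: bytes) -> dict:
    """Decode the six big-endian 32-bit fields at the start of an AU file."""
    if len(data) < 24:
        raise ValueError("too short for an AU header")
    magic, offset, size, encoding, rate, channels = struct.unpack(">6I", data[:24])
    if magic != AU_MAGIC:
        raise ValueError("not an AU/SND file")
    return {
        "data_offset": offset,
        "data_size": size,
        "encoding": AU_ENCODINGS.get(encoding, f"other ({encoding})"),
        "sample_rate": rate,
        "channels": channels,
    }


# Hand-built header: 8-bit mu-law, 8000 Hz, mono, 16000 bytes of audio.
example = struct.pack(">6I", AU_MAGIC, 24, 16000, 1, 8000, 1)
print(parse_au_header(example))
```

A file carrying the same extension but written by a different system may not follow this layout, which is the kind of variation described above.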

A major example of the constraints imposed on the speech research community by the market can be seen by looking at the implications of multimedia standard development in the PC world.

Multimedia standard

The world of PCs has evolved considerably during the past few years along two relevant dimensions:

Operating system: the Windows operating system is now used worldwide, and it provides a suitable graphical interface.
I/O boards: the development of multimedia functionality has required low-cost I/O boards that can easily be included in a low-end PC configuration (such as SoundBlaster, Pro Audio Spectrum...).

The question now is whether these current boards, primarily dedicated to audio output, can satisfy the needs of speech research and applications in terms of:

Quality of signal (signal/noise ratio...)
Sampling frequency: the multimedia standard is basically derived from the CD Audio standard (44.1 kHz) or the DAT one (48 kHz). Most multimedia-compatible I/O boards therefore offer sampling rates obtained by successive integer divisions of this basic frequency (22.05, 11.025 kHz etc.). But the sampling rates used in our current speech databases are at present 16 kHz, 20 kHz... Care should be taken that a continuum of sampling frequencies (let's say from 5 to 50 kHz) is available on these boards, to satisfy the requirements of the speech research community. It is foreseen that not all of the current cheap boards will be suitable. Otherwise, on-line re-sampling techniques would be required (*) to maintain compatibility with existing databases, and for future databases the speech community would have to adopt a standard ``audio'' sampling rate.
File format: the same applies to the multimedia file formats. Most of the boards use a ``standard'' (or board-specific) file format definition, the main one being the WAVE format (.wav). This means that these boards cannot directly play the files from our existing databases (SAM or national), which are in a 16-bit linear format, whereas a WAVE file consists of chunks of data intermingled with chunks of encoding information. The files of these databases would have to be converted (**) on output from one format to the other in order to be played. Future databases should either adopt a new ``market'' standard, or have their files converted on input and output.
Number of channels available (2 or more channels may be required for recordings from several microphones or sensors).
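The sampling-rate mismatch described in the list above can be checked with a little arithmetic. The Python sketch below lists the rates obtained by dividing the 44.1 kHz CD rate by 1, 2 and 4, and then computes the reduced ratio L/M that a rational resampler would need to bring a 16 kHz or 20 kHz database file to 22.05 kHz. The helper name is an assumption, and the sketch only computes the ratio; it does not perform any filtering or resampling.

```python
from fractions import Fraction

CD_RATE = 44100  # Hz, CD Audio base rate

# Rates reachable by dividing the CD base rate by 1, 2 and 4.
board_rates = [CD_RATE // d for d in (1, 2, 4)]
print(board_rates)  # [44100, 22050, 11025]


def resampling_ratio(source_hz: int, target_hz: int) -> Fraction:
    """Return target/source as a reduced fraction L/M.

    A rational resampler would interpolate by L and decimate by M.
    """
    return Fraction(target_hz, source_hz)


for db_rate in (16000, 20000):
    ratio = resampling_ratio(db_rate, 22050)
    print(db_rate, "->", 22050, ":", ratio.numerator, "/", ratio.denominator)
```

Neither database rate divides the board rates evenly: the ratios come out as 441/320 and 441/400, so the conversion is a genuine resampling operation rather than a simple sample drop or repeat.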

(*)(**) Using I/O boards without a DSP implies that some signal processing will be moved to the PC itself (speech level detection, min/max measurement, possible over- or under-sampling). These on-line procedures, together with on-line format conversion routines, could increase the CPU load to such an extent that a low-end SESAM workstation might not be able to run at a high speech sampling rate, for example, or with two channels.
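The format conversion referred to in (**) can be sketched in Python with the standard wave module: headerless 16-bit linear samples are wrapped in a RIFF WAVE container so that a WAVE-only player can read them. The function name is hypothetical, and the byte order of any given database file is an assumption the caller must supply; nothing here is specific to the SAM file layout.

```python
import io
import wave


def raw_to_wave(raw: bytes, sample_rate: int, channels: int = 1,
                big_endian: bool = False) -> bytes:
    """Wrap headerless 16-bit linear PCM in a RIFF WAVE container."""
    if len(raw) % 2:
        raise ValueError("16-bit data must have an even number of bytes")
    if big_endian:
        # WAVE stores 16-bit samples little-endian, so swap each byte pair.
        swapped = bytearray(len(raw))
        swapped[0::2] = raw[1::2]
        swapped[1::2] = raw[0::2]
        raw = bytes(swapped)
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as out:
        out.setnchannels(channels)
        out.setsampwidth(2)  # 2 bytes = 16 bits per sample
        out.setframerate(sample_rate)
        out.writeframes(raw)
    return buffer.getvalue()
```

The byte swap is the step most easily overlooked when files move between machines of different byte order; done on-line for every buffer, it adds to the CPU load discussed above.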

One topic is backward compatibility with existing databases; another is WHICH format is going to be THE STANDARD, i.e. the worldwide audio/computer/speech standard. This topic is to be considered during the SPEECHDAT project, but it is foreseen that no unique standard will emerge and that conversion routines will remain a big issue. A lot of tools are available but, as an example, even for the RIFF WAVE format the conversion between the Windows and Unix worlds is anything but trivial. At the moment, it is not certain whether the standard interchangeable I/O boards currently on the market will satisfactorily meet the needs of speech research; this depends on the target application.


WWW Administrator
Fri May 19 11:53:36 MET DST 1995