
Criteria for assessment of the situation of Spoken Language Resources

As our main concern is language, the situation of spoken language resources is assessed on a per language basis. This does not correspond exactly to the situation on a per country basis, as several countries may contribute resources to the development of the same language (for example, Belgium, Switzerland and France all have French linguistic resources), or one country may contribute resources for several languages (Switzerland may contribute to French or German, and the UK has produced the Oxford Acoustic Phonetic database which contains spoken data for several European languages).

While ideally all European languages should be adequately represented, from a commercial standpoint the importance of a language depends on potential market demand. Although all European languages are of potential interest in the European market, large companies tend to prefer English, German, French, Italian and Spanish. This strengthens the need for multilingual EEC initiatives in this domain that can attend to the under-represented European languages, since, at least in the immediate future, industrial demand cannot be expected to fill such needs.

To aid in assessing the current situation in Europe, we provide guidelines on characterizing the spoken language resources, and then summarise the main speech resources in Europe and identify the respective actors in their production. A more complete survey on a per country basis can be found in annex .

Types and specificities of corpora

There are as many types of corpora as there are relevant factors that can be used to define them: speakers, texts, speech type, recording conditions, tasks and so on. Within this wide range, corpora may be characterised according to their intended use:

1. Experimental research:

These corpora are widely used for speech technology development and assessment. Much of the basic material was collected several years ago, and more recent technology requires more advanced materials.

1.1 Basic material:
Numbers, Words, Sentences, Logatoms
Number of speakers: medium (100-500)
Several repetitions.

1.2 Advanced material:
Continuous speech, Passages, situated dialogue.
Number of speakers: small to medium (10-200)
Recently the trend has been to increase the number of speakers in such corpora.

1.3 Specific databases:
Multi-sensor corpora (Lx), articulatory, acoustic and video databases.
Number of speakers: small
These corpora tend to be relatively expensive to collect and may require sophisticated recording facilities and sensors, as well as specialised operators.

2. General-purpose Telephone corpora:

These are used for speech recognition and coding over the telephone. This type of corpus is relatively easy to obtain (the speaker only needs to call a specified telephone number) and relatively cost-effective. However, with advances in communication technology, some of the problems currently posed by the limited bandwidth and noisy communication channel of today's telephones can be expected to disappear.
Material: word lists, numbers, spelled names
Number of speakers: large (several hundreds to several thousands)

3. Application-oriented corpora:

These are collected for specific tasks and/or environments (many of which involve the telephone network). Because they are application-specific, many of these corpora are not easily reused for other applications.
Domains:
Information retrieval: travel inquiry (train and flight information and reservation), leisure activities, telephone services.
Vocal dictation: medical, legal and insurance areas.
Adverse environments: car database, handicap applications.
Number of speakers: variable (small to medium)

The two extremes in corpus type are, at one end, very specific corpora for fundamental research, which may require complex multi-channel recording conditions and involve a small number of speakers, and, at the other, application-specific corpora, which may be recorded over the telephone with a large number of speakers. In addition to the recorded speech signal, we must highlight the importance of, and the effort required for, providing the appropriate associated information. This associated information depends heavily on the type of corpus, but at a minimum must include relevant speaker information, transcriptions (at least an orthographic transliteration), prompt material in the case of read-speech corpora, lexica, noise or channel characteristics and details of the recording configuration.
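
The minimum associated information listed above can be read as a checklist. The following is only a minimal sketch of such a check: the class name, the field names and the `read_speech` flag are hypothetical and do not come from any existing corpus format; only the list of required items is taken from the text.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class CorpusDocumentation:
    """Hypothetical record of the information delivered with a speech corpus."""
    speaker_info: Optional[str] = None
    orthographic_transcription: Optional[str] = None
    prompt_material: Optional[str] = None          # only needed for read speech
    lexicon: Optional[str] = None
    channel_characteristics: Optional[str] = None  # noise or channel description
    recording_configuration: Optional[str] = None
    read_speech: bool = False


def missing_items(doc: CorpusDocumentation) -> List[str]:
    """Return the names of the minimum items that have not been supplied."""
    required = [
        "speaker_info",
        "orthographic_transcription",
        "lexicon",
        "channel_characteristics",
        "recording_configuration",
    ]
    # Prompt material is required only in the case of read-speech corpora.
    if doc.read_speech:
        required.append("prompt_material")
    return [name for name in required if not getattr(doc, name)]
```

Calling `missing_items` on a record for a read-speech corpus that so far has only speaker information and transcriptions would list the lexicon, the channel characteristics, the recording configuration and the prompt material as still to be provided.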

Actors in speech resource production

a. National and/or academic initiatives: In every country where there is a history of speech research, academics (universities, speech research labs) produce the databases they need. However, these corpora tend to be specific to the needs of the producer and remain the property of the producer.
b. EEC initiatives: EEC projects are a catalyst of production. For both basic research and application-oriented databases, they are a way of developing links between academics and industry. Corpora also tend to remain the property of the consortium. As the consortium members are both academic/research and industrial, the needs cover all areas.
c. Telecommunication/telephone sectors: Major telephone operators have historically been interested in speech technology, and most have their own research centers which collect the necessary corpora for their research activities. Most of these corpora are not publicly available.
d. Industry: Companies developing or integrating speech products need application-oriented databases. This is true at both the national and the international level, where foreign languages represent viable market opportunities. The data that can be provided by industry is varied, but for the most part unknown, other than that resulting from EEC initiatives.

The more of these actors are present in a given country, the more we tend to find a developed speech community, with both existing linguistic resources and a strong demand for additional ones. As the speech community grows and the number of speech-based products increases, the amount of resources needed also grows.

   Actor   Needs               Provides
   a)      1.1, 1.2, 1.3       1.1, 1.2 with a), b), c)
                               1.3 with a)
   b)      variable            1.1, 1.2 with a), d)
                               2 with c)
                               3 with d), a)
   c)      1.1, 2              1.1 with a), b)
                               2 with b)
   d)      1.1, 1.2, 2, 3      1), 2) with b)
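
The table of needs and provisions can also be held as plain data and queried. This is only an illustrative sketch: the actor letters and corpus category numbers are those used in this section, while the dictionary layout, the function names and the reading of "with" as "together with these actors" are assumptions.

```python
# Transcription of the needs/provides table. An empty list stands for the
# entry "variable"; the key "1" stands for the entry written "1)".
NEEDS = {
    "a": ["1.1", "1.2", "1.3"],
    "b": [],  # the table only says "variable"
    "c": ["1.1", "2"],
    "d": ["1.1", "1.2", "2", "3"],
}

# For each actor (row), the corpus categories it provides and the actors
# named after "with" for that category (assumed to be its partners).
PROVIDES = {
    "a": {"1.1": ["a", "b", "c"], "1.2": ["a", "b", "c"], "1.3": ["a"]},
    "b": {"1.1": ["a", "d"], "1.2": ["a", "d"], "2": ["c"], "3": ["d", "a"]},
    "c": {"1.1": ["a", "b"], "2": ["b"]},
    "d": {"1": ["b"], "2": ["b"]},
}


def actors_needing(category):
    """Actors whose row lists the given corpus category as a need."""
    return sorted(actor for actor, cats in NEEDS.items() if category in cats)


def partners_for(category):
    """All actors named after "with" for the given category, over all rows."""
    found = set()
    for rows in PROVIDES.values():
        found.update(rows.get(category, []))
    return sorted(found)
```

For instance, `actors_needing("2")` collects the actors that list telephone corpora among their needs, and `partners_for("1.3")` collects the actors involved in producing specific databases.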

Having reviewed the existing resources, the presence of traditional actors, and the ongoing projects, we assess the current situation as follows. Our starting consideration is that the under-represented European languages will need at a minimum the resources that better-represented languages already have (at least the basic resources), and that well-represented languages will need still more resources. These needed resources will come from ongoing projects, and further needs can be foreseen through interviews with relevant actors in the speech research community and industry (ISC).


WWW Administrator
Fri May 19 11:53:36 MET DST 1995