As our main concern is language, the situation of spoken language resources is assessed on a per language basis. This does not correspond exactly to the situation on a per country basis, as several countries may contribute resources to the development of the same language (for example, Belgium, Switzerland and France all have French linguistic resources), or one country may contribute resources for several languages (Switzerland may contribute to French or German, and the UK has produced the Oxford Acoustic Phonetic database which contains spoken data for several European languages).
Ideally all the European languages should be adequately represented, but from a commercial standpoint the importance of a language clearly depends on the potential market demand. Although all European languages are of potential interest in the European market, large companies tend to prefer English, German, French, Italian and Spanish. This strengthens the need for multilingual EEC initiatives in this domain which can attend to the under-represented European languages, since, at least in the immediate future, such needs cannot be expected to be met by industrial demand.
To aid in assessing the current situation in Europe, we provide guidelines on characterizing the spoken language resources, and then summarise the main speech resources in Europe and identify the respective actors in their production. A more complete survey on a per country basis can be found in annex .
There are as many types of corpora as there are relevant factors which can be used to define them: speakers, texts, speech type, recording conditions, tasks and so on. Within this wide range, corpora may be characterised according to their intended use:
Number of speakers: medium (100--500)
Several repetitions.
Number of speakers: small to medium (10--200)
Recently the trend has been to increase the number of speakers in such corpora.
Number of speakers: small
These corpora tend to be relatively expensive to collect and may require sophisticated recording facilities and sensors, as well as specialised operators.
Material: word lists, numbers, spelled names
Number of speakers: large (several hundreds to several thousands)
Domains: Information retrieval --- travel inquiry (train and flight information and reservation), leisure activities, telephone services.
Vocal dictation --- medical, legal and insurance areas. Adverse environments: Car database, Handicap applications.
Number of speakers: variable (small to medium)
The two extremes in corpus type are, at one end, very specific corpora for fundamental research, which may require complex multi-channel recording conditions and a small number of speakers, and, at the other, application-specific corpora, which may be recorded over the telephone with a large number of speakers. In addition to the recorded speech signal, we must highlight the importance of, and the effort required for, providing the appropriate associated information. This associated information depends heavily on the type of corpus, but at a minimum must include relevant speaker information, transcriptions (at a minimum an orthographic transliteration), prompt material in the case of read-speech corpora, lexica, noise or channel characteristics and details of the recording configuration.
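The minimum associated information listed above can be thought of as a checklist attached to each corpus. The following is a minimal sketch in Python of such a checklist, not a format prescribed by this report: the class name `CorpusMetadata`, its field names and the helper `missing_items` are all hypothetical, chosen only for illustration. The one rule it encodes from the text is that prompt material is required only for read-speech corpora.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class CorpusMetadata:
    """Hypothetical record of the information accompanying a speech corpus."""
    name: str
    read_speech: bool  # True if the speakers read prompted material
    # speaker id -> attributes (for example sex, age, accent)
    speaker_info: Dict[str, dict] = field(default_factory=dict)
    # utterance id -> orthographic transcription
    transcriptions: Dict[str, str] = field(default_factory=dict)
    # utterance id -> prompt text (only meaningful for read speech)
    prompts: Dict[str, str] = field(default_factory=dict)
    # word -> list of pronunciations
    lexicon: Dict[str, List[str]] = field(default_factory=dict)
    # free-text description of noise or channel characteristics
    channel_notes: Optional[str] = None
    # free-text description of the recording configuration
    recording_setup: Optional[str] = None


def missing_items(meta: CorpusMetadata) -> List[str]:
    """Return the minimum associated items that are absent from `meta`."""
    missing = []
    if not meta.speaker_info:
        missing.append("speaker information")
    if not meta.transcriptions:
        missing.append("transcriptions")
    # Prompt material is only required for read-speech corpora.
    if meta.read_speech and not meta.prompts:
        missing.append("prompt material")
    if not meta.lexicon:
        missing.append("lexicon")
    if not meta.channel_notes:
        missing.append("noise or channel characteristics")
    if not meta.recording_setup:
        missing.append("recording configuration")
    return missing


if __name__ == "__main__":
    # Illustrative data only; not taken from any real corpus.
    example = CorpusMetadata(
        name="demo",
        read_speech=True,
        speaker_info={"spk01": {"sex": "F", "age": 34}},
        transcriptions={"utt001": "nine three five"},
    )
    print(missing_items(example))
```

A check of this kind could be run before a corpus is distributed, so that gaps in the associated information are found while the recording conditions can still be documented.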
a) need 1.1, 1.2, 1.3; provide 1.1, 1.2 with a), b), c); provide 1.3 with a)
b) needs variable; provide 1.1, 1.2 with a), d); provide 2 with c); provide 3 with d), a)
c) need 1.1, 2; provide 1.1 with a), b); provide 2 with b)
d) need 1.1, 1.2, 2, 3; provide 1), 2) with b)
Having reviewed the existing resources, the traditional actors and the on-going projects, we assess the current situation as follows. Our starting assumption is that the under-represented European languages will need at a minimum the resources that better-represented languages already have (at least the basic resources), and that well-represented languages will need still more resources. These resources will come from ongoing projects, and further needs can be foreseen through interviews with relevant actors in the speech research community and industry (ISC).