Spoken language is central to human communication and has significant links to both national identity and individual existence. With the increasing availability and capability of computing resources, there has been and will continue to be a large expansion in computer-based language technologies. These technologies include speech recognition and synthesis, vocal access to information retrieval systems, speech understanding (or spoken language) systems, and spoken language translation. Central to the progress made in spoken language technologies are large corpora of speech with associated text, transcriptions, and lexica.
The structure of spoken language is shaped by many factors, including the phonological, syntactic and prosodic structure of the language being spoken, the acoustic environment in which it is produced, and the communication channel. The speech signal is produced differently by each speaker: each has a unique vocal tract, which assigns its own signature to the signal. Speakers have different dialects, accents and speaking rates, and their speech patterns are influenced by their emotional and physical state, the context in which they are speaking (e.g., reading aloud, in conversation, giving a lecture), and the acoustic environment. Due to the many sources of variability in the speech signal, a great deal of speech data is needed to model different speech characteristics, and in particular, different dialects and accents.
Recent activities, such as the creation of the Linguistic Data Consortium (LDC) and the Center for Spoken Language Understanding at the Oregon Graduate Institute (OGI) in the U.S. and the LRE Relator project in Europe, national efforts in Japan, Australia and China, as well as the international Coordinating Committee for Speech Databases and Assessment (COCOSDA), point to a growing worldwide awareness of the need for and importance of large, publicly available common corpora for the development and evaluation of language technologies, particularly speech recognition and spoken language understanding, as well as for the development and assessment of speech synthesisers. These corpora allow scientists to study, understand, and model the different sources of variability, and to develop, evaluate and compare speech technologies on a common basis.
Corpora collection in Europe is the result of both national efforts and efforts sponsored by the European Community. Several ESPRIT projects have attempted to create comparable multilingual speech corpora in some or all of the official European languages. The first multilingual speech collection action in Europe took place in 1989, consisting of comparable speech material recorded in five languages: Danish, Dutch, English, French and Italian. The entire corpus, now known as EUROM0, includes 8 languages: Danish, Dutch, English, French, German, Italian, Norwegian and Swedish. Other corpora resulting from CEC projects include: SAM/SAM-A EUROM1 (11 languages: Danish, Dutch, English, French, German, Greek, Italian, Norwegian, Portuguese, Spanish and Swedish), ARS (Adverse Recognition System: Italian, English? **langs**), POLYGLOT (7-language IWSR database: Dutch, English, French, German, Greek, and Spanish; 5-language TTS database: Dutch, English, French, German, and Greek), ROARS (Robust Analytical Recognition System: Spanish, ?? **langs**), SPELL (Interactive System for Spoken European Language Training: French, Italian and English), SUNDIAL (spoken language queries in the travel domain for English, French, Italian and German), SUNSTAR (Integration and Design of Speech Understanding Interfaces: English, German, Danish, Spanish and Italian), and ACCOR (cross-language acoustic-articulatory correlations: Catalan, English, French, German, Irish Gaelic, Italian and Swedish).
What follows is a brief overview of the status of linguistic resources in the European countries, as well as a summary of some of the corpora resulting from European Community projects. This list is drawn from the survey conducted in Europe, the complete version of which is presented in the annex.