Next: General conclusions
Up: Overview of speech
Previous: Criteria for assessment
- Fundamental research: EEC initiatives produced basic/advanced
material: EUROM1, EUROM0, SUNSTAR.
- Application-oriented:
National initiative produced recorded/transcribed dialogues in the domain
of flight ticket reservation and information.
- Experimental research: National initiative for advanced material
(dialogue).
- Application-oriented: ONOMASTICA (EEC).
- Telephone: SPEECHDAT (Jydsk Telefon will collect a Polyphone-type
Danish corpus).
There is a national academic effort. The telecom operator has a history
of involvement in EEC projects and in the domain. Activities are carried
out within nationally funded or EEC-funded projects. The basic material
(EUROM1) exists but is not yet available. No other industrial
initiatives are known.
- Experimental research
--- Academics: GRONINGEN corpus (basic research), BLOEMENDAL corpus, DEMSI
corpus, collections of speech recorded on analogue tapes. In coordination
with the Reusable Resources task group of Elsnet, the Groningen corpus has
been produced on CDROM for public distribution.
--- CEC: POLYGLOT, EUROM0, EUROM1, NOISE-ROM-0
- Application-oriented
- Telephone: Dutch POLYPHONE
- Lexica: CLEX (Dutch/German/English, distributed by SPEX and LDC)
- Experimental research: "Dutch National Corpus" (similar to BNC, 10
million words of spoken material to be collected)
- Application-oriented: ONOMASTICA (EEC)
- Telephone
--- POLYPHONE: Dutch corpus of similar design, sponsored by SPEX and PTT
Research.
--- SPEECHDAT (EEC)
--- NWO Priority Program: information system about public transportation via
the telephone.
--- EUROCOCOSDA, COCOSDA
--- Eurescom related Telecom cooperation.
The speech community in Holland is very active. Active participation
in ESCA (president is L. Pols), Elsnet, Eagles, Eurococosda, Cocosda.
Strong cooperation between academics and telecom. Strong national
effort. Industrial involvement unknown.
SPEX, the Dutch Speech Processing Expertise Centre located in
Leidschendam, is interested in becoming involved, as the Dutch node, in a
European infrastructure for database collection and exploitation. It
distributes some corpora. EUROM1 is to be made available on CDROM.
- Experimental research
--- Academics: SCRIBE, HCRC Map Task Corpus: British Normal Speech Corpus
(advanced), Oxford Acoustic Phonetic database (8 languages).
--- EEC: EUROM1, EUROM0, POLYGLOT
- Application-oriented
- Telephone: Bramshill corpus of British English Home Office telephone
recordings (12 CDs), held by the LDC.
English is represented in the EUROCOCOSDA (TED), EAGLES, and ELSNET
initiatives.
- Experimental research: National BE-WSJ0CAM, British National Corpus
(BNC)
- Application-oriented: SQALE, ONOMASTICA (EEC)
- Telephone: SPEECHDAT (POLYPHONE corpus to be collected by GEC Marconi)
Very active speech community in a variety of areas, from fundamental
research to technology development and applications. Many
universities, companies, and the national telecom are involved in the
domain. Strong national effort, with active participation in EEC projects.
Basic material seems to be available. EUROM1 is to be made available as a
multilingual resource (organised by A. Fourcin, UCL). However, at
the current time, the only resources whose dissemination is
well defined are those managed in cooperation with LDC. Other cases
are by local arrangement.
- Fundamental research
--- National initiatives produced basic/advanced material (BDSONS, BREF,
BDBRUIT, ICY, SPOT)
--- EEC actions produced basic (EUROM0, EUROM1, POLYGLOT), specific (ACCOR)
material.
- Application-oriented: There are indications of cooperation between
academics (PSH/DISPE) and industry with the sponsorship of EEC (FREETEL, SPELL).
- Telephone / telecom: Speaker verification corpus (Switzerland)
- Lexicon: BDLEX
French is represented in the main current European (and world-wide)
initiatives such as COCOSDA, EAGLES, EUROCOCOSDA, RELATOR, SQALE,
ONOMASTICA, FREETEL and (through Switzerland) Polyphone and
TED-Martigny. Some of these initiatives deal with linguistic resource
standardisation, dissemination and production. The next resources
to be produced in French will be sponsored by
--- the French-speaking network: FRANCIL (Réseau Francophone de
l'Ingénierie de la Langue): speech synthesis, vocal dictation,
person-machine dialogue
--- EEC: Polyphone-type French corpus by Philips (SPEECHDAT), timetable
enquiries over the telephone (RAILTEL).
- Situation
The French speech community is quite active. All types of actors are
present: many universities, academic institutions, and both public
and private research laboratories are involved in national and EEC
projects, in France but also in Switzerland and Belgium. The recent
creation of FRANCIL demonstrates this activity. Major telephone and
telecom companies operate in the field, and the number of companies
involved demonstrates the interest in speech technology.
- Needs
--- telephone databases (SPEECHDAT)
--- application-ready corpora (over-the-phone, spontaneous, different
conditions...)
--- advanced corpora (dialogue, dictation) (Aupelf)
- Dissemination
There are some attempts at making databases available on the market as
commercial products, by the producer itself or through a licensed
company. Most of the resources produced by academics are available
free to other academics, but distribution to industry is still a
problem and, when possible, is carried out on a case-by-case basis. As most
academics have no way to handle commercial issues properly, the French
organisation GRD-PRC is very interested in setting up a national
repository that would take care of the dissemination of both written
and spoken language resources. There is a strong desire for a European centre
for resource distribution that would take care of issues such as IPR
and licensing agreements.
- Experimental research
--- Academics: basic (PHONDAT1), German Pronunciation Dictionary.
--- EEC: EUROM1, EUROM0, POLYGLOT
- Application-oriented
--- Academics: PHONDAT2, ERBA (train)
--- Industry: SUNSTAR (EEC)
- Telephone: Siemens Telephone Database
Various national, industrial, and EEC initiatives and cooperations, such as:
- Experimental research
--- VERBMOBIL (basic, advanced) by industry and universities
--- VERBMOBIL-PHONDAT (advanced) by academics
--- TEDspeeches, TEDlaryngo (advanced), TEDphone by Eurococosda
--- Articulation of German Vowels (advanced)
--- Publicly Spoken German, a very large database of German utterances
- Application-oriented
--- ONOMASTICA (EEC)
--- Siemens "1000 read sentences" by company/university
- Telephone
--- TEDphone (Eurococosda)
--- SPEECHDAT (EEC) Polyphone type German Corpus by Siemens
--- Stemmer Telephone Database
- Situation:
The German speech community is very active. All actors are present. Many
universities, many large companies, and the national telecom are involved
in the domain.
- Needs:
Basic research material is more or less available, but making the EUROMs
available could be important as part of a truly multilingual corpus.
Application-oriented corpora are requested by industry.
- Dissemination:
Spoken language resources in German are either free for any use,
available for research only, available for project partners only (EEC
projects), or of unknown availability. So it is clear that in Germany,
commercialisation of linguistic resources is an important issue.
- Experimental research: Basic material
--- EEC: EUROM1, POLYGLOT
--- Academics: isolated speech 2000 words (alphadigits).
EEC: ONOMASTICA
Because Greece lags somewhat behind in telephone equipment, speech
technology is not currently a crucial concern there. Investment by
telephone companies is therefore low, and there is no promising national
initiative. The main basic research corpora were produced under EEC
contracts, so it is very important to complete the work done so far
(e.g. making the basic material, EUROM1, available).
- Experimental research
--- EEC: EUROM0, EUROM1
--- National: IRST Acoustic Phonetic corpus of 3000 sentences,
AIDA (CVCV words, digits).
- Application-oriented:
ARS1000 database, SUNSTAR database (EEC)
- Telephone:
COLLECT: telephone customers' utterances
Italy is a member of the COCOSDA and EUROCOCOSDA initiatives.
- Application-oriented:
ONOMASTICA (EEC)
- Telephone
--- POLYPHONE: CSELT is collecting a large telephone speech corpus including
a complete POLYPHONE data set.
--- SPEECHDAT
- Situation:
All actors present. The Italian speech community is active. Academics, the
telecom operator, and companies cooperate.
- Needs: Mainly telephone (SPEECHDAT) and application-oriented corpora.
Advanced material for basic research is also needed.
- Dissemination: Basic research material (EUROM1) was produced on CDROM
and is being marketed by CSELT; the availability of other
corpora is not well defined, but the general tendency
is to produce the corpora on CDs with the possibility of being made available
on a case-by-case exchange basis.
- Basic material: produced through a national (NTH) initiative and an
EEC initiative (EUROMs)
- Telephone corpora: produced by the national telecom, often in
cooperation with Denmark.
- Experimental research
- Application-oriented
- Telephone
--- EEC
--- ONOMASTICA - Norwegian Telecom, SINTEF
--- COST project 232 - NTH multi-accented English speech corpus
collected over dialed-up international telephone lines
--- COST project 249 - NTH, Norwegian Telecom
--- National: Continuous speech recognition over the telephone line.
Research institutes, National Telecom, telephone operator are present.
EUROM1 to be made available as basic material. General lack of
basic research corpora.
- Fundamental research:
EUROM1 (CEC)
- Application-oriented:
SUNSTAR (Portuguese accent for English words)
Portugal is represented in some current European initiatives such as
ONOMASTICA, RELATOR, ELSNET, SPEECHDAT and COST.
- Situation
The Portuguese speech community is relatively small, and its main
actors have so far been universities and academic research
institutions. There is very limited funding for basic research in this
area. Telecom companies have only very recently shown some interest in
speech technology. There are extremely few databases for European
Portuguese, although EUROM1 will soon be made available as basic
material.
- Needs
--- fundamental research (namely in terms of corpora with phonetic and
prosodic labelling)
--- telephone databases (SPEECHDAT)
--- application-specific corpora
--- advanced corpora (dictation, dialogue)
- Experimental research
--- EEC: EUROM1
--- National
Corpus oral de referencia del espanol contemporaneo (Advanced)
ALBAYZIN (Exp. and Appl.)
Automatic Speech Recognition PA85/86
Construccion de Sistemas de Reconocimiento de Habla Mediante
Tecnicas De Aprendizaje Automatico TIC-448/89
--- Industry
TANGORA (IBM, also for other European languages (French, Italian??), but
not available)
- Application-oriented
ROARS, POLYGLOT, SUNSTAR (EEC)
- Telephone
- Experimental research
--- National: ALBAYZIN project, child language project (advanced)
- Application-oriented
ONOMASTICA (EEC)
- Telephone
SPEECHDAT (EEC): polyphone corpus to be collected by Vocalis.
All actors present. Many universities, public or private research
institutes, and the national telecom operator are involved in the domain.
Growing national effort, accompanied by a survey of textual and spoken
corpora in Spanish. No corpus is currently available on CDROM.
Availability and cost information not well defined.
- Experimental research
EEC initiatives produced basic material: EUROM0, EUROM1
National initiative produced basic material: Swedish sentence material.
- Application-oriented
National initiative produced CAR database.
- Experimental research
WAXHOLM dialogue project (national)
ONOMASTICA (EEC)
A prosodic Swedish database project is under discussion.
Some research institutes are involved in speech technology. The effort
on basic Swedish resources was supported by the Swedish government, to
enable participation in EEC projects, and by other national efforts.
There is now a national effort towards application-oriented and advanced
material. But the basic material (EUROM1) is still to be made
available.
Status of telephone-based corpora unknown.
WWW Administrator
Fri May 19 11:53:36 MET DST 1995