Next: What is a Up: Spoken Language Lexica Previous: Spoken Language Lexica

Introduction

Lexica for spoken language systems

Spoken language systems are becoming increasingly versatile, and a central problem in developing such a system is the creation of its lexicon. In other areas of language processing, such as computational linguistics, the lexicon is also taking on a more and more central role. The lexicon of a spoken language system may be designed for broad or narrow coverage, for specific applications, with a particular kind of organisation, and optimised for a specific strategy of lexical search. Since the construction of a lexicon is a highly labour-intensive and thus also error-prone job, a prime requirement is for automatising lexicon development as far as possible, and in transferring lexical information from previous applications to new applications.

The main object of this chapter is to provide a framework for relating such concepts to each other and for the formulation of recommendations for development and use of lexica for spoken language systems.

In this introductory section, some basic concepts connected with the use and structure of lexica in spoken language systems are outlined; in the following sections, specific dimensions of spoken language lexica are discussed in more detail. Discussion is restricted to spoken language system lexica; non-electronic lexica for human use (e.g. pronunciation dictionaries in book form) are not considered. Features common to spoken and written language lexica, such as syntactic and semantic information in lexical entries, are only mentioned in passing; see the contribution of the EAGLES Working Group on Computational Lexica on these points. The close relation between spoken language lexica and speech corpora entails a minimal amount of overlap with the contribution of the Spoken Language Corpus chapter of this handbook, which should also be consulted.

In the remainder of the introductory section of this chapter, recent results in the development of spoken language lexica are summarised. The following sections of the chapter are concerned with basic features of spoken language lexica, lexical information, lexicon structure, lexical access, and lexical knowledge acquisition for spoken language lexica.

Lexical information as properties of words

At the present time, information about lexica for spoken language systems is relatively hard to come by. One reason for this is that such information is largely contained in specifications of particular systems, technical reports and books on various aspects of speech processing. Another is that there is not a close relation between concepts and terminology in the speech processing field, and concepts and terminology in traditional lexicography, natural language processing and computational linguistics. Components such as Hidden Markov Models for word recognition, stochastic language models for word sequence patterns, grapheme-phoneme tables and rules, word-oriented knowledge bases for semantic interpretation or text construction are all concerned with the areas of word identity and lexical access, lexical disambiguation , lexicon architecture and lexical representation, but these relations are not immediately obvious within the context of speech technology as a whole, and stochastic word models, for instance, would not generally be regarded as providing lexical information (though in the strict sense of the term, they evidently do provide such information). For the purposes of this handbook, the broader view will be taken.

It is customary to distinguish between system lexica and lexical databases. The distinction between the two may, in specific cases, be blurred, and the unity of the two concepts may also be rather loose, if the system lexicon is highly modular or several lexical databases are used. However, the distinction is a useful one. The distinction between lexica and lexical databases will be discussed below. Since the kinds of information in both these types of lexical object overlap, the term ``spoken language lexicon'' will generally be used in this chapter to cover both types.

Applications of spoken language lexica

With the advent of organisations for coordinating the use of language resources, such as ELRA (The European Language Resources Association) and the word-wide LDC Linguistic Data Consortium, access to information on spoken language lexica will become more widely available. The following overview is temporary, and therefore necessarily partial.

Types of application for spoken language lexica

Lexica for spoken language are used in a variety of systems, including the following:

Automatic spelling correctors (spelling is determined to a large extent by phonological considerations).
Medium and large-vocabulary automatic speech recognition (ASR), as in systems such as SPICOS described in =1 (
; Höge et al. 1985) , =1 (
; Dreckschmidt 1987) , =1 (
; Ney et al. 1988) , =1 (
; Thurmair 1986) , HEARSAY-II described in =1 (
; Lesser et al. 1975) , =1 (
; Erman 1977) , =1 (
; Erman Lesser 1980) , =1 (
; Erman Hayes-Roth 1981) , SPHINX described in =1 (
; Lee Hon Reddy 1990) , ISADORA, (cf. =1 (
; Schukat-Talamazzini 1993) ), or, for example in automatic dictation machines such as IBM's TANGORA (cf. =1 (
; Averbuch et al. 1986) , =1 (
; Averbuch Bahl Bakis 1987) , =1 (
; Jelinek 1985) ) and DragonDictate by Dragon Systems (cf. =1 (
; Baker 1975a) , =1 (
; Baker 1975b) , =1 (
; Baker 1989) , =1 (
; Baker et al. 1992) ).
Speech synthesis in text-to-speech systems , for example in reading machines, speaking clocks. For further speech synthesis applications, various relevant studies such as =1 (
; Allen Hunnicutt Klatt 1987) , =1 (
; Bailly Benoît 1992) , =1 (
; Bailly 1994) , =1 (
; Van Coile 1989) , =1 (
; Klatt 1982) , =1 (
; Klatt 1987) , =1 (
; Hertz Kadin Karplus 1985) , =1 (
; Van Hemert Adriaens-Porzig Adriaens 1987) can be consulted. See also chapter 4 on spoken language output assessment.
Interactive dialogue systems, with speech front ends to databases and enquiry systems and synthesised responses (see for instance =1 (
; Brietzmann et al. 1983) , =1 (
; Niemann et al. 1985) , =1 (
; Niemann et al. 1992) , =1 (
; Bunt al. 1985) ); see also the chapter on interactive dialogue systems.
Speech-to-speech translation systems as developed in the ATR and VERBMOBIL projects, which use various speech recognition techniques, including continuous speech recognition, recognition of new words, word-spotting in continuous speech. For speech translation systems see for instance =1 (
; Rayner et al. 1993) and
=1 (
; Woszczyna et al. 1993) .
Lexica and encyclopaedias on CD-ROM with multi-media (including acoustic) output.
Research and development of spoken language processing systems, in the process of which broader based lexica for written language, coupled with tools such as grapheme-phoneme converters , may be used as sources of information.

Spoken language lexical databases as a general resource

Spoken language lexica may be components of systems such as those listed above, or reusable background resources. System lexica are generally only of local interest within institutes, companies or projects. Lexical databases as reusable background resources which are intended to be more generally available raise the question of standardised storage and dissemination. In general, the same principles apply as for Spoken Language Corpora: they are stored and disseminated using a variety of media. In research and development contexts, magnetic media (disk or tape) were preferred until recently; in recent years, local magnetic storage and wider informal dissemination within projects or other relevant communities via electronic networks such as Internet using standard file transfer protocols and electronic mail has become the norm. Large lexica, and corpora on which large lexica are based, are in general stored and disseminated in the form of ISO standard CD-ROMs.

The following brief overview can do no more than list a number of examples of current work on spoken language lexicography. At this stage, no claim to exhaustiveness is made, and no valuation of cited or uncited work is intended. It is intended to include a more detailed overview in later versions of this chapter.

A number of general lexica with information relevant to spoken language have already been available on CD-ROM for quite some time, including the Hachette and Robert (9 volume) dictionaries for French, the Oxford English Dictionary, the Duden dictionary for German, and the Franklin Computer Corporation Master 4000 dictionary with acoustic output for 83000 words (cf. =1 (
; Goorfin 1989) ).
Several lexica with more restricted circulation have been developed in the context of speech technology research and development. Companies such as IBM, and telecom research and development institutes such as CNET in France have developed large lexica (CNET, for instance, has a 55000 word and 12000 phrase lexicon).
University and other research institutes have also constructed large lexica; in France, for example, such institutes as ENST in Paris, ICP in Grenoble (cf. =1 (
; Tubach Bok 1985) ), Paris (cf. =1 (
; Plenat 1991) , for a pronunciation dictionary of abbreviations) and IRIT in Toulouse (the BDLEX project) have worked on large spoken language lexica. The BDLEX-1 lexicon coordinated by IRIT (cf. =1 (
; Pérennou De Calmès 1987) ) contains 23000 entries, and BDLEX-2 (cf. =1 (
; Pérennou et al. 1991) , =1 (
; Pérennou et al. 1992) , =1 (
; Pérennou Tihoni 1992) ) contains 50000 entries; a set of linguistic software tools permits the construction of a variety of daughter lexica for spelling correction and lemmatisation , and defines a total of 270000 fully inflected forms.
The Belgian BRULEX psycholinguistic lexicon contains information on uniqueness points (the point in a letter tree where a word form is uniquely identified), lexical fields, phonological patterns and mean digram frequencies for 36000 words (cf. =1 (
; Content Mousty Radeau 1990) ).
In the United Kingdom, the Alvey project resulted in tools and lexical materials (cf. =1 (
; Boguraev et al. 1988) ).
In the Netherlands, the Nijmegen lexical database CELEX (cf. =1 (
; Baayen 1991) ), also available on CD-ROM, contains components with 400000 Dutch forms, 15000 English forms and 51000 German forms, together with an access tool FLEX.
For German, lexical databases for spoken language lexica have been constructed by companies such as Siemens, Daimler-Benz, IBM and Philips, as well as in university speech technology departments (e.g. Munich, Erlangen, Karlsruhe, Bielefeld), and in the VERBMOBIL project ( =1 (
; Gibbon 1995) ,
=1 (
; Gibbon Ehrlich 1995) ).
Work in computational lexicology and computational phonology has led to the development of structured lexicon concepts for spoken language such as ILEX (cf. =1 (
; Gibbon 1992) , =1 (
; Bleiching 1992) ) based on the DATR lexical knowledge representation language (cf. =1 (
; Evans Gazdar 1989) , =1 (
; Evans Gazdar 1990) ); the DATR language has been applied to word form lexica in the multilingual SUNDIAL project (cf. =1 (
; Andry et al. 1992) ) by the German partner Daimler-Benz and in the German VERBMOBIL project (cf. =1 (
; Gibbon 1993) ).
The European Commission has funded a number of projects, particularly within the ESPRIT programme, in which questions of multilingual spoken language system lexica have been addressed, albeit relatively indirectly (POLYGLOT, SUNDIAL, SAM, SAM-A), as well as lexicography projects such as MULTILEX in the ESPRIT programme (cf. =1 (
; Heyer Waldhur Khatchadourian 1991) ), GENELEX in the EUREKA programme (cf. =1 (
; Nossin 1991) ) and AQUILEX, which concentrate on multi-functional written language lexica, though extension of the results to spoken language information has been provided for by the adoption of general sign-based lexicon architectures (see the results of the EAGLES Working Group on Computational Lexica).

Lexica in selected spoken language systems

The range of existing spoken language systems is large, so that only a small selection can be outlined, concentrating on older and established systems whose lexicon requirements are fairly well known; the situation is currently in a state of flux and for this reason the most recent developments are not included. Small vocabulary systems are also excluded, as their strong points are evidently not in the area of the lexicon. The concepts referred to in the descriptions are discussed in the relevant sections below.

HARPY

is a large-vocabulary (1011 words) continuous speech recognition system. It was developed at Carnegie Mellon University. HARPY was the best performing speech recognition system developed under the five-year ARPA project launched in 1971. HARPY makes use of various knowledge sources, including a highly constrained grammar (a finite state grammar in BNF [Backus Naur Form] notation) and lexical knowledge in the form of a pronunciation dictionary that contains alternative pronunciations of each word. Initial attempts to derive within-word phonological variations with a set of phonological rules operating on a baseform failed. A set of juncture rules describes inter-word phonological phenomena such as /p/ deletion at /pm/ junctures: /helpmi/ --- /helmi/. The spectral characteristics of allophones of a given phoneme , including their empirically determined durations , are stored in phone templates . The HARPY system compiles all knowledge into a unified directed graph representation, a transition network of 15,000 states (the so-called blackboard model). Each state in the network corresponds to a spectral template . The spectra of the observed segments are compared with the spectral templates in the network. The system determines which sequence of spectra , that is, which path through the network, provides the best match with the acoustic input spectral sequence.

(cf. =1 (

; Klatt 1977) ; see also =1 (

; Lowerre Reddy 1980) )

HEARSAY-II

also uses the blackboard principle (see HARPY), where knowledge sources contribute to the recognition process via a global data base. In the recognition process, an utterance is segmented into categories of manner-of-articulation features, e.g. a stop -vowel-stop pattern. All words with a syllable structure corresponding to that of the input are proposed as hypotheses. However, words can also be hypothesised top-down by the syntactic component. So misses by the lexical hypothesiser, which are very likely, can be made up for by the syntactic predictor. The lexicon for word verification has the same structure as HARPY; It is defined in terms of spectral patterns.

(cf. =1 (

; Klatt 1977) , see also =1 (

; Erman 1977) and =1 (

; Erman Lesser 1980) )

SPHINX

is currently regarded as the state of the art system. It is a large-vocabulary continuous speech recognition system for speaker-independent application. It was evaluated on the DARPA naval resource management task. The baseline SPHINX system works with Hidden Markov Models (HMMs ) where each HMM represents a phone . The total of phones is 45. The phone models are concatenated to create word models, which in turn serve to create sentence models. The phonetic spelling of a word was adopted from the ANGEL System (cf. =1 (

; Rudnicky et al. 1987) ). The SPHINX baseline system has been improved by introducing multiple codebooks and adding information to the lexical-phonological component:

The most likely pronunciation was substituted for the baseform pronunciation of a lexical item in the pronunciation dictionary, retaining the assumption that each lexical item has only one pronunciation.
Different models were created for phones that have typically more than one realisation such as released and unreleased /d/ at the beginning of /dIdmaI/ and before /m/, respectively.
Two subword units were introduced: function-word-dependent phone models and generalised triphone models . Since function words are typically unstressed , phones in function words are very often deleted or reduced, do not serve as proper models for recognition tasks, and account for almost 50% of the errors.

The SPHINX system works with grammars of different perplexity (average branching factor; see the section on word models); the grammars are of a type which can, in principle, be regarded as a specialised tabular, network-like or tree-structured lexicon with probabilistic word-class information:

A null grammar with a perplexity of 997 (i.e. a vocabulary of 997 words was used); in a null grammar any word can succeed a given word.
A word-pair grammar with a perplexity of 60; word-pair grammars are lists of words that can follow a given word.
A bigram grammar with a perplexity of 20; this is a word-pair grammar equipped with word-category transition probabilities.

In word recognition tests, the best results were obtained with the bigram grammar , the most restrictive kind of the grammars mentioned above (96% accuracy compared with 71% for null grammars ).

The SPHINX system makes use of various levels of representation of linguistic units:

phone models (generalised triphones and extra models for function words)
word models (stored in the pronunciation dictionary with one representation for each word).
sentence models (for final confirmation)

(cf. =1 (

; Lee Hon Reddy 1990) ; see also =1 (

; Alleva et al. 1992) )

EVAR

(`Erkennen --- Verstehen --- Antworten --- Rückfragen') is a large-vocabulary continuous speech recognition and dialogue system. It is designed to understand standard German sentences and to react either in form of an answer or a question referring back to what has been said, within the specific discourse domain of enquiries concerning Intercity timetables. The EVAR lexicon has the following properties:

The lexicon includes not only sublanguage -specific words but also many words of the general vocabulary that occur during a dialogue of this kind.
The lexicon contains fully inflected word forms.
The baseforms, so-called Normalformen, e.g. infinitive for verbs, nominative singular for nouns, contain information relevant for all grammatical forms, thus reducing redundancy in the lexicon.
The lexicon contains phonological, syntactic, semantic, and pragmatic information .
Since the system modules need access only to special lexical knowledge (the articulation module makes use of phonological information, while the module in charge of generating the surface structure of an answer also needs syntactic information), access of individual modules to the lexicon is restricted. Preprocessors extract the subset of information relevant for each module.
The lexical unit in the EVAR lexicon is the graphematic word ( graphematisches Wort); so-called phonetic words (standard pronunciation) and so-called grammatical words (syntactic categories plus meanings) are assigned to the graphematic words.
Lexical units are described in attribute-value notation. For example, the attribute WORT takes a graphemic word as its value.
Graphemic words again have the attributes AUSSPRACHE and SYNTAX-TEIL for which values are defined in the form of a Duden standard pronunciation and morpho-syntactic properties such as the attribute-value pair WORTART-Verb. Numbers keep track of the various entries for different meanings or syntactic variants (e.g. reflexive --- non-reflexive), etc. of a lexical item.
In the baseform entries, information on stem , pronunciation of the stem (in ASCII symbols that replace the standard IPA notation), and the inflection pattern is given under SYNTAX-TEIL.
Semantic information includes specifications of semantic features and valence properties as well as selectional restrictions. Fillmore's system of deep structure cases as suggested in =1 (
; Fillmore 1968) has been expanded to 28 cases.

A lexicon administration system is being developed which uses tools for extracting words according to specified criteria, such as ``Look for nouns that express a location'' or ``Look for prepositions that express a direction''.

(cf. =1 (

; Ehrlich 1986) , =1 (

; Brietzmann et al. 1983) , =1 (

; Niemann et al. 1985) , =1 (

; Niemann et al. 1992) )

Recommendations on resources

The following recommendations should be seen in conjunction with recommendations made after the following more specialised sections of this chapter.

For information on resources in specific areas of spoken language systems, consult the other chapters and appendices in this handbook.
For information on resources which relate to written language, consult the other EAGLES working groups.
For general information on resources consult the organisations ELRA ( European Language Resources Association and LDC Linguistic Data Consortium
In preparation for decisions on the use of resources, distinguish between lexical database and the system lexicon
Identify the types of information required for the lexical database and the system lexicon.
Consider the relevant lexical database models and system lexicon architectures.
Develop a systematic concept for the tools required in producing and accessing a lexicon or a lexical database.

Next: What is a Up: Spoken Language Lexica Previous: Spoken Language Lexica

WWW Administrator
Fri May 19 11:53:36 MET DST 1995