
What is a spoken language lexicon?

Basic features of a spoken language lexicon

A spoken language lexicon may be a component in a system (a system lexicon), or a background resource for wider use (a lexical database), in each case containing information about the pronunciation, spelling, syntactic usage, meaning and specific pragmatic properties of words. Lexica containing subsets of this information may also be referred to as spoken language lexica, though the simpler cases are often simply referred to as wordlists. Where there is little danger of confusion, the term spoken language lexicon will be used to refer indifferently to either a spoken language system lexicon or a spoken language lexical database.

A spoken language lexicon is defined as a list of representations of lexical entries consisting of spoken word forms paired with their other lexical properties in such a way as to optimise lookup of these properties. This definition covers a wide range of specific types of spoken language lexicon: from lists in which orthography provides a more or less indirect representation of a spoken word form's pronunciation, through tables and dictionaries based on phoneme representations, to declarative knowledge bases from which details of lexical information are inferred from specific premises (entries) about individual lexical items and general premises (rules) about the structure of lexical items. It also covers application-directed special lexicon types based, for instance, on the different requirements for pronunciation tables for recognisers and for synthesisers.
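As a minimal illustration of this definition, a lexicon can be sketched as a lookup table from spoken word forms to their other lexical properties. The field names and the rough SAMPA-style transcription below are illustrative assumptions, not taken from any particular system.

```python
# Sketch of a spoken language lexicon as a lookup table.
# Field names and the SAMPA-style transcription are illustrative only.
lexicon = {
    "trousers": {
        "pronunciation": ["t", "r", "aU", "z", "@", "z"],  # phonemic form
        "spelling": "trousers",
        "syntax": {"pos": "noun", "number": "plural"},
        "meaning": "garment covering the legs",
    },
}

def lookup(word_form):
    """Optimised lookup: a hash table gives constant-time access to properties."""
    return lexicon.get(word_form)

print(lookup("trousers")["syntax"]["pos"])  # noun
```

A dictionary keyed on word forms is the simplest of the lexicon types listed above; a declarative knowledge base would instead infer some of these properties from rules.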

The different kinds of spoken language lexicon are primarily orientated towards the forms of words rather than towards their distribution in larger units or their meaning. The current state of the art prefers, where possible, to deal with finite sets of fully inflected words which are actually attested in corpora, rather than with the composition of words on morphological principles. Questions of word composition are increasing in importance in projects aimed at the recognition of spontaneous continuous speech, in which unknown words from a wider vocabulary, or ad hoc coinages (nonce forms), are encountered and require intelligent treatment in a robust processor.

Lexical databases and system lexica for spoken language

The distinction between lexical databases and system lexica is a useful one, though in practice more complex distinctions are required. The main characteristics of the two kinds of lexical object are outlined below.

Lexical database:
A spoken language lexical database is often a set of loosely related simpler databases (e.g. pronunciation table, signal annotation file, stochastic word model, main lexical database with syntactic and semantic information).

System lexicon:
Lexical information (i.e. properties of words) referred to during the speech recognition or synthesis process may not be concentrated in one identifiable lexicon in a given system.

Spoken language and written language lexica

Spoken language lexica differ in coverage and content in many respects from lexica for written language, although they also share much information with them. Written language lexica are generally based on a stem, neutral or morphologically canonical form (e.g. nominative singular; infinitive), or headword concept, under which generalisations over morphologically related forms may be included, leading to fairly compact representations. Spoken language lexica for speech recognition are generally based on fully inflected word forms, as in the dictation system TANGORA with about 20000 entries. Depending on the complexity of inflectional morphology in the language concerned, the number of entries is larger than the number of entries in a corresponding conventional dictionary based on stems or neutral forms by a factor ranging from two or three to several thousand. Speech synthesis systems for text-to-speech applications generally do not rely on extensive lexica, but use rule-based techniques for generating pronunciation forms and prosody (speech melody).
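The expansion factor can be made concrete by expanding a stem-based entry into the full-form entries a recognition lexicon would list. The English paradigm below is real; the expansion function and suffix list are simplifying assumptions (no spelling alternations are handled).

```python
# Sketch of why a full-form lexicon is larger than a stem-based one.
# The expansion function and suffix list are illustrative assumptions.
def expand(stem, suffixes):
    return [stem + s for s in suffixes]

# English verbal inflection is sparse: one stem yields only a few forms.
forms = expand("walk", ["", "s", "ed", "ing"])
print(forms)       # ['walk', 'walks', 'walked', 'walking']
print(len(forms))  # 4
```

A verb stem in a richly inflecting language such as French would expand to dozens of distinct full forms, which is how the large expansion factors mentioned above arise.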

In an orthographically oriented lexicon, it is generally sufficient to include a canonical phonemic transcription, based on the citation form of a word (the pronunciation of a word in isolation), which can be utilised, for example, in sophisticated tools for automatic spelling correction. However, this is inadequate for the requirements of speech recognition systems, in which further details are required.

A spoken language lexicon contains information about pronunciation and pronunciation variants, often including prosodic information about syllable structure, stress, and (in tone and pitch accent languages) about lexical tone and pitch accent, and morphological information about divisions into stems and affixes. Spoken language lexica are in general much more heavily orientated towards properties of word forms than towards the distributional and semantic properties of words.

It may happen that a canonical form or a canonical pronunciation does not actually occur in a given spoken language corpus; this would be of little consequence for a traditional dictionary, but in a spoken language dictionary it is necessary to adopt one of the following solutions:

  1. Use the canonical form, but mark it as non-occurring.
  2. Adopt an attested form as canonical form (e.g. nouns occurring only in the plural, such as French `ténèbres' (`darkness'), English `trousers', German `Leute' (`people')).
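The two solutions above can be sketched with an attestation flag on each entry; the entry format and the `canonical`/`attested` field names are illustrative assumptions.

```python
# Sketch of the two solutions for non-occurring canonical forms.
# The entry format and flags are illustrative assumptions.
lexicon = {
    # Solution 1: keep the (non-occurring) canonical singular, flagged.
    "trouser":  {"canonical": True, "attested": False},
    # Solution 2: promote the attested plural-only form to canonical.
    "trousers": {"canonical": True, "attested": True},
}

# A corpus-oriented tool can then skip canonical forms never attested
# in the corpus:
usable = [w for w, e in lexicon.items() if e["attested"]]
print(usable)  # ['trousers']
```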

At a more detailed level, orthography (the division of word forms into standardised units of writing) and phonology (the division of word forms into units of pronunciation) are related to each other in different ways in different languages, and also to the morphology (the division of word forms into units of sense) of the language. The orthographic notion of `syllable' serves, in general, in written language lexica for defining hyphenation at line-breaks and certain spelling rules; for this purpose, morphological information about words is also generally required. In spoken language, however, the phonological notion of `syllable' is different: it refers to units of speech which are basic to the definition of the well-formed sound sequences of a language and to the rhythmic structure of speech, and forms the basis for the definition of variant pronunciations of speech sounds.

Morphology is often described at an abstract level which permits generalisations over spoken language and written language forms. However, when complex word forms are put together from combinations of smaller units, different alternations of orthographic units (letters) and phonological units (phonemes) occur at the boundaries of the parts of such words. Furthermore, additional kinds of lexical unit are required in the lexicon of a spoken language dialogue system, such as discourse particles, hesitation phenomena, and pragmatic idioms, or so-called functional units (sequences of functional words which behave as a phonological unit) and clitics (functional words which combine with lexical words to form a sequence which behaves as a phonological unit).

Basic lexicographic coverage criteria

Criteria for the coverage of lexica for spoken language processing systems are heavily corpus-determined, and differ considerably from criteria for the coverage of lexica for traditional computational linguistics and natural language processing. In theoretical computational linguistics, interests are determined by systematic fragments of natural languages which reveal interesting problems of representation and processing. In natural language processing, broad coverage is often the goal. In spoken language lexica as currently used in speech technology, lexica are always oriented towards a particular well-defined corpus, which has often been specifically constructed for the task in hand. When speech technology and natural language specialists meet, for instance in comprehensive dialogue-oriented development projects, these differences of terminology and priorities are a potential source of misunderstanding and disagreement, and joint solutions need to be carefully negotiated.

The main coverage criteria for spoken language lexica may be summarised as follows.

These criteria pertain to words; if other units are used, the criteria apply analogously to these.

The first three criteria are essentials for the current state of speech technology. Conventional expectations in written language processing (in Computational Linguistics and in Natural Language Processing) are widely different, and are expressed in the fourth criterion. Clearly the second and fourth criteria clash; the relation to relevant corpora must therefore be carefully flagged in a spoken language lexicon. The degree of coverage (which for a speech recognition system generally has to be 100%) is expressed for lexica in general in terms of the notions of degree of static coverage (the ratio of the number of words in a corpus which are contained in a given dictionary to the number of words in the corpus) and degree of dynamic coverage (the probability of encountering words which have previously been encountered); the latter value is generally higher than the former (cf. Ferrané et al. 1992). On the basis of corpus statistics for typologically different languages such as English (cf. Averbuch, Bahl & Bakis 1987) and French (cf. Mérialdo 1988), two languages which differ widely in their inflectional structure (English with few verbal inflections, French with a rich verbal inflection system), interesting quantitative comparisons can be made.
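The two coverage measures can be computed directly from their definitions; the toy corpus and dictionary below are illustrative only.

```python
# Sketch of the static and dynamic coverage measures defined above.
def static_coverage(corpus, dictionary):
    """Fraction of corpus tokens that are listed in the dictionary."""
    covered = sum(1 for w in corpus if w in dictionary)
    return covered / len(corpus)

def dynamic_coverage(corpus):
    """Fraction of corpus tokens already encountered earlier in the corpus."""
    seen, repeats = set(), 0
    for w in corpus:
        if w in seen:
            repeats += 1
        seen.add(w)
    return repeats / len(corpus)

corpus = ["the", "cat", "sat", "on", "the", "mat", "the", "cat"]
print(static_coverage(corpus, {"the", "cat", "sat"}))  # 0.75  (6 of 8 tokens)
print(dynamic_coverage(corpus))                        # 0.375 (3 of 8 repeats)
```

On realistic corpora, the dynamic value climbs towards 1 as the corpus grows, since high-frequency words recur, which is why it generally exceeds the static value for a fixed dictionary.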

The lexicon in spoken language recognition systems

In a spoken language recognition system, the lexicon is generally divided into two components: the recognition component and the search component. In the recognition component, intervals of the speech signal are mapped to word hypotheses by probabilistic systems such as Hidden Markov Models, Neural Networks, Dynamic Programming algorithms, or Fuzzy Logic knowledge bases; the resulting mapping is organised as a word lattice or word graph, i.e. a set of word hypotheses, each assigned to a temporal interval in the speech signal. The term word is used here in the sense of `lexical lookup key'. The keys are traditionally represented by orthography, but would be better represented in a spoken language system by phonemic transcriptions. The search component has the task of finding the correct word form from the set of hypotheses delivered by the recognition component, and the correct lexical properties of this word form.
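A word lattice of the kind described above can be sketched as a set of scored, time-stamped hypotheses; the data layout, scores and intervals below are illustrative assumptions, and the selection function is a deliberately naive stand-in for a real search component.

```python
# Sketch of a word lattice: hypotheses paired with signal intervals and
# scores. All values and the entry layout are illustrative assumptions.
from dataclasses import dataclass

@dataclass
class WordHypothesis:
    word: str       # lexical lookup key (here orthographic)
    start_ms: int   # temporal interval in the speech signal
    end_ms: int
    score: float    # probability assigned by the recognition component

lattice = [
    WordHypothesis("their", 0, 320, 0.61),
    WordHypothesis("there", 0, 320, 0.57),
    WordHypothesis("cat", 320, 600, 0.83),
]

def best_in_interval(lattice, start, end):
    """Naive search step: highest-scoring hypothesis for one interval."""
    candidates = [h for h in lattice if h.start_ms == start and h.end_ms == end]
    return max(candidates, key=lambda h: h.score)

print(best_in_interval(lattice, 0, 320).word)  # their
```

A real search component would score whole paths through the lattice using lexical and language-model information rather than picking interval-by-interval.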

In spoken language recognition system development, a corpus-based lexicon of orthographically transliterated forms is used as the basis for a pronunciation lexicon; the lexicon is often supplemented by rules for generating pronunciation variants. This lexicon is required in order to tune the recognition system to a specific corpus by statistical training: frequencies of distribution of words in a corpus are interpreted as the a priori probabilities of words in a given context. These a priori probabilities may be based on the absolute frequencies of words, or on their frequencies relative to a given context, e.g. digram (bigram) frequencies. The functionality of spoken language lexica may be summarised in the following terms.
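The estimation of a priori probabilities from absolute and context-relative frequencies can be sketched as follows; the toy corpus is illustrative, and a real system would additionally smooth these estimates for unseen events.

```python
# Sketch of a priori word probabilities from corpus frequencies:
# absolute (unigram) and relative to the preceding word (bigram).
from collections import Counter

corpus = ["the", "cat", "sat", "on", "the", "mat"]

unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))

def p_unigram(w):
    """P(w): absolute frequency normalised by corpus size."""
    return unigrams[w] / len(corpus)

def p_bigram(w, prev):
    """P(w | prev), estimated as count(prev, w) / count(prev)."""
    return bigrams[(prev, w)] / unigrams[prev]

print(round(p_unigram("the"), 3))  # 0.333
print(p_bigram("cat", "the"))      # 0.5
```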

Recommendations on defining spoken language lexica

  1. Define the anticipated functions of the offline lexical database in spoken language system development and the online system lexicon, bearing in mind the differences between written text and speech, the intended application, and possible economies to be gained in using or creating reusable lexical resources.
  2. Specify the quantitative coverage (size) of the lexical database and of the system lexicon with reference to the available system components and in terms of degrees of static and dynamic coverage.
  3. Specify the qualitative coverage (content) of the lexical database and of the system lexicon with reference to the application domain and in terms of standard types of lexical information.

