
What is a spoken language lexicon?

Basic features of a spoken language lexicon

A spoken language lexicon may be a component in a system (a system lexicon), or a background resource for wider use (a lexical database), in each case containing information about the pronunciation, spelling, syntactic usage, meaning and specific pragmatic properties of words. Lexica containing subsets of this information may also be referred to as spoken language lexica, though the simpler cases are often simply referred to as wordlists. Where there is little danger of confusion, the term spoken language lexicon will be used to refer indifferently to either a spoken language system lexicon or a spoken language lexical database.

A spoken language lexicon is defined as a list of representations of lexical entries consisting of spoken word forms paired with their other lexical properties in such a way as to optimise lookup of these properties. This definition covers a wide range of specific types of spoken language lexicon: from lists in which orthography provides a more or less indirect representation of a spoken word form's pronunciation, through tables and dictionaries based on phoneme representations, to declarative knowledge bases from which details of lexical information are inferred from specific premises (entries) about individual lexical items and general premises (rules) about the structure of lexical items. It also covers application-directed special lexicon types based, for instance, on the different requirements for pronunciation tables for recognisers and for synthesisers.
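As a minimal illustration of this definition, a lexicon can be sketched as a lookup table from spoken word forms to their other lexical properties. The field names and the rough SAMPA-style transcription below are illustrative assumptions, not taken from any particular system.

```python
# Sketch of a spoken language lexicon as a lookup table.
# Field names and the SAMPA-style transcription are illustrative only.
lexicon = {
    "trousers": {
        "pronunciation": ["t", "r", "aU", "z", "@", "z"],  # phonemic form
        "spelling": "trousers",
        "syntax": {"pos": "noun", "number": "plural"},
        "meaning": "garment covering the legs",
    },
}

def lookup(word_form):
    """Optimised lookup: a hash table gives constant-time access to properties."""
    return lexicon.get(word_form)

print(lookup("trousers")["syntax"]["pos"])  # noun
```

A dictionary keyed on word forms is the simplest of the lexicon types listed above; a declarative knowledge base would instead infer some of these properties from rules.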

The different kinds of spoken language lexicon are primarily orientated towards the forms of words rather than towards their distribution in larger units or their meaning. The current state of the art prefers, where possible, to deal with finite sets of fully inflected words which are actually attested in corpora, rather than with the composition of words on morphological principles. Questions of word composition are increasing in importance in projects aimed at the recognition of spontaneous continuous speech, in which unknown words from a wider vocabulary, or ad hoc coinages (nonce forms), are encountered and require intelligent treatment in a robust processor.

Lexical databases and system lexica for spoken language

The distinction between lexical databases and system lexica is a useful one, though in practice more complex distinctions are required. The main characteristics of the two kinds of lexical object are outlined below.

Lexical database:
A spoken language lexical database is often a set of loosely related simpler databases (e.g. pronunciation table, signal annotation file, stochastic word model, main lexical database with syntactic and semantic information).

System lexicon:
Lexical information (i.e. properties of words) referred to during the speech recognition or synthesis process may not be concentrated in one identifiable lexicon in a given system.

Spoken language and written language lexica

Spoken language lexica differ in coverage and content in many respects from lexica for written language, although they also share much information with them. Written language lexica are generally based on a stem, neutral or morphologically canonical form (e.g. nominative singular; infinitive), or headword concept, under which generalisations over morphologically related forms may be included, leading to fairly compact representations. Spoken language lexica for speech recognition are generally based on fully inflected word forms, as in the dictation system TANGORA with about 20000 entries. Depending on the complexity of inflectional morphology in the language concerned, the number of entries is larger than the number of entries in a corresponding conventional dictionary based on stems or neutral forms by a factor ranging from two or three to several thousand. Speech synthesis systems for text-to-speech applications generally do not rely on extensive lexica, but use rule-based techniques for generating pronunciation forms and prosody (speech melody).
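The expansion factor can be made concrete by expanding a stem-based entry into the full-form entries a recognition lexicon would list. The English paradigm below is real; the expansion function and suffix list are simplifying assumptions (no spelling alternations are handled).

```python
# Sketch of why a full-form lexicon is larger than a stem-based one.
# The expansion function and suffix list are illustrative assumptions.
def expand(stem, suffixes):
    return [stem + s for s in suffixes]

# English verbal inflection is sparse: one stem yields only a few forms.
forms = expand("walk", ["", "s", "ed", "ing"])
print(forms)       # ['walk', 'walks', 'walked', 'walking']
print(len(forms))  # 4
```

A verb stem in a richly inflecting language such as French would expand to dozens of distinct full forms, which is how the large expansion factors mentioned above arise.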

In an orthographically oriented lexicon, it is generally sufficient to include a canonical phonemic transcription, based on the citation form of a word (the pronunciation of a word in isolation), which can be utilised, for example, in sophisticated tools for automatic spelling correction. However, this is inadequate for the requirements of speech recognition systems, in which further details are required.

A spoken language lexicon contains information about pronunciation and pronunciation variants, often including prosodic information about syllable structure, stress, and (in tone and pitch accent languages) about lexical tone and pitch accent, and morphological information about divisions into stems and affixes. Spoken language lexica are in general much more heavily orientated towards properties of word forms than towards the distributional and semantic properties of words.

It may happen that a canonical form or a canonical pronunciation does not actually occur in a given spoken language corpus; this would be of little consequence for a traditional dictionary, but in a spoken language dictionary it is necessary to adopt one of the following solutions:

  1. Use the canonical form, but mark it as non-occurring.
  2. Adopt an attested form as canonical form (e.g. nouns occurring only in the plural, such as French `ténèbres' (`darkness'), English `trousers', German `Leute' (`people')).
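The two solutions above can be sketched with an attestation flag on each entry; the entry format and the `canonical`/`attested` field names are illustrative assumptions.

```python
# Sketch of the two solutions for non-occurring canonical forms.
# The entry format and flags are illustrative assumptions.
lexicon = {
    # Solution 1: keep the (non-occurring) canonical singular, flagged.
    "trouser":  {"canonical": True, "attested": False},
    # Solution 2: promote the attested plural-only form to canonical.
    "trousers": {"canonical": True, "attested": True},
}

# A corpus-oriented tool can then skip canonical forms never attested
# in the corpus:
usable = [w for w, e in lexicon.items() if e["attested"]]
print(usable)  # ['trousers']
```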

At a more detailed level, orthography (the division of word forms into standardised units of writing) and phonology (the division of word forms into units of pronunciation) are related to each other in different ways in different languages, and also to the morphology (the division of word forms into units of sense) of the language. The orthographic notion of `syllable' serves, in general, in written language lexica for defining hyphenation at line-breaks and certain spelling rules; for this purpose, morphological information about words is also generally required. In spoken language, however, the phonological notion of `syllable' is different: it refers to units of speech which are basic to the definition of the well-formed sound sequences of a language and to the rhythmic structure of speech, and forms the basis for the definition of variant pronunciations of speech sounds.

Morphology is often described at an abstract level which permits generalisations over spoken language and written language forms. However, when complex word forms are put together from combinations of smaller units, different alternations of orthographic units (letters) and phonological units (phonemes) occur at the boundaries of the parts of such words. Furthermore, additional kinds of lexical unit are required in the lexicon of a spoken language dialogue system, such as discourse particles, hesitation phenomena, and pragmatic idioms, or so-called functional units (sequences of functional words which behave as a phonological unit) and clitics (functional words which combine with lexical words to form a sequence which behaves as a phonological unit).

Basic lexicographic coverage criteria

Criteria for the coverage of lexica for spoken language processing systems are heavily corpus-determined, and differ considerably from criteria for the coverage of lexica for traditional computational linguistics and natural language processing. In theoretical computational linguistics, interests are determined by systematic fragments of natural languages which reveal interesting problems of representation and processing. In natural language processing, broad coverage is often the goal. In spoken language lexica as currently used in speech technology, lexica are always oriented towards a particular well-defined corpus, which has often been specifically constructed for the task in hand. When speech technology and natural language specialists meet, for instance in comprehensive dialogue-oriented development projects, these differences of terminology and priorities are a potential source of misunderstanding and disagreement, and joint solutions need to be carefully negotiated.

The main coverage criteria for spoken language lexica may be summarised as follows.

These criteria pertain to words; if other units are used, the criteria apply analogously to these.

The first three criteria are essentials for the current state of speech technology. Conventional expectations in written language processing (in Computational Linguistics and in Natural Language Processing) are widely different, and are expressed in the fourth criterion. Clearly the second and fourth criteria clash; the relation to relevant corpora must therefore be carefully flagged in a spoken language lexicon. The degree of coverage (which for a speech recognition system generally has to be 100%) is expressed for lexica in general in terms of the notions of degree of static coverage (the ratio of the number of words in a corpus which are contained in a given dictionary to the number of words in the corpus) and degree of dynamic coverage (the probability of encountering words which have previously been encountered); the latter value is generally higher than the former (cf. Ferrané et al. 1992). On the basis of corpus statistics for typologically different languages such as English (cf. Averbuch, Bahl & Bakis 1987) and French (cf. Mérialdo 1988), two languages which differ widely in their inflectional structure (English with few verbal inflections, French with a rich verbal inflection system), interesting quantitative comparisons can be made.
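The two coverage measures can be computed directly from their definitions; the toy corpus and dictionary below are illustrative only.

```python
# Sketch of the static and dynamic coverage measures defined above.
def static_coverage(corpus, dictionary):
    """Fraction of corpus tokens that are listed in the dictionary."""
    covered = sum(1 for w in corpus if w in dictionary)
    return covered / len(corpus)

def dynamic_coverage(corpus):
    """Fraction of corpus tokens already encountered earlier in the corpus."""
    seen, repeats = set(), 0
    for w in corpus:
        if w in seen:
            repeats += 1
        seen.add(w)
    return repeats / len(corpus)

corpus = ["the", "cat", "sat", "on", "the", "mat", "the", "cat"]
print(static_coverage(corpus, {"the", "cat", "sat"}))  # 0.75  (6 of 8 tokens)
print(dynamic_coverage(corpus))                        # 0.375 (3 of 8 repeats)
```

On realistic corpora, the dynamic value climbs towards 1 as the corpus grows, since high-frequency words recur, which is why it generally exceeds the static value for a fixed dictionary.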

The lexicon in spoken language recognition systems

In a spoken language recognition system, the lexicon is generally divided into two components: the recognition component and the search component. In the recognition component, intervals of the speech signal are mapped to word hypotheses by probabilistic systems such as Hidden Markov Models, Neural Networks, Dynamic Programming algorithms, or Fuzzy Logic knowledge bases; the resulting mapping is organised as a word lattice or word graph, i.e. a set of word hypotheses, each assigned to a temporal interval in the speech signal. The term word is used here in the sense of `lexical lookup key'. The keys are traditionally represented by orthography, but would be better represented in a spoken language system by phonemic transcriptions. The search component has the task of finding the correct word form from the set of hypotheses delivered by the recognition component, and the correct lexical properties of this word form.
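A word lattice of the kind described above can be sketched as a set of scored, time-stamped hypotheses; the data layout, scores and intervals below are illustrative assumptions, and the selection function is a deliberately naive stand-in for a real search component.

```python
# Sketch of a word lattice: hypotheses paired with signal intervals and
# scores. All values and the entry layout are illustrative assumptions.
from dataclasses import dataclass

@dataclass
class WordHypothesis:
    word: str       # lexical lookup key (here orthographic)
    start_ms: int   # temporal interval in the speech signal
    end_ms: int
    score: float    # probability assigned by the recognition component

lattice = [
    WordHypothesis("their", 0, 320, 0.61),
    WordHypothesis("there", 0, 320, 0.57),
    WordHypothesis("cat", 320, 600, 0.83),
]

def best_in_interval(lattice, start, end):
    """Naive search step: highest-scoring hypothesis for one interval."""
    candidates = [h for h in lattice if h.start_ms == start and h.end_ms == end]
    return max(candidates, key=lambda h: h.score)

print(best_in_interval(lattice, 0, 320).word)  # their
```

A real search component would score whole paths through the lattice using lexical and language-model information rather than picking interval-by-interval.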

In spoken language recognition system development, a corpus-based lexicon of orthographically transliterated forms is used as the basis for a pronunciation lexicon; the lexicon is often supplemented by rules for generating pronunciation variants. This lexicon is required in order to tune the recognition system to a specific corpus by statistical training: frequencies of distribution of words in a corpus are interpreted as the a priori probabilities of words in a given context. These a priori probabilities may be based on the absolute frequencies of words, or on their frequencies relative to a given context, e.g. digram (bigram) frequencies. The functionality of spoken language lexica may be summarised in the following terms.
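The estimation of a priori probabilities from absolute and context-relative frequencies can be sketched as follows; the toy corpus is illustrative, and a real system would additionally smooth these estimates for unseen events.

```python
# Sketch of a priori word probabilities from corpus frequencies:
# absolute (unigram) and relative to the preceding word (bigram).
from collections import Counter

corpus = ["the", "cat", "sat", "on", "the", "mat"]

unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))

def p_unigram(w):
    """P(w): absolute frequency normalised by corpus size."""
    return unigrams[w] / len(corpus)

def p_bigram(w, prev):
    """P(w | prev), estimated as count(prev, w) / count(prev)."""
    return bigrams[(prev, w)] / unigrams[prev]

print(round(p_unigram("the"), 3))  # 0.333
print(p_bigram("cat", "the"))      # 0.5
```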

Recommendations on defining spoken language lexica

  1. Define the anticipated functions of the offline lexical database in spoken language system development and the online system lexicon, bearing in mind the differences between written text and speech, the intended application, and possible economies to be gained in using or creating reusable lexical resources.
  2. Specify the quantitative coverage (size) of the lexical database and of the system lexicon with reference to the available system components and in terms of degrees of static and dynamic coverage.
  3. Specify the qualitative coverage (content) of the lexical database and of the system lexicon with reference to the application domain and in terms of standard types of lexical information.

