A given system lexicon or lexical database is based on a lexical information model or a data model; often the model is intuitively constructed, or based on notions taken from traditional school grammar, but scientifically motivated models are becoming available. A model of lexical information will make at least the following distinctions:
Lexicon models for lexical databases and system lexicons are part of the overall conceptual framework required for lexicon development. Modern approaches to lexicon development provide suitable lexical representation languages for formulating and integrating the different kinds of lexical information specified in a lexicon model and assigning them to lexical objects, together with implementations for these representation languages (cf. Andry et al. 1992). In recent work, the following useful distinctions are sometimes made:
The aspects of representation and architecture will be dealt with in a later section. The following subsections are concerned with the main kinds of lexical information required for spoken language lexical entries.
Lexical information is often regarded as a heterogeneous collection of idiosyncratic information about lexical items. An assumption such as this makes it hard to discuss lexical information systematically and, moreover, from the point of view of contemporary lexicography, it is wrong. For this reason, a simple unifying informal model of lexical signs, related to a view which is current in computational linguistics and computational lexicography, is introduced for the purpose of further discussion.
In general terms, a sign is a constituent of a pattern in communication with identifiable form and meaning. Lexical signs are words, or perhaps also phrasal idioms and other items such as dialogue control particles (er, uhm, aha etc.); it may also be argued that even smaller units such as morphemes have sign structure. Possible lexical signs range, essentially, over fully inflected word forms, morphs (roots, affixes), stems (roots or stems to which affixation has applied), lemmas (or lemmata), often represented by an orthographic form, and phrasal items (idioms).
Signs are characterised by the following four basic types of information:
The first two types are referred to as interpretative properties, since they interpret the basic sign representation in terms of the real world of phonetics (or writing) and meaning, while the latter two are referred to as structural (or syntactic, in a general sense of the term) properties. Complex signs are constructed compositionally from their constituent signs and derive their properties from these.
The following sections will be devoted to the four main types of lexical information, referring to them as surface, content, grammatical and morphological information, respectively.
The model used here for describing lexical information is based on the widespread current view of linguistic units as language signs and sign components. A number of complexities occur in the relations between the types of information, which will be discussed in a later section.
In the examples below, a basic computer-readable attribute-value syntax is used, based on the kind of spoken language lexical representation in DATR used by Andry et al. (1992). The name of the lexical sign is written with an initial upper case letter and followed by a colon; attribute names can be either word-like atoms or sequences of atoms (in the latter case permitting an indirect representation of more complex attribute structures); they are enclosed in angle brackets and separated from their values by an equality sign, and the lexical sign is terminated by a period. The SAMPA notation is introduced below and is defined in the Appendices; see also the chapter on Spoken Language Corpora.
Table:
  <surface orthography> = table
  <surface phonetics sampa> = teIbl
  <semantics> = artefactual horizontal surface
  <distribution> = noun common countable
  <composition> = simplex z_plural.
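An entry of this kind can be mirrored directly in a general-purpose data structure. The following Python sketch is purely illustrative (the dictionary layout and the `lookup` helper are assumptions, not part of any standard): attribute paths are represented as tuples of atoms, exactly parallel to the angle-bracketed attribute sequences of the notation.

```python
# Illustrative sketch: a lexical entry as a mapping from attribute
# paths (tuples of atoms) to values, mirroring the DATR-like notation.
table_entry = {
    ("surface", "orthography"): "table",
    ("surface", "phonetics", "sampa"): "teIbl",
    ("semantics",): "artefactual horizontal surface",
    ("distribution",): ("noun", "common", "countable"),
    ("composition",): ("simplex", "z_plural"),
}

def lookup(entry, *path):
    """Return the value stored under an attribute path."""
    return entry[tuple(path)]

print(lookup(table_entry, "surface", "orthography"))  # table
```

Tuple keys keep the flat path-to-value character of the notation; a nested dictionary would be an equally reasonable encoding.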
In the case of complex signs, the meaning of the sign is a function of the meanings of its parts and the pronunciation of the sign is a function of the pronunciations of its parts. These functions may be somewhat idiosyncratic with lexical signs; this is shown by the pronunciation and meaning of words like English ``dustman''.
Dustman:
  <surface orthography> = dustman
  <surface phonetics sampa> = dVsm@n
  <semantics> = `municipal garbage collector'.
The pronunciation and meaning of this complex lexical sign are not in all respects a general compositional function of its parts:
Dust:
  <surface phonetics sampa> = dVst
  <semantics> = `just visible particles of solid matter'.

Man:
  <surface phonetics sampa> = m{n
  <semantics> = `male adult human being'.
In contrast, the spelling and the distribution of the complex sign are perfectly regular functions of the spellings of the parts and the distribution of the head (i.e. Man) of the sign.
In perfectly regular cases, there is no necessity to include complex words in the lexicon. Such cases are practically non-existent, however, since complex words are in general partially idiosyncratic ; in a comprehensive spoken language lexicon, both these and their parts need to be included. For most current practical purposes, in which potential words (unknown words or ad hoc word formations) do not need to be treated in addition to actual words (those contained in a lexicon), complex words can be listed in full as unanalysed forms.
Modern computational lexicographic practice attempts to reduce the redundancy in a lexicon as far as possible; fully regular information in compounds can be inherited from the parts of the compounds, while idiosyncratic information is specified locally. In such a case, a lexical class is specified which defines the structure of compounds, and ``inheritance pointers'' are included. The result is a hierarchical lexicon structure. Reference should be made to the literature cited in this section and in the section on lexicon structure for further information on this approach to lexical lingware development.
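The inheritance mechanism just described can be sketched in a few lines. The lexicon layout, attribute names and `get` function below are hypothetical illustrations of the general principle: locally specified (idiosyncratic) information overrides information inherited compositionally from a compound's parts.

```python
# Hedged sketch of default inheritance in a hierarchical lexicon:
# regular properties of a compound are computed from its parts,
# idiosyncratic ones are stated locally and take priority.
lexicon = {
    "dust": {"orth": "dust", "sampa": "dVst"},
    "man":  {"orth": "man",  "sampa": "m{n"},
    "dustman": {
        "parts": ["dust", "man"],
        # local (idiosyncratic) overrides:
        "sampa": "dVsm@n",
        "sem": "municipal garbage collector",
    },
}

def get(word, attr):
    entry = lexicon[word]
    if attr in entry:                       # local information wins
        return entry[attr]
    if "parts" in entry:                    # otherwise inherit from parts
        return "".join(get(p, attr) for p in entry["parts"])
    raise KeyError(attr)

print(get("dustman", "orth"))   # dustman  (regular: spelled from parts)
print(get("dustman", "sampa"))  # dVsm@n   (idiosyncratic, stored locally)
```

Only the irregular pronunciation and meaning of `dustman` are stored; its spelling is computed, which is exactly the redundancy reduction the text describes.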
Intuitively, the prototypic lexical unit is a word. This notion has a number of catches to it, however, because words are not as simple as they seem. The intuitive notion of word has ``fuzzy edges'', as in the following cases.
The central meaning for the purpose of spoken language lexica will be taken to be the morphological word.
Lexical units (entries, items) are assigned sets of properties which identify lexical signs and determine the organisation of the lexicon. In practical contexts, the choice of lexical unit for a given lexicon is decisive for procedural reasons, i.e. in determining ways in which a lexicon may be most easily accessed: through orthography, pronunciation, meaning, syntactic properties, or via its morphological properties (stem, inflection). The application-driven decision on the kind of lexical unit for a given purpose is a non-trivial one. However, for many practical purposes fairly straightforward guidelines can be given. The form of a lexical item, in particular its orthography, is often used as the main identifying property for accessing the lexicon. However, access on phonetic grounds, via the phonological form, is evidently the optimal procedure for speech recognition, and access on conceptual semantic or syntactic grounds is evidently the optimal procedure for speech synthesis. The use of orthography as an intermediate stage in speech recognition is a useful and widespread heuristic which generally does not introduce significant numbers of artefacts into the mapping from speech signals to lexical items, but is not recommended for complex systems with large vocabularies, except as a means of visualisation in user interfaces. For text-to-speech applications, of course, orthography is the optimal lexical access key.
It has already been noted that fully inflected form lexica and lexical databases are fairly standard for speech recognition. Where a small closed vocabulary is used, and new, unknown or ad hoc word formations are not required (as in most current applications of speech synthesis and recognition), fully inflected word forms are listed. This procedure is most convenient in languages with very small inflectional paradigms; for languages of the agglutinative type, in which large numbers of inflectional endings are concatenated, the procedure rapidly becomes intractable. In other applications, too, such as speech synthesis, it may be more tractable to generate fully inflected word forms from stems and endings.
An example of a language with few inflections is English, where (except for a few pronouns) only nouns and verbs are inflected, and even here only three forms for nouns (uninflected, genitive and plural) and four for verbs (uninflected, third person singular present, past, and present participle; irregular verbs in addition have a distinct past participle form). English is therefore not a good example for illustrating inflectional morphology (in other areas of morphology, i.e. in word formation, languages appear to be equally complex).
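The small English paradigms just described can be expanded mechanically from a stem. The sketch below covers only the fully regular endings; irregular forms (e.g. distinct past participles) would have to be listed as local exceptions rather than generated.

```python
# Toy expansion of regular English inflectional paradigms from a stem.
def noun_forms(stem):
    # uninflected, genitive, plural
    return [stem, stem + "'s", stem + "s"]

def verb_forms(stem):
    # uninflected, 3rd person singular present, past, present participle
    return [stem, stem + "s", stem + "ed", stem + "ing"]

print(noun_forms("table"))  # ['table', "table's", 'tables']
print(verb_forms("walk"))   # ['walk', 'walks', 'walked', 'walking']
```

Even this toy version makes the point of the surrounding text: the fully inflected English vocabulary is only a small constant factor larger than the stem vocabulary.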
French is much more complex, with inflections on adjectives and large verb paradigms; orthographic inflection in French has more inflectional endings than are distinguished in phonological inflection.
German also has a complex inflectional morphology, with significantly more endings on all articles, pronouns, nouns, adjectives and verbs, increasing the size of the vocabulary over the size of a stem-oriented lexicon by a factor of about 4.
In extremely highly inflecting languages such as Finnish, the number of endings and the length of sequences of endings multiply out to increase the vocabulary by a factor of over 1000. Special morphological techniques have been developed (e.g. two-level morphology) to permit efficient calculation of inflected forms and to avoid finite but intractable explosion of lexicon size for highly inflecting languages (cf. Koskenniemi 1983; Karttunen 1983). These techniques have so far not been applied to any significant extent in speech technology.
The figures cited refer only to the sets of forms. When the form-function mapping, i.e. the association of a given inflected form with a morphosyntactic category, is considered, the figures become much worse. A single inflected adjective form such as guten in German has 44 possible interpretations which are relevant for morphosyntactic agreement contexts (cf. Gibbon 1995), with 13 feminine readings, 17 masculine readings, and 14 neuter readings, depending on different cases (nominative, accusative, genitive and dative) and different determiner (article) categories (strong, weak and mixed). It is possible to reduce the size of these sets by means of default-logical abbreviations in a lexical database, but for efficient processing they ultimately need to be multiplied out. Similar considerations apply to the other word categories, and to other highly inflecting languages.
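The "multiplying out" of compact paradigm descriptions can be illustrated abstractly. The feature sets below are a toy example, not the actual German data (they do not reproduce the 44 readings of guten, which are unevenly distributed across genders); they show only how a small feature space expands combinatorially when a compact description is made explicit.

```python
# Toy illustration of multiplying out a compact paradigm description
# into its full set of form-function mappings for efficient processing.
from itertools import product

cases = ["nom", "acc", "gen", "dat"]
genders = ["masc", "fem", "neut"]
declensions = ["strong", "weak", "mixed"]

# One ending such as -en covers many feature combinations at once;
# explicit processing needs each combination as a separate reading.
readings = [f"{c}.{g}.{d}" for c, g, d in product(cases, genders, declensions)]
print(len(readings))  # 36 readings from a 4 x 3 x 3 feature space
```

A default-logical lexical database would store only the compact description and expand it, as here, when the full set of readings is required.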
The inflectional properties of many languages imply that, for these languages, large vocabulary systems with complex grammatical constructions require prohibitively large fully inflected form inventories. Although the sets of mappings involved can be very large, the inflectional systems of languages define a finite number of variants for each stem, and therefore it may make sense in complex applications to define a rule-based ``virtual lexical database'' or a ``virtual lexicon'' which constructs or analyses each fully inflected word form on demand using a stem or morph lexicon with a morphological rule component.
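Such a virtual lexicon can be sketched in a few lines. The stem lexicon and ending tables below are toy assumptions, intended only to show generation and analysis of fully inflected forms on demand from a stem lexicon plus a small rule component.

```python
# Sketch of a "virtual lexicon": fully inflected forms are not stored,
# but constructed and analysed on demand.  All data are illustrative.
STEMS = {"walk": "verb", "table": "noun"}
ENDINGS = {"verb": ["", "s", "ed", "ing"], "noun": ["", "s"]}

def generate(stem):
    """Construct all fully inflected forms of a stem on demand."""
    return [stem + e for e in ENDINGS[STEMS[stem]]]

def analyse(form):
    """Map a fully inflected form onto (stem, ending) candidates."""
    return [(s, form[len(s):]) for s in STEMS
            if form.startswith(s) and form[len(s):] in ENDINGS[STEMS[s]]]

print(generate("walk"))   # ['walk', 'walks', 'walked', 'walking']
print(analyse("tables"))  # [('table', 's')]
```

The point is architectural: the set of forms remains finite and well-defined, but storage grows with the number of stems rather than with the (much larger) number of inflected forms.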
Lexica based on the morphological parts of words, coupled with lexical rules for defining the composition of words from these parts, are not widely used in current speech recognition practice. They are useful, however, in expanding lexica of attested forms to include all fully inflected forms, and in tools which verify the consistency of corpus transcriptions and lexica.
Terminology in this area is somewhat variable. In the most general usage, a stem is any uninflected item, whether morphologically simple or complex. However, intermediate stages in word formation by affixation, and in the inflection of highly inflected languages, are also called stems. The smallest stem is a phonological morph or an orthographic morph, i.e. the phonological or orthographic realisation of a lexical morpheme. Since stems may vary in different inflectional contexts, as affixes do, it is necessary to include information about the morphophonological (and morphographemic) alternations of such morphemes:
Knife:
  <surface phonology singular> = naIf
  <surface phonology plural> = naIv + z
  <surface orthography singular> = knife
  <surface orthography plural> = knive + s.
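Computationally, handling such stem alternations amounts to storing both alternants and letting a rule select between them. The entry layout and pluralisation rule below are illustrative assumptions, sketching how the plural forms of knife would be assembled from the alternating stem plus the regular endings.

```python
# Illustrative treatment of stem alternation: the entry stores both
# stem alternants; the rule component picks the right one per context.
knife = {
    "stem_sg": {"sampa": "naIf", "orth": "knife"},
    "stem_pl": {"sampa": "naIv", "orth": "knive"},
}

def plural(entry):
    """Attach the regular plural endings to the plural stem alternant."""
    return {"sampa": entry["stem_pl"]["sampa"] + "z",
            "orth":  entry["stem_pl"]["orth"] + "s"}

print(plural(knife))  # {'sampa': 'naIvz', 'orth': 'knives'}
```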
The use of morphological decomposition of the kind illustrated here has been demonstrated to bring advantages in medium size vocabulary speech recognition (Geutner 1995).
In a stem lexicon, the basic lexical key or lemma is the stem, which is represented in some kind of normalised notation. The most common kind of normalised or canonical notation has the following two properties:
For specific purposes, in which lexical entries need to be accessed on the basis of a specific property, indexing based, for instance, on the canonical phonemic representation, either of a fully inflected form or of the canonical inflected form, or even of the stem itself, may be used. This question is dealt with in more detail below.
As in the knife example, one particular form, for instance orthographic , of an entry is often used as a headword or lemma . From a technical lexicographic point of view, this form then has a dual function:
In modern spoken language lexicography, this distinction is central, and ignoring it may lead to confusion. It applies particularly where the primary criterion of access by word form is phonological, as in spoken language lexicography. The concept of an abstract lemma, deriving from recent developments in computational linguistics and their application to phonology and prosody, may be used in order to clarify the distinction (cf. Gibbon 1992): an abstract lemma may have any convenient unique name or number (or indeed be labelled by the spelling of the canonical inflected form, as already noted); all properties have equal status, so that the abstract lemma is neutral with respect to different types of lexical access, through spelling, pronunciation, semantics, etc. The examples of lexical entries given so far are based on the concept of an abstract lemma. The neutrality of the abstract lemma with respect to particular properties and particular directions of lexical access makes it suitable as a basic concept for organising flexible lexical databases. A lexicon based on a neutral abstract lemma concept is the basic form of a declarative lexicon, in which the structure of the lexicon is not dictated by the requirements of specific types of lexical access (as in a procedural lexicon), but by general logical principles. The distinction between declarative and procedural lexica is a relative one, however, which is taken up in the section on spoken language lexicon architectures.
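The neutrality of the abstract lemma can be made concrete by keying entries on arbitrary unique identifiers and building a separate index for each access direction. The identifiers and attribute names below are purely illustrative.

```python
# Sketch of abstract lemmas: neutral records under arbitrary unique
# identifiers, with one index per direction of lexical access.
entries = {
    "L001": {"orth": "table", "sampa": "teIbl",
             "sem": "artefactual horizontal surface"},
    "L002": {"orth": "dust", "sampa": "dVst",
             "sem": "just visible particles of solid matter"},
}

# Each index is derived from the same neutral data; none is privileged.
by_orth  = {e["orth"]:  lid for lid, e in entries.items()}
by_sampa = {e["sampa"]: lid for lid, e in entries.items()}

print(by_orth["table"])                    # L001
print(entries[by_sampa["dVst"]]["orth"])   # dust
```

Access by spelling (synthesis from text) and access by pronunciation (recognition) are then symmetrical operations over the same declarative data, which is the organisational point of the abstract lemma.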
The relations between orthographic , phonological, syntactic and semantic properties of lexical units are complex, and make a theoretically satisfying definition of ``lexical sign'' quite elusive. Lexical relations are either paradigmatic, and define classes of similar items, or syntagmatic, and define whole items in terms of relations between their parts.
Present discussion is restricted to the main paradigmatic relations in traditional terms; the expression of these relations in terms of semantic features, semantic markers or semantic components is not dealt with explicitly, though it figures implicitly in the notion of attribute-value structures which is referred to in the examples.
The syntagmatic relations (roles; collocational relations) are more controversial, and are not dealt with here. For further information on these and on semantic components, reference should be made to the results of the EAGLES Computational Lexica Working Group and to standard semantics textbooks such as Lyons (1977) or Cruse (1986).
The following systematised versions of traditional definitions express the main paradigmatic relations between lexical signs.
In addition to these lexical relations, there are a number of structural complexities which hold between different types of information.