A given system lexicon or lexical database is based on a lexical information model or a data model; often the model is intuitively constructed, or based on notions taken from traditional school grammar, but scientifically motivated models are becoming available. A model of lexical information will make at least the following distinctions:
Lexicon models for lexical databases and system lexicons are part of the overall conceptual framework required for lexicon development. Modern approaches to lexicon development provide suitable lexical representation languages for formulating and integrating the different kinds of lexical information specified in a lexicon model and assigning them to lexical objects, together with implementations for these representation languages (cf. Andry et al. 1992). In recent work, the following useful distinctions are sometimes made:
The aspects of representation and architecture will be dealt with in a later section. The following subsections are concerned with the main kinds of lexical information required for spoken language lexical entries.
Lexical information is often regarded as a heterogeneous collection of idiosyncratic information about lexical items. An assumption such as this makes it hard to discuss lexical information systematically and, moreover, from the point of view of contemporary lexicography, it is wrong. For this reason, a simple unifying informal model of lexical signs, related to a view which is current in computational linguistics and computational lexicography, is introduced for the purpose of further discussion.
In general terms, a sign is a constituent of a pattern in communication with identifiable form and meaning. Lexical signs are words, or perhaps also phrasal idioms and other items such as dialogue control particles (er, uhm, aha etc.); it may also be argued that even smaller units such as morphemes have sign structure. Possible lexical signs range, essentially, over fully inflected word forms, morphs (roots, affixes), stems (roots or stems to which affixation has applied), lemmas (or lemmata), often represented by an orthographic form, and phrasal items (idioms).
Signs are characterised by the following four basic types of information:
The first two types are referred to as interpretative properties, since they interpret the basic sign representation in terms of the real world of phonetics (or writing) and meaning, while the latter two are referred to as structural (or syntactic, in a general sense of the term) properties. Complex signs are constructed compositionally from their constituent signs and derive their properties from these.
The following sections will be devoted to the four main types of lexical information, referring to them as surface, content, grammatical and morphological information, respectively.
The model used here for describing lexical information is based on the widespread current view of linguistic units as language signs and sign components. A number of complexities occur in the relations between the types of information, which will be discussed in a later section.
In the examples below, a basic computer-readable attribute-value syntax is used, based on the kind of spoken language lexical representation in DATR used by Andry et al. (1992). The name of the lexical sign is written with an initial upper case letter and followed by a colon; attribute names can be either word-like atoms or sequences of atoms (in the latter case permitting an indirect representation of more complex attribute structures); they are enclosed in angle brackets and separated from their values by an equality sign, and the lexical sign is terminated by a period. The SAMPA notation is introduced below and is defined in the Appendices; see also the chapter on Spoken Language Corpora.
Table:
  <surface orthography> = table
  <surface phonetics sampa> = teIbl
  <semantics> = artefactual horizontal surface
  <distribution> = noun common countable
  <composition> = simplex z_plural.
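An entry of this kind can be mirrored directly in a general-purpose data structure. The following Python sketch is purely illustrative (the dictionary layout and the `lookup` helper are assumptions, not part of any standard): attribute paths are represented as tuples of atoms, exactly parallel to the angle-bracketed attribute sequences of the notation.

```python
# Illustrative sketch: a lexical entry as a mapping from attribute
# paths (tuples of atoms) to values, mirroring the DATR-like notation.
table_entry = {
    ("surface", "orthography"): "table",
    ("surface", "phonetics", "sampa"): "teIbl",
    ("semantics",): "artefactual horizontal surface",
    ("distribution",): ("noun", "common", "countable"),
    ("composition",): ("simplex", "z_plural"),
}

def lookup(entry, *path):
    """Return the value stored under an attribute path."""
    return entry[tuple(path)]

print(lookup(table_entry, "surface", "orthography"))  # table
```

Tuple keys keep the flat path-to-value character of the notation; a nested dictionary would be an equally reasonable encoding.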
In the case of complex signs, the meaning of the sign is a function of the meanings of its parts and the pronunciation of the sign is a function of the pronunciations of its parts. These functions may be somewhat idiosyncratic with lexical signs; this is shown by the pronunciation and meaning of words like English ``dustman''.
Dustman:
  <surface orthography> = dustman
  <surface phonetics sampa> = dVsm@n
  <semantics> = `municipal garbage collector'.
The pronunciation and meaning of this complex lexical sign are not in all respects a general compositional function of its parts:
Dust:
  <surface phonetics sampa> = dVst
  <semantics> = `just visible particles of solid matter'.

Man:
  <surface phonetics sampa> = m{n
  <semantics> = `male adult human being'.
In contrast, the spelling and the distribution of the complex sign are perfectly regular functions of the spellings of the parts and the distribution of the head (i.e. Man) of the sign.
In perfectly regular cases, there is no necessity to include complex words in the lexicon. Such cases are practically non-existent, however, since complex words are in general partially idiosyncratic ; in a comprehensive spoken language lexicon, both these and their parts need to be included. For most current practical purposes, in which potential words (unknown words or ad hoc word formations) do not need to be treated in addition to actual words (those contained in a lexicon), complex words can be listed in full as unanalysed forms.
Modern computational lexicographic practice attempts to reduce the redundancy in a lexicon as far as possible; fully regular information in compounds can be inherited from the parts of the compounds, while idiosyncratic information is specified locally. In such a case, a lexical class is specified which defines the structure of compounds, and ``inheritance pointers'' are included. The result is a hierarchical lexicon structure. Reference should be made to the literature cited in this section and in the section on lexicon structure for further information on this approach to lexical lingware development.
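The inheritance mechanism just described can be sketched in a few lines. The lexicon layout, attribute names and `get` function below are hypothetical illustrations of the general principle: locally specified (idiosyncratic) information overrides information inherited compositionally from a compound's parts.

```python
# Hedged sketch of default inheritance in a hierarchical lexicon:
# regular properties of a compound are computed from its parts,
# idiosyncratic ones are stated locally and take priority.
lexicon = {
    "dust": {"orth": "dust", "sampa": "dVst"},
    "man":  {"orth": "man",  "sampa": "m{n"},
    "dustman": {
        "parts": ["dust", "man"],
        # local (idiosyncratic) overrides:
        "sampa": "dVsm@n",
        "sem": "municipal garbage collector",
    },
}

def get(word, attr):
    entry = lexicon[word]
    if attr in entry:                       # local information wins
        return entry[attr]
    if "parts" in entry:                    # otherwise inherit from parts
        return "".join(get(p, attr) for p in entry["parts"])
    raise KeyError(attr)

print(get("dustman", "orth"))   # dustman  (regular: spelled from parts)
print(get("dustman", "sampa"))  # dVsm@n   (idiosyncratic, stored locally)
```

Only the irregular pronunciation and meaning of `dustman` are stored; its spelling is computed, which is exactly the redundancy reduction the text describes.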
Intuitively, the prototypic lexical unit is a word. This notion has a number of catches to it, however, because words are not as simple as they seem. The intuitive notion of word has ``fuzzy edges'', as in the following cases.
The central meaning for the purpose of spoken language lexica will be taken to be the morphological word.
Lexical units (entries, items) are assigned sets of properties which identify lexical signs and determine the organisation of the lexicon. In practical contexts, the choice of lexical unit for a given lexicon is decisive for procedural reasons, i.e. in determining ways in which a lexicon may be most easily accessed: through orthography, pronunciation, meaning, syntactic properties, or via its morphological properties (stem, inflection). The application-driven decision on the kind of lexical unit for a given purpose is a non-trivial one. However, for many practical purposes fairly straightforward guidelines can be given. The form of a lexical item, in particular its orthography, is often used as the main identifying property for accessing the lexicon. However, access on phonetic grounds, via the phonological form, is evidently the optimal procedure for speech recognition, and access on conceptual semantic or syntactic grounds is evidently the optimal procedure for speech synthesis. The use of orthography as an intermediate stage in speech recognition is a useful and widespread heuristic which generally does not introduce significant numbers of artefacts into the mapping from speech signals to lexical items, but is not recommended for complex systems with large vocabularies, except as a means of visualisation in user interfaces. For text-to-speech applications, of course, orthography is the optimal lexical access key.
It has already been noted that fully inflected form lexica and lexical databases are fairly standard for speech recognition. Where a small closed vocabulary is used, and new, unknown or ad hoc word formations are not required (as in most current applications of speech synthesis and recognition), fully inflected word forms are listed. This procedure is most convenient in languages with very small inflectional paradigms; for languages of the agglutinative type, in which large numbers of inflectional endings are concatenated, the procedure rapidly becomes intractable. In other applications, too, such as speech synthesis, it may be more tractable to generate fully inflected word forms from stems and endings.
An example of a language with few inflections is English, where (except for a few pronouns) only nouns and verbs are inflected, and even here only three forms for nouns (uninflected, genitive and plural) and four for verbs (uninflected, third person singular present, past, and present participle; irregular verbs in addition have a distinct past participle form). English is therefore not a good example for illustrating inflectional morphology (in other areas of morphology, i.e. in word formation, languages appear to be equally complex).
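The small English paradigms just described can be expanded mechanically from a stem. The sketch below covers only the fully regular endings; irregular forms (e.g. distinct past participles) would have to be listed as local exceptions rather than generated.

```python
# Toy expansion of regular English inflectional paradigms from a stem.
def noun_forms(stem):
    # uninflected, genitive, plural
    return [stem, stem + "'s", stem + "s"]

def verb_forms(stem):
    # uninflected, 3rd person singular present, past, present participle
    return [stem, stem + "s", stem + "ed", stem + "ing"]

print(noun_forms("table"))  # ['table', "table's", 'tables']
print(verb_forms("walk"))   # ['walk', 'walks', 'walked', 'walking']
```

Even this toy version makes the point of the surrounding text: the fully inflected English vocabulary is only a small constant factor larger than the stem vocabulary.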
French is much more complex, with inflections on adjectives and large verb paradigms; orthographic inflection in French has more inflectional endings than are distinguished in phonological inflection.
German also has a complex inflectional morphology, with significantly more endings on all articles, pronouns, nouns, adjectives and verbs, increasing the size of the vocabulary over the size of a stem-oriented lexicon by a factor of about 4.
In extremely highly inflecting languages such as Finnish, the number of endings and the length of sequences of endings multiply out to increase the vocabulary by a factor of over 1000. Special morphological techniques have been developed (e.g. two-level morphology) to permit efficient calculation of inflected forms and to avoid finite but intractable explosion of lexicon size for highly inflecting languages (cf. Koskenniemi 1983; Karttunen 1983). These techniques have so far not been applied to any significant extent in speech technology.
The figures cited refer only to the sets of forms. When the form-function mapping, i.e. the association of a given inflected form with a morphosyntactic category, is considered, the figures become much worse. A single inflected adjective form such as guten in German has 44 possible interpretations which are relevant for morphosyntactic agreement contexts (cf. Gibbon 1995), with 13 feminine readings, 17 masculine readings, and 14 neuter readings, depending on different cases (nominative, accusative, genitive and dative) and different determiner (article) categories (strong, weak and mixed). It is possible to reduce the size of these sets by means of default-logical abbreviations in a lexical database, but for efficient processing they ultimately need to be multiplied out. Similar considerations apply to the other word categories, and to other highly inflecting languages.
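The "multiplying out" of compact paradigm descriptions can be illustrated abstractly. The feature sets below are a toy example, not the actual German data (they do not reproduce the 44 readings of guten, which are unevenly distributed across genders); they show only how a small feature space expands combinatorially when a compact description is made explicit.

```python
# Toy illustration of multiplying out a compact paradigm description
# into its full set of form-function mappings for efficient processing.
from itertools import product

cases = ["nom", "acc", "gen", "dat"]
genders = ["masc", "fem", "neut"]
declensions = ["strong", "weak", "mixed"]

# One ending such as -en covers many feature combinations at once;
# explicit processing needs each combination as a separate reading.
readings = [f"{c}.{g}.{d}" for c, g, d in product(cases, genders, declensions)]
print(len(readings))  # 36 readings from a 4 x 3 x 3 feature space
```

A default-logical lexical database would store only the compact description and expand it, as here, when the full set of readings is required.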
The inflectional properties of many languages imply that, for these languages, large vocabulary systems with complex grammatical constructions require prohibitively large fully inflected form inventories. Although the sets of mappings involved can be very large, the inflectional systems of languages define a finite number of variants for each stem, and therefore it may make sense in complex applications to define a rule-based ``virtual lexical database'' or a ``virtual lexicon'' which constructs or analyses each fully inflected word form on demand using a stem or morph lexicon with a morphological rule component.
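Such a virtual lexicon can be sketched in a few lines. The stem lexicon and ending tables below are toy assumptions, intended only to show generation and analysis of fully inflected forms on demand from a stem lexicon plus a small rule component.

```python
# Sketch of a "virtual lexicon": fully inflected forms are not stored,
# but constructed and analysed on demand.  All data are illustrative.
STEMS = {"walk": "verb", "table": "noun"}
ENDINGS = {"verb": ["", "s", "ed", "ing"], "noun": ["", "s"]}

def generate(stem):
    """Construct all fully inflected forms of a stem on demand."""
    return [stem + e for e in ENDINGS[STEMS[stem]]]

def analyse(form):
    """Map a fully inflected form onto (stem, ending) candidates."""
    return [(s, form[len(s):]) for s in STEMS
            if form.startswith(s) and form[len(s):] in ENDINGS[STEMS[s]]]

print(generate("walk"))   # ['walk', 'walks', 'walked', 'walking']
print(analyse("tables"))  # [('table', 's')]
```

The point is architectural: the set of forms remains finite and well-defined, but storage grows with the number of stems rather than with the (much larger) number of inflected forms.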
Lexica based on the morphological parts of words, coupled with lexical rules for defining the composition of words from these parts, are not widely used in current speech recognition practice. They are useful, however, in expanding lexica of attested forms to include all fully inflected forms, and in tools which verify the consistency of corpus transcriptions and lexica.
Terminology in this area is somewhat variable. In the most general usage, a stem is any uninflected item, whether morphologically simple or complex. However, intermediate stages in word formation by affixation, and in the inflection of highly inflected languages, are also called stems. The smallest stem is a phonological morph or an orthographic morph, i.e. the phonological or orthographic realisation of a lexical morpheme. Since stems may vary in different inflectional contexts, as affixes do, it is necessary to include information about the morphophonological (and morphographemic) alternations of such morphemes:
Knife:
  <surface phonology singular> = naIf
  <surface phonology plural> = naIv + z
  <surface orthography singular> = knife
  <surface orthography plural> = knive + s.
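Computationally, handling such stem alternations amounts to storing both alternants and letting a rule select between them. The entry layout and pluralisation rule below are illustrative assumptions, sketching how the plural forms of knife would be assembled from the alternating stem plus the regular endings.

```python
# Illustrative treatment of stem alternation: the entry stores both
# stem alternants; the rule component picks the right one per context.
knife = {
    "stem_sg": {"sampa": "naIf", "orth": "knife"},
    "stem_pl": {"sampa": "naIv", "orth": "knive"},
}

def plural(entry):
    """Attach the regular plural endings to the plural stem alternant."""
    return {"sampa": entry["stem_pl"]["sampa"] + "z",
            "orth":  entry["stem_pl"]["orth"] + "s"}

print(plural(knife))  # {'sampa': 'naIvz', 'orth': 'knives'}
```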
The use of morphological decomposition of the kind illustrated here has been demonstrated to bring advantages in medium size vocabulary speech recognition (Geutner 1995).
In a stem lexicon, the basic lexical key or lemma is the stem, which is represented in some kind of normalised notation. The most common kind of normalised or canonical notation has the following two properties:
For specific purposes, in which lexical entries need to be accessed on the basis of a specific property, indexing based, for instance, on the canonical phonemic representation, either of a fully inflected form or of the canonical inflected form, or even of the stem itself, may be used. This question is dealt with in more detail below.
As in the knife example, one particular form, for instance orthographic , of an entry is often used as a headword or lemma . From a technical lexicographic point of view, this form then has a dual function:
In modern spoken language lexicography, this distinction is central, and ignoring it may lead to confusion. It applies particularly where the primary criterion of access by word form is phonological, as in spoken language lexicography. The concept of an abstract lemma, deriving from recent developments in computational linguistics and their application to phonology and prosody, may be used in order to clarify the distinction (cf. Gibbon 1992): an abstract lemma may have any convenient unique name or number (or indeed be labelled by the spelling of the canonical inflected form, as already noted); all properties have equal status, so that the abstract lemma is neutral with respect to different types of lexical access, through spelling, pronunciation, semantics, etc. The examples of lexical entries given so far are based on the concept of an abstract lemma. The neutrality of the abstract lemma with respect to particular properties and particular directions of lexical access makes it suitable as a basic concept for organising flexible lexical databases. A lexicon based on a neutral abstract lemma concept is the basic form of a declarative lexicon, in which the structure of the lexicon is not dictated by the requirements of specific types of lexical access (as in a procedural lexicon), but by general logical principles. The distinction between declarative and procedural lexica is a relative one, however, which is taken up in the section on spoken language lexicon architectures.
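The neutrality of the abstract lemma can be made concrete by keying entries on arbitrary unique identifiers and building a separate index for each access direction. The identifiers and attribute names below are purely illustrative.

```python
# Sketch of abstract lemmas: neutral records under arbitrary unique
# identifiers, with one index per direction of lexical access.
entries = {
    "L001": {"orth": "table", "sampa": "teIbl",
             "sem": "artefactual horizontal surface"},
    "L002": {"orth": "dust", "sampa": "dVst",
             "sem": "just visible particles of solid matter"},
}

# Each index is derived from the same neutral data; none is privileged.
by_orth  = {e["orth"]:  lid for lid, e in entries.items()}
by_sampa = {e["sampa"]: lid for lid, e in entries.items()}

print(by_orth["table"])                    # L001
print(entries[by_sampa["dVst"]]["orth"])   # dust
```

Access by spelling (synthesis from text) and access by pronunciation (recognition) are then symmetrical operations over the same declarative data, which is the organisational point of the abstract lemma.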
The relations between orthographic , phonological, syntactic and semantic properties of lexical units are complex, and make a theoretically satisfying definition of ``lexical sign'' quite elusive. Lexical relations are either paradigmatic, and define classes of similar items, or syntagmatic, and define whole items in terms of relations between their parts.
Present discussion is restricted to the main paradigmatic relations in traditional terms; the expression of these relations in terms of semantic features, semantic markers or semantic components is not dealt with explicitly, though it figures implicitly in the notion of attribute-value structures which is referred to in the examples.
The syntagmatic relations (roles; collocational relations) are more controversial, and are not dealt with here. For further information on these and on semantic components, reference should be made to the results of the EAGLES Computational Lexica Working Group and to standard semantics textbooks such as Lyons (1977) or Cruse (1986).
The following systematised versions of traditional definitions express the main paradigmatic relations between lexical signs.
In addition to these lexical relations, there are a number of structural complexities which hold between different types of information.