Next: Lexical knowledge acquisition Up: Spoken Language Lexica Previous: Lexical content information

Lexicon structure

The area of lexicon structure deals with the organisation of lexical information in lexica. Models for lexical information, and types of lexical information, are dealt with in the preceding sections. Terminology varies considerably in this area. The structure of a spoken language lexicon may be seen from the following points of view:

Lexical formalisms, lexicon representation languages:: Representation conventions of various types (symbolic notations, programming languages, database languages, logical formalisms, purpose-designed knowledge representation languages), which are suitable for formulating lexical models.
Lexicon architecture:: The choice of basic objects and properties in the lexicon, and the structure of the lexicon as a whole, such as a table of items, a trie (decision tree), an inheritance hierarchy, a semantic network, a database.

Spoken language lexicon formalisms

Spoken language lexicon formalisms (representation languages) may be broadly classified according to their use:

Linguistically and phonetically based working notations.
Implementation languages for the operational phase.
Algebraic and logical formalisms for formal definition.

In Speech Technology contexts, often no distinction is made in practice. Where an ad hoc solution is required and where lexicon structure is simple, a lexicon may be written directly in a standard programming language suitable for high-speed runtime applications, traditionally Fortran but more recently C, or in a higher level language such as LISP or Prolog. Recent developments are moving towards knowledge representation languages which are specifically designed to meet all three of the above criteria equally well, in that they are useful working notations, have efficient implementations, and are formally well-defined.

A variety of lexical formalisms are currently used in Spoken Language lexica; some of these are also used for general written language lexica. A more detailed classification of formal representation systems may be given as follows:

General data structures (lists, tables or matrices, tree structures designed for optimal lexical access).
Programming languages (C for efficiency; LISP or Prolog for flexibility).
Database systems.
General text markup languages such as SGML.
Knowledge representation languages (inheritance networks, semantic networks, frame systems).
Linguistic knowledge representation languages, commonly based on attribute-value logics.
Lexical knowledge representation languages (attribute based inheritance formalisms) such as DATR.

General data structures definitions are required for specifying the general processing properties of a system, and are relevant for developers and for theoretical work on the complexity and efficiency of lexica and lexicon processing; standard textbooks on data structures and algorithms should be consulted for this purpose.

Conventional programming languages are generally used for performance reasons in runtime systems. They may also be used to implement small or simple lexica directly, in particular for rapid prototyping of these; this is not optimal software development practice, however, and not to be recommended for developing large or complex lexica, in particular those with highly structured linguistic information.

Database management systems (DBMSs) are widely used for general resource management, including large-scale lexica with rich information which needs to be accessed flexibly and efficiently. In the SAM project, an ORACLE database management concept for spoken language corpora and lexica was developed (cf. =1 (

; Dolmazon Caërou Barry 1990) ).

General text markup languages are used for integration with large, pre-analysed written corpora in the development of natural language processing systems and in statistical basic research in computational linguistics, and so far are not used in spoken language system development (cf. the results of the EAGLES Working Group on Text Corpora). Implementations of SGML are readily available.

Knowledge representation languages (KRLs) are used for developing complex semantic and conceptual knowledge representations, and for integrating spoken language front ends with knowledge based systems; see =1 (

; Schröder Sagerer Niemann 1987) , =1 (

; Sagerer 1990) ; more generally, cf. =1 (

; Bobrow Winograd 1977) , =1 (

; Brachman Levesque 1985) , =1 (

; Charniak McDermott 1985) , =1 (

; De Mori et al. 1984) , =1 (

; Young et al. 1989) .

Linguistic formalisms in general are discussed in the results of the EAGLES Working Group on Grammar Formalisms, which should be referred to in this connection.

Lexical knowledge representation languages (LKRLs) are a relatively new development. They are coming to be used in knowledge acquisition for integrated lexica which contain a variety of complex lexical information from phonology through morphology and syntax to semantics and pragmatics . They provide a means of bridging the gap between complexity of lexical information and easy-to-read representations using sign-based lexicon models. A LKRL which has been used in several natural language processing and language and speech projects is DATR (cf. =1 (

; Evans Gazdar 1989) , =1 (

; Cahill 1993) , =1 (

; Cahill Evans 1990) , =1 (

; Andry et al. 1992) ,

=1 (

; Gibbon 1991) , =1 (

; Gibbon 1993) , =1 (

; Bleiching 1992) , =1 (

; Langer Gibbon 1992) ). This is the language whose syntax conventions for basic attribute-value representations is used in this chapter. A number of public domain DATR implementations are available.

Lexicon architecture and lexical database structure

Lexicon architecture pertains to the choice of basic objects and properties in the lexicon, and to the overall structure of the lexicon. More formally, it defines the relation which assigns lexical properties to lexical entries.

The term ``architecture'' generally refers to the structure of system lexica, but the term is also justified in connection with lexical database structure, particularly when more complex relational or object-oriented structures are concerned.

The basic objects in terms of which an architecture may be defined were discussed in the section on lexical information for spoken language.

The overall structure of a spoken language lexicon is determined by a wide range of declarative , procedural and operational criteria such as the following:

The complexity of the information assigned to lexical entries.
The complexity of the relations defined between lexical entries.
The particular subset of objects and properties defined for a given application lexicon.
Linguistic and logical compression techniques such as redundancy rules or, more generally, inheritance hierarchies.
Task driven directionality of access.
Variety of information required for access (from phonological to pragmatic ).
Performance requirements of software (including lingware ) size and speed of access.
Techniques of acquisition and maintenance (with respect, for instance, consistency).

At the one extreme is the ideal notion of a fully integrated sign-based model with non-redundant specification of entries and property inheritance; in between is the efficient database management system used for large scale lexical databases, and at the other extreme is the simple pronunciation table which is the starting point for the training of speech recognition devices.

The choice of lexicon architecture on the basis of parameters such as those listed above, and taking into account practical constraints from the actual working environment, is application specific. There is no single principle of organisation which applies to all lexica.

The closest approximation to a neutral form of spoken language lexicon organisation is a sign-based general background lexicon organised as a database with flexible access. Such a lexicon is basically knowledge acquisition oriented, and can function as a source for the specialised lexica required for different speech synthesis and recognition applications. Specialised models for sublexica which are optimised for particular applications can then be formulated, and sublexica can be automatically compiled out of the main lexicon into notations and structures.

The organisation of a lexicon determines the general properties of the formalism to be used with the lexicon. Conversely, available formalisms determine tractable forms of lexicon organisation in terms of data structures, algorithms and programming environments (cf. =1 (

; Knuth 1973) , =1 (

; Carbonell Pierrel 1986) , =1 (

; Rudnicky et al. 1987) , =1 (

; Lacouture Normandin 1993) ). Object-oriented system architectures, with local encapsulation of all aspects of representation and processing, permit the construction of hybrid systems with functionally optimised components; by analogy, the lexicon itself can be conceived as a hybrid object system if required.

This is in effect the situation in current speech recognition technology, in which a more or less large set of HMMs representing words, for instance, can be seen as a procedurally sophisticated lexicon with acoustically driven lookup of keys which are then used to access the main lexicon. Although the standard perspective is to see the two components as separate, they can be seen as objects which are both located in an overall hybrid spoken language system lexicon.

Current research on new object-oriented interactive incremental spoken language system architectures raises many questions about the role of a lexicon in such a system. One major question is whether the lexicon is an object (or system of objects) in its own right, or whether the lexicon is distributed over the system components and is thus a virtual lexicon. Questions such as this are very much within the domain of ongoing basic research, and it would be premature to make specific recommendations at this point.

For a broader discussion of lexicon architectures, the work of the EAGLES Working Group on Computational Lexica should be consulted. Further discussion here will be restricted to the elementary case of pronunciation tables.

The architecture of spoken language system lexica

It has already been pointed out that at the current state of the art, spoken language systems tend to deal with the properties of words in a distributed but modular fashion. The following components of a speech recognition system, for example, may be said to constitute a lexicon in the strict sense of the term:

Word recognition as lexical key identification:: The task of identifying lexical keys in written language is straightforward, once the characters have been identified by some analogue input device such as a keyboard or a scanner with OCR (Optical Character Recognition) software and converted to discrete symbolic strings which act as identifiers for lexical access. The strings may be given an optimised representation for the purpose of lexical lookup, for example as a letter discrimination tree, which may be determiniatically processed, i.e. without any search in the general sense. Traditionally, this aspect of written language is not in the centre of attention. In spoken language word recognition, the situation is quite different: the main emphasis is on the process of converting from a continuous analogue signal to a set of symbolic strings, which represent hypotheses about lexicon keys, i.e. identifiers for lexical access; the mapping to a set means a non-deterministic situation which needs to be handled by sophisticated search techniques in order to find the best hypothesis. The techniques, such as signal transformations and Hidden Markov Models, which perform the identification of lexical keys, are, in the strict sense of the term, just as much a part of spoken language lexicography as ASCII codes and letter trees are a part of written language lexicography.
Stochastic language models as syntactic category and subcategory information:: The search process for identifying the best lexical key uses top-down knowledge about the distribution of words in sentence contexts. Distributional properties of words range from general linguistic categories such as Verb, Ditransitive Verb, etc., to the immediate string context consisting of n words, e.g. bigrams, trigrams. This information is not seen in speech technology as being lexical information; from a general lexicographic point of view, however, distributional information about is lexical information, and a repository of such information is part of a lexicon.
Domain information:: The application-specific and, in particular, domain-specific vocabulary and grammar of a speech recogniser is used to restrict search further. In more general terms, vocabulary and grammar restrictions define a sublanguage, technical language, or register of the language, which imposes general top-down constraints on search; in contrast, search involving unrestricted general vocabulary and syntax would be intractable. From a general lexicographic point of view, register constraints constitute pragmatic information about words which has its place in the lexicon.

The lexicon of a speech recogniser may consequently be said from a general lexicographic point of view to be modular, consisting basically of the Lexical Key Identification Module, the basic word recogniser, the Distributional Disambiguation Module, the stochastic language model which is statistically tuned to a given corpus, and the Lexical Lookup Module, which defines the lexical relation between the lexical key and its meaning).

The structure of lexical databases

The main features of spoken language lexical databases have already been discussed. In practice, a spoken language lexical database is often a set of loosely related simpler databases (e.g. pronunciation table, signal annotation file, stochastic word model, and a main lexical database with syntactic and semantic information), and is part of a larger complex of databases involving speech signal files, transcription files (orthographic and phonemic), and annotation (labelling) files which define a function from the transcriptions into the digitised speech signal. However, in the interests of consistency it is helpful to take a more general lexicographic point of view, and see a lexical database for spoken language development as a single database, in which relations between lexical items and their properties at all levels, from acoustics through word structure to syntax, semantics and pragmatics are defined.

The major problem in deciding how to organise a lexical database is the ambiguity of word forms. In a spoken language system, the focus is on the pronunciation, i.e. on phonemic word forms (not the orthography, though this is often used as a conveniently familiar form of representation). The key issue here is homophony, i.e. a phonemic word form associated with at least two different sets of lexical information, and thus logically involving a disjunction in the database.

In a simple traditional database model based on fixed-length records, in which each field represents a specific attribute of the entity which the record stands for, there is a record for each lexical entry associated with a homophone, uniquely identified by a serial number. However, for specific applications such as the training of a speech recogniser it is convenient to have just one record for each word form. In a database which is optimised for this application, the disjunction required by the homphone is within a single record, rather than distributed over alternative records which share the same value for the pronunciation attribute. Disjunctive information of this kind within the lexical database corresponds to nondeterministic situations and the use of complex search algorithms in actual spoken language systems.

A simple database type: pronunciation tables

Pronunciation tables hardly correspond to the intuitive concept of a lexical database, which implies a fairly high degree of complexity, but they are nevertheless a useful example of a basic kind of simple lexical database structure for the purpose of illustrating practical points of representation, modelling and organisation.

Pronunciation tables define the relation between orthographic and phonemic representations of words. Often they are defined as functions which assign pronunciations (frequently a set of variant pronunciations) to orthographic representations; this is an obvious necessity for text-to-speech lexical, but in speech recognition applications in which orthographic transcriptions (which are easier to make and check than phonemic transcriptions) are mapped to phonemic representations for the purpose of training speech recognisers, the use of a pronunciation table of this type is clearly relevant.

A pronunciation table which involves pronunciation variants (see below) provides a simple illustration of the ambiguity problem, with disjunctions within the database.

Pronunciation tables have to fulfil a number of criteria, in particular the criterion of unambiguous notation, of consistency with transliterations and transcriptions of a particular corpus, and of simple and fast processing.

General proposals for the interchange of lexical information about word forms, including morphological , phonological and prosodic information, have been made for different languages; they do not have standard status at the current time, but they are sufficiently similar to justify recommendation. A standard for French has been described (cf. =1 (

; Pérennou De Calmès 1987) , =1 (

; Autesserre Pérennou Rossi 1989) ), containing the following features:

Boundaries:
- morpheme : +
- word: #
- liaisonless group: ##
- phonological syntagma: § (in phrasal entries)
Phonemes (in IPA or SAMPA notation), including a notation for the French archiphonemes .
Phonological diacritics:
- latency mark " for consonants pronounced in liaison contexts or morphological linking
- consonant deletion mark ' (e.g. for final consonants)

For the spoken language lexicon in the German VERBMOBIL project the same basic principle has been adopted (cf. =1 (

; Bleiching Gibbon 1994) ), with extensions for incorporating prosodic information:

Boundaries:
- morpheme : +
- word in compounds : #
- word in phrases: ##
- syllable : .
- primary stress : '
- secondary stress : '' (two single quotes)
Additional conventions:
- The boundaries # and ## each imply coextensive + and . boundaries.
- Where + and . boundaries are coextensive, . is written before +.
- The stress marks '150 and '150 '150 are written immediately before the vowel, not before the syllable .

The following extract from the VERBMOBIL pronunciation table illustrates the VERBMOBIL WIF (Word form Interchange Format) convention; following current practice, it is organised according to orthographic keys.

The convention has been designed to permit the removal of information which is not required, or the selection of useful subsets of the table using simple UNIX tool commands; the use of '150 for primary and secondary stress permits simple generalisation over both. The plain SAMPA representation is given in the following table.

A more complex lexical database

In a complex project, lexical information from several sources may need to be integrated in a fashion which permits flexible further development work even when the information cannot easily be reduced to a logically fully consistent and well-defined system. A situation such as this will arise when alternative modules, based on different principles, are to be made available for the same system. For instance, two different syntactic components will define different forms of syntactic ambiguity and be associated in different ways with semantic ambiguities. And morphological ambiguities arise with inflected forms in highly inflecting languages. In order to achieve any kind of integration, at least the word form representations will need to be consistent. The hybrid information sources will have to be represented as conjunctions of the values of independent attributes (i.e. fields within a record), with separate disjunctions, where relevant, within fields.

In general, spoken language projects have been based on the idealised notion of a single, well-defined, consistent and complete; this situation might reasonably be expected to correspond to the reality of a system developed in a single laboratory at one specific location. However, larger scale projects need to be able to cope with hybrid lexical information of the kind just outlined. A project of this type is the VERBMOBIL project funded by the German government, with international participation.

An example of a database structure designed for hybrid lexical information in this kind of context is given here.

Internal database structure (standard UNIX database format):
- database: header records followed by body records
- header: header_record_1 header_record_2 header_record_3
- body: body_record_1 ... body_record_n
- header_record_1: (record containing attribute names, i.e. field names)
- header_record_2: (record defining internal conjunctive/disjunctive structure of attribute values, i.e. field contents)
- header_record_3: (record containing source of information)
- body_record_i: (record containing values for a given entry)
Example of record structure:
- Header: (the designations A3 etc. refer to projects delivering particular types of information)
  Record 1:
  Orth A3 B1 C1 D1
  
  Record 2:
  Orth A3.Phon B1.Wortart,B1.Kasus,B1.Genus,B1.Num,B1.Detagr,
  B1.Definit,1.Semobj,B1.Semattr C1.Syncat1_C1.Syncat2 D1.Syncat
  Record 3:
  reference.ort a3joha.lex b1naeve.lex c1jung.lex d1peters.lex
- Body:
  Mutter mU!t6 nomen_akk,fem,sg,@empty@,@empty@,Raute,@empty@;
  nomen,nom,fem,sg,@empty@,@empty@,Raute,@empty@ Nom,OBJEKTTYP nom
- Note that the spaces designate conjunction (i.e. field separators), while the semicolons designate disjunction

Example of human-readable formatting

Entry 372: Mutter
  Orth: Mutter
  A3:   mU!t6
  B1:   nomen,akk,fem,sg,@empty@,@empty@,Raute,@empty@
        nomen,nom,fem,sg,@empty@,@empty@,Raute,@empty@
  C1:   Nom,OBJEKTTYP
  D1:  
nom

Example of UNIX tool for human-readable formatting (transformation of selected named attributes of a database record into the attribute format given above):

#!/bin/sh
# dbviewr
# Prettyprint of single entries
# and
attributes in lexicon database
# with regular expression matching
# Uses UNIX tools:
#  gawk (i.e. gnu awk), sed, tr
# (Note: sed and tr are used for illustration, and would
#        normally be emulated in gawk)
# Database structure:
# Header: Record
1: Fields containing attribute names.
#         Record 2: Other information.
# Body:   Records >2: Database relation.

if [ $# -lt 3 ]
 then
 echo "Usage: dbview dbname attribute* regexp"
 echo "dbview V2, 15.1.1994, D. Gibbon"

exit
fi

gawk '

# Transfer the keyword from the command line to a variable.
BEGIN {keyword = ARGV[ARGC-1]}

# Identify the attributes in the first record whose values
# are to be queried.
NR == 1 {{for (i=2 ; i < ARGC ; i++)
        {for (j=1 ; j
<= NF ; j++)
        if (ARGV[i] == $j) {attrib[j] = "yes"; attname[j] = $j}}}
        {for (i = 2 ; i < ARGC ; i++)
        ARGV[i]=""}}

# Find required keyword entry/entries in body of database
# Print required values and
set 'found' flag
$1 ~ keyword && NR > 2  {print "\nEntry " NR-2 ":", $1
        {for (i=1 ; i <= NF ; i++)
        if (attrib[i] ~ "yes") {print "  " attname[i] ":\t" $i
                   
            found="yes"}}}

{last=NR}

# Print message if no entry was found for the keyword.
END {if (found!="yes") {print "No entry found for",keyword,"in",ARGV[1]}}
 ' $* |

# Translate all sequences of two
colons into a slash, and all single
# colons into a single colon followed by eight spaces.
sed -e "s/;;/\//g
        s/;/&        /g" |

# Translate all single colons into a linefeed (newline).
tr ";"
"\012"

For an overview of useful UNIX tool programming techniques, see =1 (

; Aho Kernighan Weinberger 1987) , =1 (

; Dougherty 1990) , =1 (

; Wall Schwartz 1991) .

Recommendations on lexicon structure

The development of re-usable resources is likely to be best supported by the use of a LKRL with an efficient and flexible implementation, and the use of formalisms in this category can therefore be recommended for development purposes where re-usable resources are concerned.
Whatever the choice of formalism for a spoken language lexicon, the interface between the lexicon and other components of a spoken language system needs to be given close attention.
A given single formalism is highly unlikely to provide optimal features for the development and construction of all components of a given system in the near future.
Object-oriented database structures and system architectures are likely to be used increasingly in spoken language systems in order to permit the encapsulation of module-specific optimal formalisms, implementations and information.
Weigh the advantages of the use of specialised commercial databases against the practical convenience and portability of ASCII database standards associated with the UNIX operating system and UNIX ASCII processing tools for database format conversion and database access.
Consider whether a system lexicon should be a single component, or, as is more usual in speech recognition applications, a ``distributed lexicon'' with lexical information associated with the speech recognition, stochastic language model and lexical search components.

Next: Lexical knowledge acquisition Up: Spoken Language Lexica Previous: Lexical content information

WWW Administrator
Fri May 19 11:53:36 MET DST 1995