Next: Morphological information Up: Spoken Language Lexica Previous: Types of lexical

Lexical surface information

Orthographic information

Orthography has been used in several different roles in the context of spoken language lexica:

(1) Convenient general reference labels for words, due to the high level of awareness of, familiarity with and standardisation of orthography in literate societies.
(2) Convenient identifying names for lexical entries, for `normal lemma' forms, and for headwords in complex lexicon entries for related words.
(3) Convenient identifying names for word hypotheses in word lattices, as lexical lookup keys.
(4) Visualisation of word hypotheses in a development system.
(5) Representation of the orthographic properties of words.

Each of these functions is distinct and needs to be kept conceptually separate in order to avoid confusion. On the side of word forms, functions (1) and (2) are not particularly problematic. Function (3) is traditionally a feature of speech recognition systems for relatively small vocabularies. The larger the vocabulary, however, the greater the danger of introducing unnecessary orthographic noise, i.e. intrusive artefacts due to homography (words with identical spelling and different pronunciation); for this reason, in new architectures, phonological (e.g. phonemic) representation in word graphs is preferred. Function (4) is unproblematic, though similar reservations as with (3) are to be noted. Function (5) is obviously essential for written output of any kind; however, it is often confused with both functions (2) and (3).
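
The homography problem with orthographic lookup keys can be illustrated with a minimal sketch. The mini-lexicon, the lemma identifiers and the function name below are hypothetical, chosen only for illustration; the transcriptions follow SAMPA conventions. The sketch reports spellings that are shared by different lexical entries with different pronunciations, which are the cases where an orthographic key alone is ambiguous.

```python
from collections import defaultdict

# Hypothetical mini-lexicon: (orthography, lemma identifier, SAMPA transcription).
ENTRIES = [
    ("lead", "lead_verb", "li:d"),
    ("lead", "lead_noun_metal", "lEd"),
    ("tin", "tin_noun", "tIn"),
    ("either", "either", "aID@"),   # two variants of the SAME word
    ("either", "either", "i:D@"),
]

def heterophonic_homographs(entries):
    """Return spellings shared by different lemmas with different pronunciations."""
    by_spelling = defaultdict(set)
    for spelling, lemma, sampa in entries:
        by_spelling[spelling].add((lemma, sampa))
    result = {}
    for spelling, readings in by_spelling.items():
        lemmas = {lemma for lemma, _ in readings}
        pronunciations = {sampa for _, sampa in readings}
        # Variants of one word (e.g. "either") are not homographs.
        if len(lemmas) > 1 and len(pronunciations) > 1:
            result[spelling] = sorted(readings)
    return result

AMBIGUOUS = heterophonic_homographs(ENTRIES)
```

In this sketch only "lead" is reported; "either" has two pronunciations but a single lemma, so it counts as a pronunciation variant rather than a homograph.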

Orthography has the advantage of being highly standardised, except for certain regional variants (British and American English; Federal, Swiss, and Austrian German) and variations in publishers' conventions (e.g. British English -ise/-ize as in standardisation/standardization, capitalisation of adjectives in nominal function in German, as in die anderen / die Anderen, or variations in hyphenation conventions). Orthography is given further attention in the section on lexical representation.

A standard orthographic transliteration is often used for convenience as a means of representing and accessing words in a spoken language lexicon. There are several reasons for this:

Familiarity to all educated speakers of the language.
High level of standardisation in comparison with theory-influenced phonological transcriptions.
Sufficient proximity to phonological form, which ensures a close mapping to pronunciation at the level of whole words (not necessarily in the details of grapheme to phoneme mapping) in small vocabularies in some languages; French and English are notorious exceptions.

Most European languages have highly regulated orthographies, the use of which is associated with social and political rewards and punishments. Variation is found particularly in the treatment of derived and compound words (e.g. separation and hyphenation) and in the use of typographic devices such as capitalisation.

For use in spoken language lexica, particularly in word lists used for training and testing recognisers, consistency is essential, and additional conventions are often required in order to meet the criterion of general computer readability in the case of special letters and diacritics. Although it cannot be regarded as a standard, it is becoming common practice to use the ASCII codings of LaTeX adaptations for specific countries. For example, a standard computer-readable orthography for German has become widely accepted for German speech recognition applications; it marks special characters, in particular those with an Umlaut diacritic, as shown in the table.
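
As a minimal sketch of such a convention, the following assumes a LaTeX-style coding in which a double quote precedes the base letter ("a for ä, "s for ß). The exact mapping is an assumption made for illustration, not a reproduction of the table referred to above, and the function names are hypothetical.

```python
# Assumed mapping, modelled on LaTeX German conventions ("a for ä, "s for ß).
# Illustrative only; a project should fix its own table explicitly.
TO_ASCII = {
    "ä": '"a', "ö": '"o', "ü": '"u',
    "Ä": '"A', "Ö": '"O', "Ü": '"U',
    "ß": '"s',
}
FROM_ASCII = {code: char for char, code in TO_ASCII.items()}

def encode(word):
    """Replace special letters by their two-character ASCII codes."""
    return "".join(TO_ASCII.get(ch, ch) for ch in word)

def decode(text):
    """Inverse mapping; scans left to right and consumes two characters per code."""
    out = []
    i = 0
    while i < len(text):
        pair = text[i:i + 2]
        if pair in FROM_ASCII:
            out.append(FROM_ASCII[pair])
            i += 2
        else:
            out.append(text[i])
            i += 1
    return "".join(out)
```

With this mapping, encode("Füße") gives F"u"se, and decoding restores the original spelling. A literal double quote directly before one of the coded letters would be misread by decode, so word lists using such a convention should not contain other uses of the quote character.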

The results of the EAGLES Working Groups on Text Corpora and Lexica should be consulted on orthographic and other matters pertaining to written texts.

Pronunciation information

Pronunciation information is much more application specific (and indeed theory specific) than orthographic information. Standardly, information about phonemic structure is included in the form of a phonemic transcription of a standard canonical or citation form pronunciation, i.e. the pronunciation of a word in isolation in a standard variety of the language. Often the phonemic transcription is enhanced by including prosodic information such as the stress position (Dutch, English, German), type of tonal accent (Swedish), syllable boundaries, word boundaries in compound words, and word and phrase boundaries in phrasal idioms. Morphological information (morph boundaries, as well as the boundaries of words and phrases) is relevant to stress patterning, and is sometimes also included.

A particularly thorny question is the inclusion of information about pronunciation variants. The following rules of thumb can be given:

Pronunciation lexica for synthesis generally require only one standard (canonical) pronunciation; however, variants of these with different prosodic contexts are often used.
Pronunciation lexica for recognition require a distinction to be made between variants of the same word, and variants which are associated with the same spelling but different words (heterophonic homographs).
Strictly speaking, pronunciation lexica for recognition need to list only those variants which are idiosyncratic and cannot be predicted by rule (e.g. English either /aID@/ --- /i:D@/). Variants which are general and regular (such as the reduction of schwa + liquid or nasal to a syllabic liquid or nasal) can be calculated using pronunciation rules (phonological rules): English running /rVnIN/ --- /rVnIn/, German einem /aIn@m/ --- /aIn=m/ --- /aIm/.
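
The division of labour between listed and calculated variants can be sketched as follows. The rule below covers only the schwa reduction mentioned above, written in SAMPA with "@" for schwa and "=" before the syllabic consonant as in /aIn=m/; the names are hypothetical, and a real rule set would be considerably larger.

```python
import re

# Regular rule: schwa + liquid or nasal -> syllabic liquid or nasal.
SCHWA_SONORANT = re.compile(r"@([lmnN])")

def regular_variants(canonical):
    """Return the canonical form plus any variant predicted by the rule."""
    variants = [canonical]
    reduced = SCHWA_SONORANT.sub(r"=\1", canonical)
    if reduced != canonical:
        variants.append(reduced)
    return variants

# Idiosyncratic variants cannot be predicted and must be listed explicitly.
IDIOSYNCRATIC = {"either": ["aID@", "i:D@"]}

def pronunciations(word, canonical):
    """Listed variants take precedence; otherwise apply the regular rule."""
    if word in IDIOSYNCRATIC:
        return IDIOSYNCRATIC[word]
    return regular_variants(canonical)
```

Here pronunciations("einem", "aIn@m") yields the canonical form and the syllabic variant, while the two forms of either come from the explicit list.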

Although phoneme is a technical term with somewhat different definitions in different theoretical contexts, and although there are technical arguments due to Generative Phonology (cf. Chomsky & Halle 1968) which show that the notion of phoneme leads to inconsistencies, the core of phoneme theory is relatively standard. In linguistics handbooks, the phoneme is commonly defined as the minimal distinctive (meaning-distinguishing) unit (temporal segment) of sound. In the following fairly standard definition, the distinctiveness criterion is implicit in the concept of a system; the concept of a sound covers possible variants of a phoneme, e.g. English aspirated word-initial /p/ as opposed to unaspirated /p/ in the context /sp.../ (Crystal 1985: 228):

phoneme (phonemic(s)) The minimal unit in the sound SYSTEM of a LANGUAGE ... Sounds are considered to be members of the same phoneme if they are phonetically similar and do not occur in the same ENVIRONMENT.

A fairly complete definition is thus based on distinctiveness, minimality, phonetic similarity and distributional complementarity. Phoneme definitions are differential or relational definitions, illustrated by the notion of minimal difference between two words in a minimal pair (such as the items in the set of English words pin--tin--kin--fin--thin--sin--shin--chin--bin--din--gin, or in standard SAMPA computer readable phonemic transcription: /pIn--tIn--kIn--fIn--TIn--sIn--SIn--tSIn--bIn--dIn--dZIn/). Phonemes defined in this way are further classified as bundles of distinctive features. Operationally, phonemes are defined by procedures of segmentation and classification (reflected, for example, in the recognition and classification components of automatic speech recognition systems):

Segmentation is the procedure of isolating the minimal distinctive temporal phonetic segments (phones).
Classification is the procedure of classifying phones as allophones (phonetic alternants of the same phoneme), mainly on the grounds of their phonetic similarity and their complementary distribution (i.e. their occurrence in complementary contexts as contextual variants of that phoneme).
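
The relational character of the minimal pair test can be made concrete with a small sketch. It assumes that words are already segmented into SAMPA phoneme symbols and that affricates (/tS/, /dZ/) count as single units, which is one of the analysis decisions on which descriptions differ. The word pit /pIt/ is added to the set from the text purely for contrast, and the function names are hypothetical.

```python
from itertools import combinations

# Words as sequences of SAMPA phoneme symbols (affricates treated as one unit).
WORDS = {
    "pin": ("p", "I", "n"),
    "tin": ("t", "I", "n"),
    "thin": ("T", "I", "n"),
    "chin": ("tS", "I", "n"),
    "gin": ("dZ", "I", "n"),
    "pit": ("p", "I", "t"),   # added for contrast; not in the original set
}

def is_minimal_pair(a, b):
    """Two words form a minimal pair if they differ in exactly one segment."""
    return len(a) == len(b) and sum(x != y for x, y in zip(a, b)) == 1

def minimal_pairs(words):
    """List all minimal pairs, in alphabetical order of the orthographic keys."""
    items = sorted(words.items())
    return [
        (w1, w2)
        for (w1, p1), (w2, p2) in combinations(items, 2)
        if is_minimal_pair(p1, p2)
    ]

PAIRS = minimal_pairs(WORDS)
```

In this sketch pin and pit form a minimal pair (they differ only in the final segment), whereas pit and tin do not, since they differ in two positions.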

In contrast to orthographic representations, which, for social and cultural reasons, are highly standardised common knowledge, direct lexical representations of pronunciation count as highly technical, and are consequently theory and application specific. The most widely used representations in pronouncing dictionaries for human use, foreign language teaching, and in spoken language systems are phonemic transcriptions.

Phonemic descriptions are available for several hundred languages, and phonemic transcriptions based on these are suitable for constructing roman orthographies for languages which have orthographies based on different principles or no orthography at all. For a given language, phonemic descriptions differ peripherally (for instance: Are diphthongs and affricates one phoneme or two?). On these grounds, phonemes are in general the units of choice for phonological transcriptions in spoken language system lexica. Other, more specialised types of representation such as feature matrix representations required by all modern phonological descriptions, and even autosegmental directed acyclic graph representations, or metrical tree graph and histogram representations (cf. Goldsmith 1990) are increasingly finding application in experimental systems (cf. Kornai 1991, Carson-Berndsen 1993, Church 1987a, Church 1987b) because of their richness and their more direct relation to the acoustic signal. However, at the lexical level, they can generally be calculated from the more compact phonemic representations. Because of the widespread use of phonemes, the concept is discussed in more detail below; for fuller explanations, textbooks in phonology should be consulted.

The central question in phonological lexical representation in cases where the notion of phoneme alone is not fully adequate, is that of the level of representation (or level of abstraction). There are three main levels, each of which is an essential part of a full description, and which need to be evaluated for a given application:

Morphophonemic :

The morphophonemic level provides a simplification of phonological information with respect to the phonological level; the simplifications utilise knowledge about the morphological structure of words, and permit the use of morphophonemes (a near-synonym is archiphoneme), which stand for classes of morphologically and phonologically related phonemes.

There are no standard conventions for the representation of morphophonemes, whether computer readable or not (but see the SAMPA alphabet for French, Appendix A), though capital letters are often used in linguistics publications. Note that this use of capital letters at the morphophonemic level should not be confused with the use of capital letters in the SAMPA alphabet at the phonemic level.

Citations of morphophonemic representations are often delimited with brace brackets {...}.

A standard example of a morphophoneme is the final obstruent in languages with final obstruent devoicing, including Dutch and German. For example, the phonemic representation German Weg /ve:k/ `way' --- Wege /ve:g@/ `ways' corresponds to a morphophonemic representation {ve:G} --- {ve:G+@}, which simplifies the description of the stem of the word. The morphophoneme {G} stands for the phoneme set {/k/, /g/}, and selection of the appropriate member of the set (the appropriate feature specification) is triggered by the morphological boundary and neighbouring phonological segments. Alternatively the morphophoneme may be said to consist of the underspecified feature bundle shared by /k/ and /g/, or more technically, the feature bundle which subsumes the feature bundles of /k/ and /g/.

An example from English is the alternation /f/ --- /v/, as in knife /naIf/ --- knives /naIvz/, which can be represented morphophonemically as {naIV} --- {naIV+z}. The morphophoneme {V} stands for the phoneme set {/f/, /v/}. Here, too, selection of the phoneme (specification of the underspecified subsuming feature bundle) is determined by the morphological boundary and the phonological properties of neighbouring segments.
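
The selection of a phoneme for a morphophoneme can be sketched as a small realisation function. To avoid the clash between morphophonemic capitals and SAMPA capitals noted above, the sketch writes morphophonemes as "{G}" and "{V}" inside a list of symbols; this notation, the tiny vowel inventory and the function name are assumptions made for illustration. The conditioning is deliberately simplified: {G} is voiced only before a vowel, and {V} is voiced whenever a suffix follows.

```python
# Hypothetical, deliberately tiny vowel inventory (SAMPA symbols).
VOWELS = {"@", "a", "e:", "i:", "aI"}

def realise(symbols):
    """Map a morphophonemic symbol sequence to a phonemic string.

    "+" marks a morph boundary; "{G}" and "{V}" are morphophonemes.
    """
    out = []
    for i, sym in enumerate(symbols):
        if sym == "+":
            continue  # boundaries are not pronounced
        # Next real segment to the right, ignoring morph boundaries.
        following = [s for s in symbols[i + 1:] if s != "+"]
        nxt = following[0] if following else None
        if sym == "{G}":
            # Final obstruent devoicing: /g/ only before a vowel.
            out.append("g" if nxt in VOWELS else "k")
        elif sym == "{V}":
            # Simplification: voiced /v/ before any suffix, /f/ word-finally.
            out.append("v" if nxt is not None else "f")
        else:
            out.append(sym)
    return "".join(out)
```

For example, realise(["v", "e:", "{G}"]) gives ve:k and realise(["v", "e:", "{G}", "+", "@"]) gives ve:g@, so a single stem entry covers both forms.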

A corresponding level is necessary for the description of spelling: cf. variations such as English y-ie in city --- cities, or German s-ss-ß as in Bus --- Busse, Kuß --- Küsse and Fuß --- Füße.

Morphophonemic representations augmented by realisation rules are a powerful compression technique for reducing lexicon size:

Lexica can be stem-based, and thus have fewer entries, and all inflections can be automatically calculated by rule for any stem in the lexicon.
Morphotactic and morphophonological rules can be used for extending lexica of fully inflected attested forms, and for checking such lexica for consistency.

For requirements such as these, the use of morphophonemic representations, supplemented by morphological construction rules and morphophonemic mapping rules, is recommended (cf. Koskenniemi 1983, Karttunen 1983 and Ritchie et al. 1992 for descriptions of standard technologies).

Phonemic:

The phonemic level is a standard intermediate level corresponding to criteria outlined in more detail below. The standard European computer readable phonetic alphabet is SAMPA (Appendix A): this alphabet is used for the main languages of the European Union, and is recommended for this purpose. The internationally recognised standard alphabet for phonemic representations is the International Phonetic Alphabet (IPA); this alphabet is shown in Appendix A. One of the main functions of the International Phonetic Association since its inception over 100 years ago has been to coordinate and define standards for this alphabet.

Until relatively recently, the special font used for the IPA has made it difficult to interface with spoken language systems, and for this reason a number of computer-readable encodings of subsets of the IPA have been made for various languages (cf. Allen 1988, Esling 1988, Esling 1990, Jassem & Lobacz 1989, Ball 1991). The standard computer phonetic alphabet for the main languages of the European Union is the SAMPA alphabet, developed in the ESPRIT SAM and SAM-A projects (cf. Wells 1987, Wells 1989, Wells 1993b, Wells 1993a, Llisterri & Mariño 1993); see also Appendix A. SAMPA is widely used in European projects, both for corpus transcription and for lexical representations (see also the chapter on Spoken Language Corpora).

However, there is a standard numerical code for IPA symbols (cf. Esling 1988, Esling 1990; Appendix A), and developments in user interfaces with graphical visualisation in recent years are leading to the increasing use of the IPA in its original form, particularly in the speech lab software which is used in spoken language system development.

Citations of phonemic representations are standardly delimited by slashes /... /.

Phonetic:

At the phonetic level further details of pronunciation, beyond the phonemically minimal, are given. Since the relation between the phonemic and the phonetic level can be described by general rules mapping phonemes to their detailed realisations (allophones) in specific contexts (cf. Woods & Zue 1976), it is strictly speaking redundant to include these regular variants in a lexicon. However, for reasons of efficiency, detailed word models may be calculated using phonological rules and stored. Essentially this is a software decision: whether to use tables (for efficiency of lookup) or rules (for compactness and generality) for a given purpose.
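
The choice between tables and rules can be sketched with a single allophonic rule, the aspiration of word-initial English /p/. The symbol "p_h" for the aspirated allophone, the two-word lexicon and the function names are assumptions made for this illustration; the point is only that the same rule can be applied once and stored, or applied on demand.

```python
def aspirate(phonemes):
    """Allophonic rule: word-initial /p/ is aspirated; /p/ after /s/ is not."""
    phones = list(phonemes)
    if phones and phones[0] == "p":
        phones[0] = "p_h"   # assumed notation for the aspirated allophone
    return phones

# Phonemic lexicon (hypothetical entries, SAMPA symbols).
LEXICON = {
    "pin": ["p", "I", "n"],
    "spin": ["s", "p", "I", "n"],
}

# Table option: apply the rule once and store detailed word models for fast lookup.
WORD_MODELS = {word: aspirate(phonemes) for word, phonemes in LEXICON.items()}

def word_model_by_rule(word):
    """Rule option: store only phonemes and compute the detail when needed."""
    return aspirate(LEXICON[word])
```

Both routes give the same word models; the table costs memory and must be regenerated when a rule changes, while the rule route costs computation at lookup time.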

A specific version of the phonetic level of transcription is phonotypic transcription, defined as a mapping from the phonemic level using regular phonological rules of assimilation, deletion and epenthesis (cf. Autesserre, Pérennou & Rossi 1989); this level is frequently used for generating additional word models to improve speech recognition. Since the amount of phonetic detail which can be processed depends heavily on the vocabulary size and the number of phonological rules which are considered relevant, no general recommendation on this can be given.

There is no widely used standard ASCII encoding of the entire IPA for computer readable phonetic representations and therefore no recommendations can be given on this. Currently, individual laboratories use their own enhancements of phonemic representations. However, the fuller encodings mentioned in connection with the phonemic level of transcription are eminently suitable for interface purposes at the phonetic level, and will no doubt be increasingly used where more detailed phonetic information is required.

Citations of phonetic forms are standardly delimited by square brackets [ ... ].

The chapter on Spoken Language Corpora should also be consulted in respect of levels and types of representation.

Prosodic information

The area of word prosody, and, more generally, the description of other prosodic units which have quasi-morphemic functions, is gradually emerging as an important area for spoken language lexica. For present purposes, prosodic properties are defined as properties of word forms which are larger than phonemes; further specification in phonetic terms (e.g. F0 patterning) and in semantic terms (e.g. attitudinal meaning) may also be given but are not essential for present purposes.

One type of lexical information on prosody pertains to phonological or morphological properties of words, such as Swedish pitch accents, or stress positions in words (for German cf. Bleiching 1992). Some aspects of word prosody are predictable on the basis of the regular phonological and morphological structure of words, but some are idiosyncratic. Examples in English where word stress is significant include the noun-verb alternation type as in export - /''EkspOt/ (Noun), /Eksp''Ot/ (Verb). In German, word stress is significant for instance in distinguishing between compound separable particle verbs and derived inseparable prefixed verbs as in übersetzen - /''y:b6z

It has been shown (cf. Waibel 1988) that taking word prosody into account in English can produce a significant improvement in recognition rate.

In addition, there is lexical information associated with prosodic units which occur independently of particular words, and therefore may themselves be regarded as lexical signs and be inventoried in a prosodic lexicon (cf. Aubergé 1992). To give a highly simplified example in a basic attribute-value notation, a prosodic lexicon for an intonation language might have the following structure.

Terminal_1:
  <phonetics pitch> = fall
  <semantics>       = statement or instruction.

Terminal_2:
  <phonetics pitch> = rise
  <semantics>       = question or polite instruction.

This kind of information, in which prosodic categories function as a kind of morpheme with an identifiable meaning, is generally not regarded as lexical information, but treated as a separate layer of organisation in language. Intonation is being taken increasingly into account for prosodic parsing in two main senses of the term:

Analysis of the speech signal in respect of the fundamental frequency (F0, F-zero) trajectory, for speech recognition.
Analysis of sentence structure for the generation of intonation patterns in speech synthesis.

Prosodic representation in the lexicon is in general restricted to the prosodic properties of words, such as stress position in English, Dutch, and German words, or tonal accent in Swedish words, or to rhythmically relevant structures such as the syllable and the foot. For spoken language processing purposes where prosody plays a role, it is also necessary to include an inventory of prosodic forms, and their meanings, which play a role at the sentence level, independently of specific words: i.e. a prosodic lexicon.
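
An inventory of this kind can be sketched as nested attribute-value structures. The entries below mirror the simplified Terminal_1 and Terminal_2 example given earlier; the representation as Python dictionaries and the helper names are assumptions made for illustration, not a proposal for a lexicon format.

```python
# Prosodic lexicon as nested attribute-value structures (illustrative only).
PROSODIC_LEXICON = {
    "Terminal_1": {
        "phonetics": {"pitch": "fall"},
        "semantics": ["statement", "instruction"],
    },
    "Terminal_2": {
        "phonetics": {"pitch": "rise"},
        "semantics": ["question", "polite instruction"],
    },
}

def lookup(lexicon, entry, path):
    """Follow an attribute path such as ("phonetics", "pitch")."""
    value = lexicon[entry]
    for attribute in path:
        value = value[attribute]
    return value

def entries_with_pitch(lexicon, pitch):
    """Analysis direction: from an observed pitch movement to candidate entries."""
    return [
        name for name, avm in lexicon.items()
        if avm["phonetics"]["pitch"] == pitch
    ]
```

The path lookup corresponds to evaluating <phonetics pitch> for an entry, and entries_with_pitch shows the reverse direction that a recogniser would need, from a detected final rise to its possible meanings.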

The IPA defines means of representing both types of information, and a subset (for word prosody) has been encoded in the SAMPA alphabet. However, the state of knowledge in the area of prosody is less stable than in the area of segmental word structure, and a range of different conventions are available (cf. Bruce 1989); in this area, there are even SAMPA `dialects', for instance replacing SAMPA " and % for primary and secondary stress by the more iconic ' (single quote) and '' (two single quotes) or " (double quote).

For American English, the ToBI (Tones and Break Indices) transcription has been extensively applied, but this system is not immediately applicable to other languages, though experimental applications and adaptations are being developed (see also the chapter on Spoken Language Corpora).

In more theoretically oriented spoken language lexicography within the VERBMOBIL project, attribute-based formal representations of prosodic features in the lexicon have been developed using the ILEX (Integrated Lexicon) model and the lexical knowledge representation language DATR (cf. Bleiching 1992, Gibbon 1991).

There is an increasing tendency no longer to regard prosodic representations as totally exotic and quite unlike anything else, and a degree of convergence can be observed. But there is still insufficient consensus on lexical prosodic features to permit generally valid recommendations to be made for prosodic representations in the lexicon. For most purposes, plain SAMPA-style symbols will be adequate, though an extended set of conventions is available in the SAMPROSA alphabet (see Appendix). For covering new ground with extended lexica for use with discourse phenomena at the dialogue level, a lexical knowledge representation language with a more general notation, as illustrated above, may be more appropriate.

Recommendations on lexical surface information

Define the basic lexical entry type, as recommended in the previous section, and its notation.
Establish a machine-readable orthography convention.
Define the phonological level of representation (morphophonemic, canonical phonemic).
Specify requirements in respect of word prosody representation.
Specify requirements in respect of phrasal and discourse prosody (intonation) information.
Use the European standard machine readable phonemic alphabet SAMPA (see Appendix).
Select an appropriate prosodic representation, such as SAMPA, SAMPROSA, or ToBI notation, where required.
Ensure that the relation between notations and representations used in the lexical database and the system lexicon are well-defined, and that they are completely consistent with notations and representations in other resources such as corpora and in the different parts of the system, such as the word lattice produced by the speech recogniser, operated on by a stochastic language model, and further processed by a sentence parser.
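
One simple way to check notational consistency across resources is to validate every transcription against a declared symbol inventory. The sketch below assumes a small hypothetical subset of SAMPA symbols and uses greedy longest-match segmentation; the inventory, the function names and the sample entries are illustrative only, and a real check would use the full inventory defined for the language in question.

```python
# Hypothetical subset of a SAMPA inventory; a real one is defined per language.
INVENTORY = {
    "p", "t", "k", "b", "d", "g", "f", "v", "T", "D", "s", "z", "S",
    "m", "n", "N", "l", "r", "tS", "dZ",
    "I", "V", "@", "aI", "i:", "e:",
}

def tokenise(transcription, inventory=INVENTORY):
    """Greedy longest-match segmentation; raises ValueError on unknown symbols."""
    longest = max(len(symbol) for symbol in inventory)
    symbols = []
    i = 0
    while i < len(transcription):
        for size in range(longest, 0, -1):
            chunk = transcription[i:i + size]
            if chunk in inventory:
                symbols.append(chunk)
                i += size
                break
        else:
            # No symbol of any length matched at this position.
            raise ValueError(f"unknown symbol at position {i} in {transcription!r}")
    return symbols

def inconsistent_entries(lexicon):
    """Return the keys whose transcriptions use symbols outside the inventory."""
    bad = []
    for word, transcription in lexicon.items():
        try:
            tokenise(transcription)
        except ValueError:
            bad.append(word)
    return bad
```

The same check can be run over corpus transcriptions and recogniser output so that all resources are validated against one inventory. Greedy matching always prefers /tS/ over /t/ + /S/, so sequences that are genuinely ambiguous need an explicit separator in the notation.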


WWW Administrator
Fri May 19 11:53:36 MET DST 1995