Language models are a major area of research and development, and crucial for the success of a speech recognition system. The chapter on language modelling should also be consulted on this topic.
In speech recognition systems, the mapping of the digitised acoustic forms of words on to symbolic representations for use as lexical lookup keys is performed by pattern recognition in primarily statistical approaches, and by phonemic rule systems in primarily knowledge-based approaches. The word models used for matching with the acoustic analysis of the speech signal are accordingly very different. The literature can be consulted for details of current practice (cf. Waibel & Lee 1990).
In written language processing, a comparable task is Optical Character Recognition (OCR), and in particular, handwriting recognition; there is no comparable task in conventional Natural Language Processing or Computational Linguistics, where letters are uniquely identified by digital codes, and dictionary access may be trivially optimised by encoding letter sequences into tries (letter trees, letter-based decision trees). However, in linguistic terms, in each case the task is the identification of word forms as lexical keys.
Current statistically oriented technology is based on dynamic programming, on Hidden Markov Models, or on Neural Networks. At the present time, Hidden Markov Models are fairly generally considered to provide the highest recognition accuracy rates. For the restricted tasks which dominate much previous and current work on Spoken Language Input Systems, HMM technology is consequently recommended.
Statistical language models are used to reduce the search space for the correct hypotheses in the word lattice output of a speech recogniser (cf. Oerder & Ney 1993). They provide top-down (a priori) information on the distribution of words relative to immediately preceding (strings of) words, and may be n-gram models (sequences of n words, including the target word), stochastic grammars, or probabilistic networks and automata. Since statistical language models are based on examination of a finite corpus, they also presuppose a finite vocabulary. Hence, the statistical distribution properties expressed by a statistical language model constitute a form of probabilistically weighted syntactic information about lexical items and may be considered a highly specialised form of spoken language lexicon. Statistical language models for spoken language are corpus-specific, and it is hard to consider them as general resources in view of the fairly small size of the corpora. For this reason, they are best defined in terms of the standard principles on which they are based, with references to the basic components of a language model:
The simplest n-gram models are based on the following lexicographic concepts.
The definition of type is not trivial, and in general it is not made very explicit. In practice, this means that the frequencies of words in a given training corpus need to be tabulated; these frequencies are divided by the corpus size (in word tokens) for conversion into probability estimates.
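The tabulation step just described can be sketched as follows; the toy corpus and variable names are purely illustrative:

```python
# Sketch: converting raw word-token counts in a training corpus into
# relative-frequency probability estimates (maximum-likelihood estimates).
from collections import Counter

corpus = "the cat sat on the mat the cat slept".split()

counts = Counter(corpus)      # token frequency per type
n_tokens = len(corpus)        # corpus size in word tokens

# P(w) = count(w) / N
prob = {w: c / n_tokens for w, c in counts.items()}

print(prob["the"])  # 3 occurrences out of 9 tokens -> 1/3
```

Note that the word types here are orthographic forms, with the caveats about homographs discussed below.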
Often an orthographic form is taken to represent a type, even though this may introduce loss of information through orthographic noise, introducing artefactual ambiguities in the case of homographs. A better procedure, though more cumbersome in practice, is to treat a lexical entry for the purposes of a word list as a pair of orthographic and pronunciation types.
The ranking of hypotheses in speech recognition (indirectly: a form of disambiguation of the speech input) calls for the calculation of both a posteriori (currently observed) and a priori (previously observed and generalised) probabilities in accordance with Bayes' Law:
The probability of a word, given a set of observations, is the probability of the observations, given the word, multiplied by the a priori probability of the word, and divided by the probability of the observations. Since the last factor is the same for all word hypotheses for a given observation, system and scenario, it can be factored out:
P(W|O) = P(O|W) * P(W) / P(O)
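The ranking procedure can be illustrated with a minimal sketch; the candidate words, likelihoods and priors below are invented numbers, not the output of any real recogniser:

```python
# Sketch: ranking word hypotheses by P(O|W) * P(W), with the common
# denominator P(O) factored out. All probabilities are invented.
likelihood = {"two": 0.6, "too": 0.5, "to": 0.4}   # P(O|W), acoustic match
prior      = {"two": 0.1, "too": 0.2, "to": 0.7}   # P(W), language model

scores = {w: likelihood[w] * prior[w] for w in likelihood}
best = max(scores, key=scores.get)
print(best)  # "to": 0.4 * 0.7 = 0.28 beats 0.06 and 0.10
```

The example shows how a strong a priori estimate from the language model can overrule a slightly better acoustic match.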
More specific descriptions are given in the literature, such as Bahl et al. (1989).
In order to evaluate results, analysis of a corpus is required. Common practice in development is to partition a corpus into 90% training data and 10% test data, and to run tests on, for instance, 10 different partitionings in order to ensure homogeneity. On the question of evaluation, the results of the EAGLES Working Group on Evaluation should be consulted.
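The 90%/10% partitioning scheme can be sketched as a rotation of the held-out tenth across 10 runs; the function name and toy corpus are illustrative:

```python
# Sketch: partitioning a corpus into 90% training and 10% test data,
# rotating the held-out tenth across k = 10 runs.
def partitions(corpus, k=10):
    fold = len(corpus) // k
    for i in range(k):
        test = corpus[i * fold:(i + 1) * fold]
        train = corpus[:i * fold] + corpus[(i + 1) * fold:]
        yield train, test

corpus = [f"w{j}" for j in range(100)]   # stand-in for real corpus tokens
for train, test in partitions(corpus):
    assert len(train) == 90 and len(test) == 10
```

Averaging evaluation results over the 10 partitionings gives an indication of how homogeneous the corpus is.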
More generally, statistical language models are used in order to enhance the a priori predictions and thus enhance the hypothesis selection process. Statistical language models are also in a sense word lists, but lists in which the types are themselves paired with the contexts in which the tokens occur and with the frequencies of these relative occurrences. If the context is one preceding word, the result is a digram table (bigram table). This technique can be used with any size of linguistic unit, for example with phonemes, from which diphone or diphoneme tables are constructed. If the context is more than one preceding word, the result is a trigram (triphone etc.) table. In the general case, a distribution table of this kind is an n-gram, n-phone etc. table, where n is the size of the context including the target unit itself.
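A digram table of the kind just described can be built with a few lines of code; the toy token sequence is illustrative:

```python
# Sketch: building a digram (bigram) table from a token sequence. Each type
# is paired with its following words and their relative frequencies.
from collections import Counter, defaultdict

tokens = "the cat sat on the mat".split()

bigrams = defaultdict(Counter)
for w1, w2 in zip(tokens, tokens[1:]):
    bigrams[w1][w2] += 1

def p(w2, w1):
    """Conditional probability estimate P(w2 | w1)."""
    total = sum(bigrams[w1].values())
    return bigrams[w1][w2] / total if total else 0.0

print(p("cat", "the"))  # 0.5: "the" is followed once by "cat", once by "mat"
```

The same loop, run over phoneme strings instead of word strings, yields a diphone table.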
A basic constraint on frequency information and probability estimates in the lexicon is the inverse relation between size of unit and number of occurrences: clearly, smaller units such as phonemes will occur more frequently in a corpus than words, words will occur more frequently than sentences, digrams will occur more frequently than trigrams, and so on. The larger the unit, the greater becomes the sparse data problem, i.e. the problem of having too few occurrences (in particular, zero occurrences) of a given unit to support statistically reliable estimates.
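One standard remedy for zero counts, not specific to the systems discussed here, is add-one (Laplace) smoothing; the vocabulary and counts below are invented:

```python
# Sketch: add-one (Laplace) smoothing for sparse digram counts, so that
# unseen digrams receive a small non-zero probability. Toy data throughout.
from collections import Counter

vocab = {"the", "cat", "mat", "sat", "on"}
counts = Counter({("the", "cat"): 1, ("the", "mat"): 1})   # observed digrams
unigram = Counter({"the": 2, "cat": 1, "mat": 1, "sat": 1, "on": 1})

def p_smoothed(w2, w1):
    # (count(w1, w2) + 1) / (count(w1) + |V|)
    return (counts[(w1, w2)] + 1) / (unigram[w1] + len(vocab))

print(p_smoothed("cat", "the"))  # seen digram:   (1 + 1) / (2 + 5) = 2/7
print(p_smoothed("sat", "the"))  # unseen digram: (0 + 1) / (2 + 5) = 1/7
```

More sophisticated discounting and back-off schemes exist; this is merely the simplest illustration of redistributing probability mass to unseen events.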
The basic item of syntactic lexical information required in spoken language systems is the part of speech (POS, see below). The information is required in order to enhance language models in the case of sparse data; classes of items judged to be similar in distribution occur more frequently than their members; thus, di-class (in the general case, n-class) statistics hold promise of better statistical generalisations than word-based digram statistics. Experiments with semantic classes are also being made (cf. Yarowsky 1991). For further information, standard texts such as Waibel & Lee (1990), Sagerer (1990), or Haton et al. (1991) should be consulted.
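The gain from di-class statistics can be sketched as follows; the POS tags, lexicon and corpus are a toy illustration, not drawn from any real tagset:

```python
# Sketch: di-class (POS-class digram) statistics. Class counts aggregate over
# all member words, so they are denser than word digram counts.
from collections import Counter

pos = {"the": "ART", "a": "ART", "cat": "N", "dog": "N", "sleeps": "V"}
tagged = [pos[w] for w in "the cat sleeps a dog sleeps".split()]

class_bigrams = Counter(zip(tagged, tagged[1:]))
class_unigrams = Counter(tagged)

def p_class(c2, c1):
    """Conditional class probability estimate P(c2 | c1)."""
    return class_bigrams[(c1, c2)] / class_unigrams[c1]

# Every article in the toy corpus is followed by a noun, even though no
# single word digram (e.g. "a cat") occurs more than once.
print(p_class("N", "ART"))  # 1.0
```

The word digram ("a", "cat") is unseen here, yet the class digram (ART, N) is well attested: this is the generalisation that n-class statistics buy.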
In respect of general recommendations, the chapter on language modelling should be consulted. It is hard to make a general recommendation for future developments from a lexicographic point of view, in view of much current basic research work on the equivalence of HMMs with certain types of Neural Networks and probabilistically weighted knowledge-based systems based on finite state automata (cf. Kornai 1991; Carson-Berndsen 1993).
Syntactic information is required not only for parsing into syntactic structures for further semantic processing in a speech understanding system, but also in order to control the assignment of prosodic information to sentences in prosodic parsing and prosodic synthesis.
Syntactic information is defined as information about the distribution of a word in syntactic structures. This is a very common, indeed `classical', but specialised use of the words `syntax' and `syntactic' to pertain to phrasal syntax, i.e. the structure of sentences. Other, more general uses of the terms for linguistic units which are larger or smaller than sentences are increasingly encountered, such as `dialogue syntax' and `word syntax' (for morphotactics within morphology).
Within this classical usage, the term syntax is sometimes opposed to the term lexicon; the term grammar is sometimes used to mean syntax, but sometimes includes both phrasal syntax and the lexicon.
Strictly speaking, a statistical language model is a form of sentence syntax defined for a finite language, since it also defines the distribution of words in syntactic structures; however, the notion of syntactic structure here is very elementary, consisting of a short substring or window within a string of words. It is also used with quite a different function from the classical combination of sentence syntax and sentence parser.
Sentence syntax defines the structure of a (generally unlimited) set of sentences. Syntactic lexical information is traditionally divided into information about paradigmatic (classificatory; disjunctive; element-class, subclass-superclass) and syntagmatic (compositional; conjunctive; part-whole) relations. The informal definitions of these terms in linguistics textbooks are often unclear, metaphorical and inconsistent. For instance, temporally parallel information about the constitution of phonemes in terms of distinctive features is sometimes regarded as paradigmatic (since features may be seen as intensional characterisations of a class of phonemes) and sometimes as syntagmatic (since the phonetic events corresponding to features occur together to constitute a phoneme as a larger whole). The relation here is analogous to the relation between intonation and sentences, which are also temporally parallel, and in fact treated in an identical fashion in contemporary computational phonology. From a formal point of view, this is purely a matter of perspective: the internal structure of a unit (syntagmatic relations between parts of the unit) may be seen as a property of the unit (paradigmatic relation of similarity between the whole unit and other units). In lexical knowledge bases for spoken language systems it is crucial to keep questions of syntagmatic distribution and questions of paradigmatic similarity apart as two distinct and complementary aspects of structure.
The part of speech (POS, word class, or category) is the most elementary type of syntactic information. One traditional set of word classes consists of the following: Noun or Substantive, Pronoun, Verb, Adverb, Adjective, Article, Preposition, Conjunction, Interjection. Two main groups are generally identified:
The granularity of classification can be reduced by grouping classes together in this way (this particular binary division is relevant for defining stress patterns, for example), or increased by defining subclasses, depending on particular requirements. For further information, introductory texts on syntax, e.g. Sells (1985) or Radford (1988), may be consulted.