Lexical knowledge acquisition for spoken language

The most general declarative perspective on a spoken language lexicon, and the one required for lexicon acquisition, is that of the `omniscient lexicographer': the lexicographer `knows', in principle, all the possible categories and relations which may hold between a sign, its meaning and pronunciation, and other signs; all properties of a sign have equal status in terms of possible access. This idealised view, while useful for general lexical databases, is not appropriate for the construction of more specialised spoken language lexica for the classical types of spoken language system, unless these are derived as sublexica from a more general lexical database.

Stages in lexical knowledge acquisition

The first stage in lexical acquisition is to define the following items:

An example of one particular type of application is the construction of the word-oriented information required for word recognition. With respect to a stochastic speech recogniser, and the three lexical information types represented by a pronunciation table, a frequency list and a digram language model, the following steps are required (the first two of which are dealt with in more detail in the chapter on Spoken Language Corpora).

  1. Pre-recording phase: Define the situational, linguistic, and physical features of the application scenario.
  2. Recording phase: Record, post-process, store and structure the digitally recorded speech.
  3. Post-recording phase:
    1. Transliteration. Make an orthographic transliteration of the speech signal; for traditional data types of utterances read aloud in the laboratory, the transliteration is already given. For spontaneous speech, the transliteration must be made according to carefully defined orthographic criteria.
    2. Frequency list. Reduce the orthographic token representations of words to an ordered set, with one token per word (in the process, make a frequency count for each word).
    3. Digram language model. Make a list of adjacent pairs (digrams) of orthographic token representations and reduce the list to a set, with one token pair per digram (in the process, make a frequency count for each digram), and convert to an optimal data structure such as a decision tree.
    4. Pronunciation table. Using a pronunciation dictionary or grapheme-phoneme conversion rules (statistical or knowledge-based or both), construct a table of pronunciations for the orthographic token representations.
    5. Label alignment. The pronunciation table may then be used for forced alignment (semi-automatic label alignment, SALA) of whole words, syllables, phonemes, etc. with the speech signal, for the purpose of training a stochastic speech recogniser. The word and digram frequency information can be used to weight recognition hypotheses in terms of Bayes' Law, as discussed in connection with language models and word models; a schematic formulation is given below.
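
In standard notation, and purely as a schematic illustration, this weighting amounts to the Bayes decision rule, with the digram language model of step 3 supplying the prior probability of a word sequence:

  \hat{W} = \arg\max_{W} P(A \mid W)\, P(W), \qquad P(W) \approx \prod_{i} P(w_i \mid w_{i-1})

where A is the acoustic evidence, W = w_1 ... w_n a hypothesised word sequence, P(A | W) the score assigned by the word models, and the digram probabilities P(w_i | w_{i-1}) are estimated from the word and digram frequency counts.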

Types of knowledge source

The types of lexical knowledge source for a spoken language system depend largely on the application. There are few general sources of lexical spoken language material (for instance with pronunciation and general frequency information) for any language. The construction of such a source is a major task which requires concerted action on a large scale by specialists from the whole language engineering community. It is a formidable task for many theoretical and practical reasons, but nevertheless one which will require a great deal of effort in the coming years. The major sources of lexical knowledge for spoken language systems are:

  1. Existing dictionaries (to some extent).
  2. Application specific corpora (to a large extent).
  3. Results of descriptive, theoretical and computational linguistics (to some extent).

There is a definite lack of general resources in the area (cf. the introduction to this chapter), and the construction of application-derived, generalisable resources will be a major task for any project and for the entire spoken language community in the coming years.

General lexical material is required for the lexical knowledge in general coverage text-to-speech systems, as well as for broad application pronunciation tables for speech recognition.

Dictionaries

Useful sources of information are generally available dictionaries, particularly pronouncing dictionaries, provided that they adhere to accepted standards of consistency and expressiveness of notation, and are available in electronic form. An overview of some sources was given at the beginning of this chapter, and reference should be made to the results of the EAGLES Working Group on Computational Lexica for further examples.

Corpora

Spoken language lexica are usually application specific; they are necessarily application specific when corpus-derived frequency information is needed. An example of a corpus-derived lexicon type for speech recognition was given above. Another type of corpus-derived lexicon is the diphone word list widely used in speech synthesis technology; for this, phoneme label alignment with the speech signal is required, with the aid of which diphones are defined in the signal for further processing. The chapter on Spoken Language Corpora contains detailed information on procedures of corpus treatment, and the results of the EAGLES Working Group on Text Corpora should also be consulted.
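
As a minimal illustration, once phoneme label alignment is available in the form of a label file, the diphone inventory actually covered by a corpus can be listed by pairing adjacent phoneme labels; the file name and format used below are assumptions made for the sketch, not a prescribed standard.

  # Sketch only: assumes a label file labels.txt with one line per phoneme
  # segment, in the form "start end phoneme"; adjacent phoneme pairs give the
  # diphones covered by the corpus (the cut points in the signal itself are
  # normally placed within the phones, e.g. near their midpoints).
  awk 'NR > 1 { print prev "-" $3 } { prev = $3 }' labels.txt | sort -u > diphones.txt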

Acquisition tools

At the current state of the art, there are few generally available tools for constructing spoken language lexica, either by extraction from existing dictionaries or from corpora. Lexicon construction usually takes place `in house' in individual laboratories or project consortia; lexicon formats consequently vary greatly.

For information on general acquisition tools in the sense of lexicographers' work benches, reference should be made to the results of the EAGLES Working Group on Computational Lexica.

Of greatest practical use for the development of spoken language lexica in the area of word forms are the tools required for creating different kinds of word form list and word form table from corpora; the general parameters associated with acquiring syntactic, semantic and pragmatic information are not unique to spoken language lexica (though the details, for instance in spoken language dialogue, differ greatly between spoken and written language).

Standard practice is either to write custom-made programs in C or to use standard UNIX tools (where speed of processing is not at a premium). Neither approach is particularly difficult, since the procedures and associated algorithms involved are relatively straightforward and well understood.

The simplest approach for many applications where processing time is not critical, for instance with small lexica, or where batch-style processing is acceptable, is to use UNIX tools such as grep, tr, sed, uniq, cut, tail, spell and awk. For descriptions of these tools, a UNIX manual or textbook, or the on-line man pages on a UNIX system, should be consulted; techniques for specific database-oriented UNIX tools are described by Aho, Kernighan & Weinberger (1987), Dougherty (1990), and Wall & Schwartz (1991).

An example of database formatting was given above. Simple examples of UNIX tool applications are illustrated in simplified form below in order to convey an idea of the sort of corpus pre-processing required for spoken language lexicon acquisition.
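
The following sketch uses an orthographic transliteration file translit.txt (one utterance per line, word forms separated by white space) and a pronouncing dictionary prondict.txt (a sorted file with lines of the form "wordform transcription", word forms in lower case); these file names and formats are assumptions made for the illustration, not prescribed standards.

  # Tokenisation: one lower-cased word form per line
  awk '{ for (i = 1; i <= NF; i++) print $i }' translit.txt \
    | tr 'A-Z' 'a-z' > tokens.txt

  # Frequency list: one line per word form, preceded by its corpus frequency
  sort tokens.txt | uniq -c | sort -rn > wordfreq.txt

  # Digram list: pair each token with its successor (ignoring utterance
  # boundaries, for simplicity) and count the distinct pairs
  tail -n +2 tokens.txt | paste tokens.txt - | awk 'NF == 2' \
    | sort | uniq -c | sort -rn > digramfreq.txt

  # Pronunciation table: look up the distinct word forms in the pronouncing
  # dictionary; unmatched word forms are candidates for grapheme-phoneme conversion
  sort -u tokens.txt > wordlist.txt
  join wordlist.txt prondict.txt > prontable.txt
  join -v 1 wordlist.txt prondict.txt > unknown.txt

Pipelines of this kind are adequate for small lexica and batch-style processing; for larger corpora or interactive use, equivalent custom programs in C are the usual alternative, as noted above.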

Recommendations on lexicon construction

  1. Identify and acquire the data resources needed for construction of the lexical database and, in the second instance, the system lexicon.
  2. Construct a basic wordlist from the available data and decide on lexical coverage for the lexical database and the system lexicon.
  3. Design the lexicon database to be acquired so as to maximise the information available for the development of the system lexicon.
  4. Consider establishing a UNIX tool library for convenient database construction, transformation and transfer, in addition to using standard database software.
  5. Results of the EAGLES Working Group on Computational Lexica should be consulted for the design of lexicon models and lexical resources for written language; some of the techniques are also applicable to spoken language lexica and lexical databases.
  6. Do not underestimate the time and effort needed for constructing and validating software for lexicon construction and lexical lingware: lexicon construction is labour intensive.

In this area, there is an urgent need for the development of standardised lexicon construction tools and the provision of lexical resources.


