Traditionally, linguists and Natural Language Processing (NLP) researchers understood language corpora to consist of written material collected from already existing text sources, often available in published form (novels, stage and screen plays, newspapers, manuals, etc.). In this context the term ``spoken language text corpora'' was used to indicate that the data are not taken from existing texts, but that speech had to be written down in some orthographic or non-orthographic form in order to become part of a data collection. However, the differences (and relations) between text and speech data are far more complex. There are at least seven important differences which may not be ignored, because they determine relevant properties of the resulting data collections. For future (technological) developments of Spoken Language Processing (SLP) they should be taken very seriously.
These seven differences are discussed in turn below.
A closer look at these seven differences between written and spoken data will reveal why the traditional term ``Natural Language Processing'' (NLP) could also well be read as standing for ``Non-spoken Language Processing''. As it is our goal to call special attention to the relevant differences, we will refer to written language data as NL-data, meaning non-spoken language data, and set this term in opposition to SL-data, the acronym for spoken language data.
The first distinction may seem rather trivial, but it must nonetheless be mentioned here because it exerts an influence on specific properties of the collected NL- and SL-data. While text tends to stay on the paper on which it is written, speech is transient. It is the nature of the phonetic events which speakers create during speech acts that they disappear at the very moment they come into existence. Text, on the other hand, is always produced with the intention that it remain where the writer puts it, be it on paper, stone, or screen.
The first difference (which, in the step from speaking to writing, has helped our cultural development) explains why collecting SL-data is less trivial than producing NL-data: the former must be recorded on tape or disk to make them accessible for future use.
Another difference between NL- and SL-corpora is due to the fact that speech data are time functions in a sense in which text data are not. Whilst a writer may take as much time as he wants (or needs) to invest in producing a text, phonetic information is coded and transmitted through syllabically organised sound transitions. Speech must run in its own natural time, with a typical syllable rate between 120 and 180 syllables per minute. The time needed for writing new text is normally much longer than it takes to read it aloud (which does not alter the fact that silent reading and shorthand writing can be much faster than speaking the text).
In spontaneously spoken language the editing behaviour of the speaker is audible and remains part of the recorded data. Interruptions, hesitations, repetitions of words (and parts of words), and especially self-repairs are a characteristic feature of naturally spoken language and must be represented in SL-data collections of spontaneous speech. The writer, on the other hand, who has even more possibilities for correcting and editing a text document, will normally intend to produce a ``clean'' version. In the final version of the text all corrections that may have been carried out have disappeared; this is especially true for text intended to go into print. Thus it is extremely unusual to collect NL-data which contain all the corrected errors, replacements of words, etc. that occurred during the writing and editing of a given text document. In recording SL-data it has likewise not been unusual to produce similarly clean speech collections. A typical example is so-called laboratory speech, which is produced when a speaker sitting in a monitored recording room reads a list of prepared text material, and only the proper reproductions of the individual text items are accepted into the data base. Examples of speech corpora collected in this way are EUROM-0 and EUROM-1. Recently, however, interest has shifted towards corpora comprising ``real-world'' speech, including hesitations, corrections, background noise, etc.
In correctly written texts any lexical item has one and only one distinct orthographic form. Thus the words of European languages are easily identified and, with the exception of a few homographs, also well distinguished from each other, and there is usually exactly one version of each possible orthographic contextual form of any given word.
This is the source of the fourth important difference between NL- and SL-data, which has often been neglected because it is rather inconspicuous to naive speakers and listeners. The spoken versions of orthographically identical word forms show great phonetic variation in their segmental and prosodic realisation. In most European languages the phonetic form of a given word is in fact extremely variable, depending on the context and other well-defined intervening variables (such as speaking style and context of situation, strong and weak Lombard effects, etc.). A given word can even totally disappear phonetically, possibly reduced to --- and only signalled by --- some reflection of segmental features in the prosody of the utterance. Most of these inconspicuous variations appear in a narrow phonetic transcription of a given pronunciation.
It makes a great difference whether a word has been uttered in isolation or in a chain of connected speech. It is only in the first case that words show the phonetic form that is described in pronunciation dictionaries. Only if a word is consciously and very carefully produced in isolation do we observe the explicit version of its segmental structure. These phonetically explicit forms, produced in a careful speaking style, are called citation or canonical forms.
The segmental structure of such citation forms is modified (probably systematically, although very little of the system is understood) as soon as the word is integrated into connected speech. Since this fourth difference between NL- and SL-data is so relevant for the design of spoken language corpora, it has also been taken into account in the conventions of the IPA proposed for the Computer Representation of Individual Languages (CRIL).
In dealing with SL-data one must know which words the speaker intended to express in a given utterance. This is reflected in the CRIL-convention of the IPA, which is formally described in Appendix A. Here it should be mentioned that an SL-data collection should ideally have at least two different symbolically specified levels which are related to the acoustic speech signal. On the first level the words of the given utterance are identified as lexical units in their orthographic form; on the second level a broad phonetic transcription of the citation form should be given.
This second level may be the result of automatic grapheme-to-phoneme conversion; for large SL-corpora it would cost too much time and money to make broad phonetic transcriptions by hand. How the given words have actually been pronounced in a given speech signal must be specified in terms of a narrower phonetic transcription of each individual utterance on a third, optional CRIL level. This third level can then be directly aligned to the segments or acoustic features of the digital speech signal in the data base. This information is especially relevant if multi-sensor data are also to be incorporated in SL-data bases.
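The multi-level structure just described can be sketched as a simple data structure. The field names and the SAMPA-like phonetic symbols below are illustrative assumptions, not the actual CRIL notation; the point is that the optional third level carries the time alignment to the signal.

```python
# A minimal sketch of a multi-level utterance annotation in the spirit of
# the levels described above. Field names and phonetic symbols are
# illustrative assumptions, not the actual CRIL conventions.

utterance = {
    # Level 1: words identified as lexical units in orthographic form
    "orthographic": ["and", "then"],
    # Level 2: broad phonetic transcription of the citation forms,
    # e.g. produced by automatic grapheme-to-phoneme conversion
    "broad": ["{nd", "Den"],
    # Level 3 (optional): narrow transcription of the actual pronunciation,
    # aligned to the signal via (begin, end) sample indices
    "narrow": [("@n", 4800, 9600), ("Den", 9600, 16800)],
}

def aligned_segments(utt):
    """Pair each word with its narrow transcription and signal span."""
    return list(zip(utt["orthographic"], utt["narrow"]))

for word, (narrow, begin, end) in aligned_segments(utterance):
    print(word, narrow, begin, end)
```

The sample indices make it possible to cut the corresponding stretch of signal directly out of the digitised time function for any annotated word.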
It will be clear that
detailed phonetic
transcriptions are extremely expensive, to the extent that they are likely to
be prohibitive in large corpora. However, recent attempts using large
vocabulary speech recognisers for acoustic decoding of speech show some
promise
that the process can be automated, at least to the extent that
pronunciation variation can be predicted by means of general phonological and
phonetic rules.
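The kind of rule-based prediction mentioned here can be illustrated with a toy sketch: a canonical transcription is expanded into possible connected-speech variants by optionally applying phonological rules. Both the rule set and the SAMPA-like symbols are invented for illustration and do not reflect any established rule inventory.

```python
# Toy sketch of predicting pronunciation variants from a citation form by
# optional phonological rules. Rules and symbols are invented examples.

RULES = [
    ("st ", "s "),  # optional /t/ deletion in word-final /st/ clusters
    ("@", ""),      # optional schwa deletion
]

def expand(canonical):
    """Return all variants reachable by optionally applying each rule once."""
    variants = {canonical}
    for old, new in RULES:
        for form in list(variants):
            if old in form:
                variants.add(form.replace(old, new))
    return variants

# Canonical form of `last week' in SAMPA-like symbols (illustrative)
print(sorted(expand("l{st wi:k")))
```

A recogniser constrained to such predicted variants can then choose, for each utterance, the variant that best matches the acoustic signal, yielding a first approximation to a narrow transcription.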
In addition to phonetic detail on the segmental level, several uses of spoken
language
corpora may also require prosodic annotation. In this area much work
remains to be done to develop commonly agreed annotation systems. Once such
systems exist, one may attempt to support annotation by means of automatic
recognition
procedures.
Taken as pure data, any written text consists of strings of printable alphanumerical and other elements coded in 7- or 8-bit ASCII bytes. But the resulting NL-strings already possess a characteristic information structure which is not available in the case of primary SL-data. Separated by blanks, ASCII strings are grouped into lexical substrings, and the proper punctuation of phrases and sentences is a further important property of NL-data. Nothing of this type of information can be found in recordings of primary SL-data, since in natural speech there are no ASCII elements representing full stops, commas, colons, quotation marks, question marks, or exclamation marks. Recorded SL-data are primarily nothing but digitised time functions.
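This contrast can be made concrete with a small sketch: the blanks in an ASCII string yield lexical substrings for free, whereas a digitised speech signal is just a sequence of sample values (invented here) with no inherent boundaries or punctuation.

```python
# NL-data: ASCII bytes carry inherent structure (blanks, punctuation).
nl_data = "And then, he left."
words = nl_data.split()          # lexical substrings fall out immediately

# SL-data: nothing but a digitised time function. The 16-bit sample
# values below are invented; nothing in them marks words or punctuation.
sl_data = [12, -7, 153, 890, 412, -330, 95, -18]

print(words)
print(len(sl_data), "samples")
```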
Such recorded data are of no use unless at least the sequence of words expressed in the recorded speech signal is known. As long as reliable automatic recognition of unconstrained speech remains an unsolved problem, much additional work has to be invested in producing the required symbolic annotation of the speech data. This explains why the preparation of a useful SL-data collection takes so much effort and expense.
Comparing the sheer storage requirements of collected NL- and SL-data reveals a great numerical difference. There are two reasons why SL-data require orders of magnitude more storage space than written language corpora.
The first one is simply the difference in coding between text and speech. Whereas the ASCII string of a word like `and' needs only three bytes, many more bytes are required as soon as the phonemes of this word are transformed into an acoustic output and the AD-converted data are stored. If in the given example we assume that in clear speech the utterance of a three-phoneme syllable takes about half a second, and if we apply an amplitude quantisation of 16 bits and a non-stereo hi-fi sampling rate of 48 kHz, the NL/SL ratio amounts to approximately 1:16000.
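The 1:16000 estimate can be checked directly; the figures below simply restate the assumptions given in the text.

```python
# Storage ratio for the word `and' under the assumptions stated above.
text_bytes = 3                 # ASCII string "and"
duration_s = 0.5               # assumed duration of the clearly spoken syllable
sample_rate_hz = 48_000        # non-stereo hi-fi sampling rate
bytes_per_sample = 2           # 16-bit amplitude quantisation

speech_bytes = int(duration_s * sample_rate_hz * bytes_per_sample)
print(speech_bytes)                  # 48000 bytes for half a second of speech
print(speech_bytes // text_bytes)    # NL/SL ratio of 1:16000
```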
The second reason follows from the great
variability in the phonetic forms of
spoken words. As pointed out above under difference #4, any written text must be reproduced by many speakers and in more than one speaking style (at least at slow, normal, and fast speed, with low, normal, and high voice, etc.) if the corpus is intended to reflect all possible sources of variability.
It should be clear that all these severe modifications of the phonetic forms
of words in connected speech play no role in NL-data collections.
The last difference, and the most important one, must be looked at from two
different angles. The first thing to understand is that the relevant category of the data (which determines their collection) is already inherently given in the case of NL, but totally unknown in the case of physically recorded speech.
The ASCII symbols of a given text are elementary categories in themselves; they are directly used to form syntactically analysable expressions for the representation of all the different linguistically relevant categories. Thus relevant categorical information can be directly inferred from categorically given data and their ASCII representations. In contrast to this NL-situation,
the data of a digital speech signal do not signal any such categories, because they only represent a measured time function without any inherent categorical interpretation. At the present stage in the development of SLP it is not yet even possible to decide automatically whether a given digital signal is a speech signal or not. Therefore, the necessary categorical annotations for SL-data must still be produced by human workers (with the increasing support of semi-automatic procedures).
The second matter that must be considered in judging the different role of
categories and time functions in speech technology is that speech signals
contain far more relevant information than
can be expressed by the pure
text which is pronounced within a given utterance. As long as NLP can be restricted to non-spoken language processing, the restriction to NL-data does not pose severe problems. But as soon as real speech utterances are to be processed in an information technology application, the other categories --- non-linguistic, but communicatively extremely relevant --- cannot be ignored. They must be represented in future SL-data collections, and much effort still has to be invested by the international scientific community to deal with all these information-bearing aspects of any given speech utterance.