Traditionally, linguists and Natural Language Processing (NLP) researchers understood language corpora to consist of written material collected from already existing text sources, often available in published form (novels, stage and screen plays, newspapers, manuals, etc.). In this context the term ``spoken language text corpora'' was used to indicate that the data are not taken from existing texts, but that speech had to be written down in some orthographic or non-orthographic form in order to become part of a data collection. However, the differences (and relations) between text and speech data are far more complex. There are at least seven important differences which may not be ignored, because they determine relevant properties of the resulting data collections. For future (technological) developments of Spoken Language Processing (SLP) they should be taken very seriously.
These seven differences are discussed in turn below.
A closer look at these seven differences between written and spoken data will reveal why the traditional term ``Natural Language Processing'' (NLP) could also well be read as standing for ``Non-spoken Language Processing''. As it is our goal to call special attention to the relevant differences, we will refer to written language data as NL-data, meaning non-spoken language data, and set this term in opposition to SL-data, the acronym for spoken language data.
The first distinction may seem rather trivial, but it must nonetheless be mentioned here because it exerts an influence on specific properties of the collected NL- and SL-data. While text tends to stay on the paper on which it is written, speech is transient. It is the nature of the phonetic events which speakers create during speech acts that they disappear at the very moment they come into existence. Text, on the other hand, is always produced with the intention that it remain where the writer puts it, be it on paper, stone, or screen.
The first difference (which, in the step from speaking to writing, has helped our cultural development) explains why collecting SL-data is less trivial than producing NL-data: the former must be recorded on tape or disk to make them accessible for future use.
Another difference between NL- and SL-corpora is due to the fact that speech data are time functions in a sense in which text data are not. Whilst a writer may take as much time as he wants (or needs) to invest in producing a text, phonetic information is coded and transmitted through syllabically organised sound transitions. Speech must run in its own natural time, with a typical syllable rate between 120 and 180 syllables per minute. The time needed for writing new text is normally much longer than it takes to read it aloud (which does not alter the fact that silent reading and shorthand writing can be much faster than speaking the text).
In spontaneously spoken language the editing behaviour of the speaker is audible and remains part of the recorded data. Interruptions, hesitations, repetitions of words (and parts of words), and especially self-repairs are a characteristic feature of naturally spoken language and must be represented in SL-data collections of spontaneous speech. The writer, on the other hand, who has even more possibilities for correcting and editing a text document, will normally intend to produce a ``clean'' version. In the final version of the text all corrections that may have been carried out have disappeared; this is especially true for text intended to go into print. Thus it is extremely unusual to collect NL-data which contain all the corrected errors, replacements of words, etc. that occurred during the writing and editing of a given text document. In recording SL-data it has likewise not been unusual to produce similarly clean speech collections. A typical example is so-called laboratory speech, which is produced when a speaker sitting in a monitored recording room reads a list of prepared text material, and only the proper reproductions of the individual text items are accepted into the data base. Examples of speech corpora collected in this way are EUROM-0 and EUROM-1. Recently, however, interest has shifted towards corpora comprising ``real-world'' speech, including hesitations, corrections, background noise, etc.
In correctly written texts any lexical item has one and only one distinct orthographic form. Thus the words of European languages are easily identified and, with the exception of a few homographs, also well distinguished from each other, and there is usually exactly one version of each possible orthographic contextual form of any given word.
This is the source of the fourth important difference between NL- and SL-data, which has often been neglected because it is rather inconspicuous to naive speakers and listeners. The spoken versions of orthographically identical word forms show great phonetic variation in their segmental and prosodic realisation. In most European languages the phonetic form of a given word is in fact extremely variable, depending on the context and other well-defined intervening variables (such as speaking style and context of situation, strong and weak Lombard effects, etc.). A given word can even totally disappear phonetically, possibly reduced to --- and only signalled by --- some reflection of segmental features in the prosody of the utterance. Most of these inconspicuous variations appear in a narrow phonetic transcription of a given pronunciation.
It makes a great difference whether a word has been uttered in isolation or in a chain of connected speech. It is only in the first case that words show the phonetic form that is described in pronunciation dictionaries. Only if a word is consciously and very carefully produced in isolation do we observe the explicit version of its segmental structure. These phonetically explicit forms, produced in a careful speaking style, are called citation or canonical forms.
The segmental structure of such citation forms is modified (probably systematically, although very little of the system is understood) as soon as the word is integrated into connected speech. Since this fourth difference between NL- and SL-data is so relevant for the design of spoken language corpora, it has also been taken into account in the conventions of the IPA proposed for the Computer Representation of Individual Languages (CRIL).
In dealing with SL-data one must know which words the speaker intended to express in a given utterance. This is reflected in the CRIL-convention of the IPA, which is formally described in Appendix A. Here it should be mentioned that an SL-data collection should ideally have at least two different symbolically specified levels which are related to the acoustic speech signal. On the first level the words of the given utterance are identified as lexical units in their orthographic form; on the second level a broad phonetic transcription of the citation form should be given.
This second level may be the result of automatic grapheme-to-phoneme conversion; for large SL-corpora it would cost too much time and money to make broad phonetic transcriptions by hand. How the given words have actually been pronounced in a given speech signal must be specified in terms of a narrower phonetic transcription of each individual utterance on a third, optional CRIL level. This third level can then be directly aligned to the segments or acoustic features of the digital speech signal in the data base. This information is especially relevant if multi-sensor data are also to be incorporated in SL-data bases.
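The multi-level structure just described can be sketched as a simple data structure. The field names and the SAMPA-like phonetic symbols below are illustrative assumptions, not the actual CRIL notation; the point is that the optional third level carries the time alignment to the signal.

```python
# A minimal sketch of a multi-level utterance annotation in the spirit of
# the levels described above. Field names and phonetic symbols are
# illustrative assumptions, not the actual CRIL conventions.

utterance = {
    # Level 1: words identified as lexical units in orthographic form
    "orthographic": ["and", "then"],
    # Level 2: broad phonetic transcription of the citation forms,
    # e.g. produced by automatic grapheme-to-phoneme conversion
    "broad": ["{nd", "Den"],
    # Level 3 (optional): narrow transcription of the actual pronunciation,
    # aligned to the signal via (begin, end) sample indices
    "narrow": [("@n", 4800, 9600), ("Den", 9600, 16800)],
}

def aligned_segments(utt):
    """Pair each word with its narrow transcription and signal span."""
    return list(zip(utt["orthographic"], utt["narrow"]))

for word, (narrow, begin, end) in aligned_segments(utterance):
    print(word, narrow, begin, end)
```

The sample indices make it possible to cut the corresponding stretch of signal directly out of the digitised time function for any annotated word.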
It will be clear that
detailed phonetic
transcriptions are extremely expensive, to the extent that they are likely to
be prohibitive in large corpora. However, recent attempts using large
vocabulary speech recognisers for acoustic decoding of speech show some
promise
that the process can be automated, at least to the extent that
pronunciation variation can be predicted by means of general phonological and
phonetic rules.
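The kind of rule-based prediction mentioned here can be illustrated with a toy sketch: a canonical transcription is expanded into possible connected-speech variants by optionally applying phonological rules. Both the rule set and the SAMPA-like symbols are invented for illustration and do not reflect any established rule inventory.

```python
# Toy sketch of predicting pronunciation variants from a citation form by
# optional phonological rules. Rules and symbols are invented examples.

RULES = [
    ("st ", "s "),  # optional /t/ deletion in word-final /st/ clusters
    ("@", ""),      # optional schwa deletion
]

def expand(canonical):
    """Return all variants reachable by optionally applying each rule once."""
    variants = {canonical}
    for old, new in RULES:
        for form in list(variants):
            if old in form:
                variants.add(form.replace(old, new))
    return variants

# Canonical form of `last week' in SAMPA-like symbols (illustrative)
print(sorted(expand("l{st wi:k")))
```

A recogniser constrained to such predicted variants can then choose, for each utterance, the variant that best matches the acoustic signal, yielding a first approximation to a narrow transcription.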
In addition to phonetic detail on the segmental level, several uses of spoken
language
corpora may also require prosodic annotation. In this area much work
remains to be done to develop commonly agreed annotation systems. Once such
systems exist, one may attempt to support annotation by means of automatic
recognition
procedures.
Taken as pure data, any written text consists of strings of printable alphanumerical and other elements coded in 7- or 8-bit ASCII bytes. But the resulting NL-strings already possess a characteristic information structure which is not available in the case of primary SL-data. Separated by blanks, ASCII strings are grouped into lexical substrings, and the proper punctuation of phrases and sentences is a further important property of NL-data. Nothing of this type of information can be found in recordings of primary SL-data, since in natural speech there are no ASCII elements representing full stops, commas, colons, quotation marks, question marks, or exclamation marks. Recorded SL-data are primarily nothing but digitised time functions.
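This contrast can be made concrete with a small sketch: the blanks in an ASCII string yield lexical substrings for free, whereas a digitised speech signal is just a sequence of sample values (invented here) with no inherent boundaries or punctuation.

```python
# NL-data: ASCII bytes carry inherent structure (blanks, punctuation).
nl_data = "And then, he left."
words = nl_data.split()          # lexical substrings fall out immediately

# SL-data: nothing but a digitised time function. The 16-bit sample
# values below are invented; nothing in them marks words or punctuation.
sl_data = [12, -7, 153, 890, 412, -330, 95, -18]

print(words)
print(len(sl_data), "samples")
```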
Such recorded data are of no use unless at least the sequence of words expressed in the recorded speech signal is known. As long as reliable automatic recognition of unconstrained speech remains an unsolved problem, much additional work has to be invested in producing the required symbolic annotation of the speech data. This explains why the preparation of a useful SL-data collection takes so much effort and expense.
Comparing the sheer storage requirements of collected NL- and SL-data reveals a great numerical difference. There are two reasons why SL-data require orders of magnitude more storage space than written language corpora.
The first one is simply the difference in coding between text and speech. Whereas the ASCII string of a word like `and' needs only three bytes, many more bytes are required as soon as the phonemes of this word are transformed into an acoustic output and the AD-converted data are stored. If in the given example we assume that in clear speech the utterance of a three-phoneme syllable takes about half a second, and if we apply an amplitude quantisation of 16 bits and a non-stereo hi-fi sampling rate of 48 kHz, the NL/SL ratio amounts to approximately 1:16000.
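The 1:16000 estimate can be checked directly; the figures below simply restate the assumptions given in the text.

```python
# Storage ratio for the word `and' under the assumptions stated above.
text_bytes = 3                 # ASCII string "and"
duration_s = 0.5               # assumed duration of the clearly spoken syllable
sample_rate_hz = 48_000        # non-stereo hi-fi sampling rate
bytes_per_sample = 2           # 16-bit amplitude quantisation

speech_bytes = int(duration_s * sample_rate_hz * bytes_per_sample)
print(speech_bytes)                  # 48000 bytes for half a second of speech
print(speech_bytes // text_bytes)    # NL/SL ratio of 1:16000
```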
The second reason follows from the great
variability in the phonetic forms of
spoken words. As pointed out above under difference #4, any written text must be reproduced by many speakers and in more than one speaking style (at least at slow, normal, and fast speed, with low, normal, and high voice, etc.) if the corpus is intended to reflect all possible sources of variability.
It should be clear that all these severe modifications of the phonetic forms
of words in connected speech play no role in NL-data collections.
The last difference, and the most important one, must be looked at from two
different angles. The first thing to understand is that the relevant category of the data (which determines their collection) is already inherently given in the case of NL, but totally unknown in the case of physically recorded speech.
The ASCII symbols of a given text are elementary categories in themselves; they are directly used to form syntactically analysable expressions for the representation of all the different linguistically relevant categories. Thus relevant categorical information can be directly inferred from categorically given data and their ASCII representations. In contrast to this NL-situation,
the data of a digital speech signal do not signal any such categories, because they only represent a measured time function without any inherent categorical interpretation. At the present stage in the development of SLP it is not yet even possible to decide automatically whether a given digital signal is a speech signal or not. Therefore, the necessary categorical annotations for SL-data must still be produced by human workers (with the increasing support of semi-automatic procedures).
The second matter that must be considered in judging the different role of
categories and time functions in speech technology is that speech signals
contain far more relevant information than
can be expressed by the pure
text which is pronounced within a given utterance. As long as NLP can be restricted to non-spoken language processing, the restriction to NL-data does not pose severe problems. But as soon as real speech utterances are to be processed in an information technology application, the other categories --- non-linguistic, but communicatively extremely relevant --- cannot be ignored. They must be represented in future SL-data collections, and much effort still has to be invested by the international scientific community to deal with all these information-bearing aspects of any given speech utterance.