
Transcription of spoken language corpora

Transcriptions are used in many fields of linguistics, including phonetics, phonology, dialectology, sociolinguistics, psycholinguistics, second language teaching, and speech pathology. Transcriptions are also used in disciplines like psychology, anthropology, and sociology. The type of transcription very much depends on the aim for which the transcription is made. This aim especially determines the degree of detail that is required. For example, if a corpus has been designed to investigate the amount of time several speakers are speaking at the same time in a dialogue, a very global transcription will be sufficient. If a corpus has been collected to initialise the training of a phoneme-based speech recognition system, one might want to have a very precise, segmental transcription. The precision of transcription can vary between these two extremes. Detailed, segmental transcriptions of large scale spoken language corpora with many speakers and much (spontaneous) speech cannot easily be achieved: this would be too time-consuming and too expensive. Therefore, most large speech corpora are provided with verbatim transcriptions, i.e. word level, orthographic representations of what has been said; see e.g. the ATIS and Switchboard corpora. However, a medium sized corpus of read speech can be provided with a segmental transcription and even with labelling at the segmental level; see e.g. the TIMIT corpus, which consists of 630 speakers each reading 10 sentences, and also the Phondat (1990 and 1992, both read speech) and Verbmobil (1993, spontaneous speech) corpora.
An important point that should be made here is that all types of transcriptions are the result of an analysis or classification of the speech. Transcriptions are not the speech itself, but an abstraction from it. Yet transcriptions are often used as if they were the speech itself. It is evident that this can only be done if transcriptions are made very consistently and precisely. In a later section, the levels of transcription are presented, starting from global orthographic transcriptions and ending with detailed phonetic transcriptions. For all types of transcriptions, we will emphasise their use in spoken language corpora meant for linguistic research as well as speech technology. In the next section some remarks will be made about transcriptions of read versus transcriptions of spontaneous speech.

The transcription of read vs. the transcription of spontaneous speech

The transcription of read speech starts from the written texts, which makes it somewhat easier to perform than the transcription of spontaneous speech, where everything needs to be written down from scratch. In the case of read speech, planning and word seeking processes are not involved. These processes of spontaneous speech have a great impact on the speech that is produced. We all know that spontaneous speech is not perfectly fluent: speakers produce many filled pauses, mispronunciations, false starts, and repetitions. In addition, depending on the formality of the setting, speakers will use colloquial speech and sloppy pronunciations. These properties of spontaneous speech mean that all types of transcription, global as well as precise ones, are more difficult to perform for spontaneous speech than for read speech. In the case of read speech, the written texts will induce fewer disfluencies and fewer sloppy pronunciations.

Another important distinction between read and spontaneous speech in relation to transcriptions is that for read speech it is clear what an utterance is: the written sentence, mostly starting with a capital letter and ending with a period, question mark, or exclamation mark. For spontaneous speech this is not immediately clear. Depending on the type of spontaneous speech, one has to define the criteria for utterances. For dialogues and other forms of conversation in which more than one speaker is involved, it is usual to let utterances more or less correspond with speaker turns (see the Guidelines issued by the Text Encoding Initiative (TEI) (Sperberg-McQueen & Burnard 1994) and Switchboard). For monologues, utterances can be defined as stretches of speech mostly preceded and followed by a pause and having a more or less consistent syntactic, semantic, pragmatic, and prosodic structure (see the criteria developed by the Network of European Reference Corpora (NERC) (French 1991; French 1992) and the Dutch Speech Styles Corpus (Den Os 1994)).

Transcription of dialogues

When two (or more) persons are conversing together, interruptions frequently occur. This is true for informal conversations between friends, for formal information requests, for face-to-face situations, and for telephone conversations. These interruptions may be complete utterances, but may also be, for example, an affirmative ``yes'' or ``mm''. These interruptions, which result in simultaneous speech, must be annotated in the transcriptions. If a dialogue between two persons is concerned, it is possible to indicate simultaneous speech in a clear way. For example, Switchboard uses ``#'', one before and one after each of the segments spoken at the same time, to indicate that the two speakers of the telephone conversation speak at the same time, e.g.
A: # Right, bye #
B: # Bye bye #
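
As an illustration, a minimal Python sketch (hypothetical, not part of any Switchboard tooling) that extracts the overlapping stretches from lines marked in this way might look as follows:

import re

# Minimal sketch: extract the stretches marked as simultaneous speech in a
# Switchboard-style line, where "#" is placed before and after each
# overlapping segment.
OVERLAP = re.compile(r"#\s*(.*?)\s*#")

def simultaneous_segments(line):
    """Return (speaker, overlapping text) pairs for one transcription line."""
    speaker, _, text = line.partition(":")
    return [(speaker.strip(), segment) for segment in OVERLAP.findall(text)]

for line in ["A: # Right, bye #", "B: # Bye bye #"]:
    print(simultaneous_segments(line))
# [('A', 'Right, bye')]
# [('B', 'Bye bye')]
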
However, when more than two speakers are conversing, it is not possible to indicate the interruptions and simultaneous speech in a clear and simple way. For these cases a so-called ``score notation'' can be used. As in a musical score, the different speakers are each given a separate track, one above the other. The tracks must be synchronised. A computer program, ``syncwriter'', that handles these types of conversation has been developed for the Apple Macintosh. Information about this program can be obtained at ger.xse0014@applelink.apple.com.
It is also possible to collect dialogues in which simultaneous speech is avoided. In part of the Verbmobil corpus the dialogue partners are recorded separately; the partners press a button when they start to speak, and this controls the recording. The recordings are made in two rooms, separated by a glass screen, so that the speakers can see each other. The speakers can hear each other by means of headphones. Of course, this situation is not as natural as when both speakers are allowed to speak at the same time.

Levels of transcription

All types of speech, whether read or spontaneous, whether monologues or dialogues, can be transcribed at different levels. The most global one, the orthographic transcription, will be used in the case of large corpora, or in cases in which the research questions can be answered on the basis of a global indication of what has been said. More precise transcriptions will be used for smaller corpora. A distinction can be made between the following types of transcription:

An orthographic transcription gives a verbatim description of what has been said, using the normal spelling of a language. In a phonemic transcription each symbol refers to a single phoneme, in whatever context this phoneme occurs. In an allophonic transcription a single phoneme may be represented by different symbols, depending on the structural or environmental context in which the phoneme occurs. Sometimes a phonemic transcription is called a broad transcription. Phonemic and allophonic transcriptions provide generalised, abstract statements about the pronunciation of a language or dialect. These transcriptions are also called systematic transcriptions, since they are used to make explicit observations on rule-based generalisations about regular, patterned activities in the accent concerned (Laver 1994: 550). A phonetic transcription makes more or less specific comment on phonetic details of pronunciation. Therefore, it is sometimes called a narrow transcription. A combination of phonemic and allophonic transcription is possible, e.g. when only the allophones of certain phonemes are given. In a prosodic transcription, prosodic features like intonation, rhythm, and accents are represented. In the following we will deal with the different levels of transcription in relation to spoken language corpora.

Orthographic transcription

By orthographic transcription  (or transliteration ) we mean that the standard spelling of a given language is used to indicate the spoken words. Orthographic transcriptions  are used in large scale speech corpora and in corpora used for research in which details about pronunciations of words are not important, e.g. in lexical research. In the first case a detailed transcription of at least part of the corpus is perhaps desirable (especially if such a corpus is used for the training of speech recognition systems), but because of the huge amount of work this cannot always be done. In the latter case precise transcriptions are not even necessary.
The use of orthographic spelling necessarily implies a compromise between the sounds heard and what has to be written down. Especially for spontaneous speech the difference between what is heard and the symbolic representation may be large.

Reduced word forms

Because of the discrepancy between what is heard and what has to be written down, it has been decided in many spontaneous speech corpora to indicate reduced word forms. One should use the reduced word forms which are given in the standard dictionary of a given language. However, for the sake of consistency one is sometimes forced to use forms which are not present in the dictionary. For example, in German, prepositions and articles are written as a single word in the case of a reduction in the number of syllables (``zu der'' is written as ``zur''). These are existing forms in the Duden dictionary. However, in the Verbmobil corpus one is allowed to write ``fürn'' for ``für den'', although this form does not exist in the dictionary. For the Dutch Speech Styles corpus, too, it was decided to indicate reduced word forms. Criteria for indicating reduced forms in a transliteration may be a) the frequency of occurrence of these forms and b) a reduction in the number of syllables.
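
A minimal sketch of how such conventions might be exploited computationally, using a small hypothetical word list, is the following expansion of reduced forms back to their full forms:

# Minimal sketch with a hypothetical, corpus-specific word list: expand reduced
# word forms found in a transliteration back to their full forms. The mapping
# mixes dictionary forms ("zur") with corpus conventions such as the Verbmobil
# form "fürn", which is not in the Duden dictionary.
REDUCED_FORMS = {
    "zur": "zu der",
    "zum": "zu dem",
    "fürn": "für den",
}

def expand_reduced_forms(tokens):
    """Replace each reduced token by its listed full form, if any."""
    return [REDUCED_FORMS.get(token.lower(), token) for token in tokens]

print(expand_reduced_forms("wir gehen fürn Moment zur Post".split()))
# ['wir', 'gehen', 'für den', 'Moment', 'zu der', 'Post']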

Dialect forms

Even in speech corpora which are meant to collect the standard variant of a given language, speakers may have their own idiolect or may use words that have a dialect basis. These words have to be marked in the transcription. Verbmobil chose to transcribe orthographically those dialect words which are not in the Duden dictionary. It is possible to give information about the meaning of these words, for example:
moin, moin <norddeutsche Grußformel>, i.e. the dialect form of ``good morning'' spoken in the north of Germany.

Numbers

In orthographic transliterations numbers are usually not represented as digits, but are written down as words. In some cases it is decided to deviate from the standard spelling in order to avoid very long words. For example, in the Verbmobil corpus the numbers 13 to 99, as well as the hundreds from 1 to 19 (the years), are written as a single word. However, all the other numbers are written separately, not conforming to the rules of the dictionary. Thus:
1993: neunzehnhundert dreiundneunzig
349614: drei Millionen neunundvierzig tausend sechshundert vierzehn
349: dreihundert und neunundvierzig

Abbreviations and spelled words

In orthographic transcriptions one usually writes the full forms of abbreviations: ``a.o.'' is written as ``among others'', German ``usw.'' is written as ``und so weiter''. Abbreviations which are spoken as words are spelled as words, e.g. Benelux, OPEC, NATO.
In spoken language corpora like Polyphone, one type of item that speakers had to read was spelled words. Words can be spelled in different ways: the names of the letters can be pronounced, like A, B, C, but one can also use words beginning with the letter concerned, like Adrian, Bernard, Christian. In transcriptions, spellings have to be indicated, also when they concern not whole words but only parts of words: USA-trip, Vitamin-C. In the Verbmobil corpus, spelled letters are indicated by capitals, each preceded by ``$'', with a space before the ``$'' except when a hyphen precedes it: $U $S $A-trip, Vitamin-$C.
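
A minimal sketch of how this convention can be processed automatically (the regular expression below is an assumption based on the examples given, not part of the Verbmobil tools):

import re

# Minimal sketch: find the letters marked as spelled in a Verbmobil-style
# transliteration, where each spelled letter is a capital preceded by "$".
SPELLED = re.compile(r"\$([A-Z])")

def spelled_letters(text):
    """Return the individually spelled letters in a transliteration line."""
    return SPELLED.findall(text)

print(spelled_letters("$U $S $A-trip, Vitamin-$C"))   # ['U', 'S', 'A', 'C']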

Interjections

Interjections like ``au'', ``ah'', ``oh'', and ``mm'', or the French ``hein'', must be indicated according to the standard spelling of these forms in the given language. If there is no standard spelling for a certain interjection, one has to decide on a spelling.

Orthographic transcription of read speech

When orthographic transcriptions  are used for corpora containing read speech , the original written text may function as the default transcription. The transliteration indicates how well the written text was read by the speaker. For single words or short sentences, speakers will make relatively few mistakes, as has been experienced with e.g. the Dutch POLYPHONE  Corpus and the Groningen Corpus. For read texts (even short ones) it turns out that speakers often make mistakes. These mistakes are mostly related to deletions  of words, false starts, or hesitations. They may also use words that were not in the texts or they may use a different word order. Furthermore, speakers may mispronounce words. For example, they may add syllables , leave them out, or they may scramble syllables .

Depending on the intended use of a corpus, whether of read or spontaneous speech, these disfluencies must be indicated in the transcriptions. If the corpus will be used for initialising speech recognition systems, for instance, all sounds must be indicated, including hesitations, filled pauses, etc. One can also think of research on reading errors, which will need these disfluencies to be written down. On the other hand, if a corpus will be used to find the types of syntactic structure that people use in a certain dialogue system, it is not necessary to indicate all events that occur in the signal. In the section on non-linguistic phenomena below we will give an overview of annotations which can be used in a transcription. These annotations will concern types of (non-)verbal sounds produced by the speakers (hesitations, coughing, laughing), but also types of background noise (slamming of doors, ringing of telephones).

Phonemic transcription

By phonemic/phonotypical transcription we mean transcription based on the phonemes of a given language. A phonemic transcription is sometimes called a broad transcription. Of course, phonemic transcriptions take more time than orthographic transcriptions: one has to listen more closely to what has been said. For phonemic transcriptions, the IPA (International Phonetic Association) system is mostly used. SAMPA is considered to be a computer version of part of the IPA system (see Appendix A). SAMPA is based on ASCII characters. There are also other computer versions, such as the one indicated in the table below.

It is possible to convert an orthographic transcription into a phonemic representation by means of automatic grapheme-to-phoneme conversion. Of course, one cannot be sure that the phonemic representation obtained in this way is the one that was actually pronounced by the speaker. Possible variations in the pronunciation of words are not incorporated in present text-to-speech systems. For many purposes, automatic grapheme-to-phoneme conversion will be sufficient. However, for initialising the training of speech recognition systems this level of transcription is perhaps too global.
It is possible to use the automatically obtained phonemic representation as the starting point of the phonetic transcription of what has actually been pronounced.
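
The following minimal sketch illustrates the dictionary-lookup half of such a conversion, with a hypothetical three-word lexicon and SAMPA-like symbols; a real converter would add letter-to-sound rules for out-of-vocabulary words:

# Minimal sketch: derive a citation-phonemic representation from an
# orthographic transcription by dictionary lookup. Out-of-vocabulary words
# are merely flagged here.
LEXICON = {
    "this": "D I s",
    "is": "I z",
    "speech": "s p i: tS",
}

def graphemes_to_phonemes(words):
    """Look up each word; mark words that are missing from the lexicon."""
    return [LEXICON.get(word.lower(), "<OOV:" + word + ">") for word in words]

print(graphemes_to_phonemes("This is speech".split()))
# ['D I s', 'I z', 's p i: tS']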

Allophonic transcription

Phonemic transcriptions  focus on the phonemic contrasts in the language, whereas allophonic transcriptions  focus on the ways in which spoken material reveals differences associated with structural and environmental context. If different symbols are used for a single phoneme when this phoneme occurs in different contexts, one speaks of an allophonic transcription . It is possible to combine phonemic and allophonic transcriptions, e.g. by indicating only the allophonic variants of a subset of phonemes.

Phonetic transcription

A phonetic transcription is even more precise than a phonemic one. A phonetic transcription represents the pronunciation of words by individual speakers; no attention is paid to phonological theory. Allophonic and phonetic transcriptions are sometimes called narrow transcriptions.
Diacritical marks, like the ones for labialisation, nasalisation, and synchronic articulation, are used to indicate the exact pronunciation of sounds. Furthermore, special signs are used to indicate length, stress, and tones. Such a narrow phonetic transcription is very time-consuming to make and cannot be done for large speech corpora. It has been found that providing reliable phonetic transcriptions for large corpora is hardly feasible (cf. Cucchiarini 1993). However, detailed transcriptions of a small number of specific phenomena (e.g. presence/absence of diphthongation, voiced/voiceless character of fricatives) can be made relatively quickly and reliably if the occurrences of these phenomena can be retrieved quickly with the aid of annotation and the direct access to files offered by a computerised speech corpus (cf. Van Hout 1989 and Van Bezooijen & Van Hout 1985).

The CRIL convention

CRIL (Computer representation of individual languages)  conventions have been defined and proposed by a working group at the 1991 Kiel convention of the International Phonetic Association . The conventions consist of two parts:

The first component has been introduced to enable broader use and dissemination of the descriptive IPA categories, i.e. the IPA symbols as well as the IPA diacritics needed for the narrow transcription of regular and defective speech utterances.

The second component of CRIL  is devoted to a standardised representation of natural speech productions and introduces three systematically distinct levels for specifying what could be called the text of a spoken utterance. These levels are:

  1. Orthographic level
    This level contains the orthographic representation of the spoken text.
  2. Phonetic level
    This level specifies the phonetic form of a given word in its full (unreduced) segmental  structure which only appears when a word is spoken in isolation.
  3. Narrow Phonetic level
    This level gives a narrow phonetic  transcription  of the actually spoken words. It is only on this level that phonetic categories can be directly related to the speech signal itself.
The CRIL  conventions have been shown to work very well for the characterisation of speech data in the German PHONDAT and Verbmobil-PHONDAT corpora.

During the International Conference on Spoken Language Processing (ICSLP) in Banff in 1992 a workshop was held on Orthographic and Phonetic Transcription. The workshop goals were to agree on areas where community-wide conventions are needed, to identify and document current work, and to establish a means of future communication and continued cooperation. The notes from this workshop are available through anonymous ftp (cse.ogi.edu).

Prosodic transcription

Usually, some kind of prosodic  information is indicated in the levels of transcription  mentioned above. This is especially true for narrow phonetic  transcriptions , in which attention is paid to stress  and tones . But even in orthographic transcriptions  (ATIS, Switchboard, Dutch Speech Styles Corpus) some prosodic  variables are marked, such as lengthening of sounds, pauses in words and utterances, emphatic stress , and intonational  boundaries. A systematic and theory-based system for transcribing the intonation  patterns and other prosodic  aspects of English utterances, the ToBI  system, has only recently been developed.

ToBI transcription system

For an overview of existing prosodic transcription approaches, see Llisterri (1994). Here, we briefly present the ToBI transcription system.
The ToBI transcription system (Tones and Break Indices; cf. Silverman et al. 1992) is meant to be a standard for prosodic transcriptions. The system makes use of speech files, fundamental frequency contours, and label files. A ToBI transcription of an utterance (next to the speech recording and a fundamental frequency contour) consists minimally of symbolic labels for the following four tiers: an orthographic tier, a tone tier, a break-index tier, and a miscellaneous tier. The orthographic tier specifies the words in the utterance using ordinary orthography. The tone tier specifies the tonal properties of the F0 contour of the utterance. The break-index tier specifies the degree of disjuncture between words in the orthographic transcription, and the miscellaneous tier is used for additional ToBI notations and for individual or local additions. The tone and break-index tiers represent the core prosodic analysis. The system does not handle all types of prosodic factors, but it is flexible enough to be adjusted to one's own needs. It is recommended to use this system if prosodic information is necessary for a corpus. Information about the ToBI system, as well as guidelines, can be obtained by anonymous ftp (kiwi.nmt.edu).
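
A minimal sketch of how such a four-tier transcription might be represented in software (a hypothetical container, not an official ToBI tool):

from dataclasses import dataclass, field
from typing import List, Tuple

# Minimal sketch: a ToBI transcription holds time-aligned labels on four
# tiers, next to the speech file and the F0 contour.
@dataclass
class ToBITranscription:
    orthographic: List[Tuple[float, str]] = field(default_factory=list)   # (time, word)
    tones: List[Tuple[float, str]] = field(default_factory=list)          # (time, tone label)
    break_indices: List[Tuple[float, int]] = field(default_factory=list)  # (time, index 0-4)
    miscellaneous: List[Tuple[float, str]] = field(default_factory=list)  # (time, comment)

tobi = ToBITranscription()
tobi.orthographic.append((0.42, "hello"))
tobi.tones.append((0.31, "H*"))
tobi.break_indices.append((0.42, 1))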

Segmentation and Labelling


Definition and motivation

Definition

Segmentation is the division of a speech file into non-overlapping sections corresponding to physical or linguistic units. Labelling is the assignment of physical or linguistic labels to these units. Both segmentation and labelling form a major part of current work in linguistic databases.

The term `transcription' may be used to refer to the representation of a text or an utterance as a string of symbols, without any linkage to the acoustic representation of the utterance. This was the pattern followed by speech and text corpus work during the 1980s, such as the prosodically transcribed Spoken English Corpus (Knowles, Taylor & Williams 1995). These corpora did not link the symbolic representation with the physical acoustic waveform, and hence were not fully machine-readable. A recent project, MARSEC (Roach et al. 1993), has generated these links for the Spoken English Corpus, such that it is now a segmented and labelled database. This is the form that is most useful to researchers in speech and language technology.

The types of segments that may be delimited are of various kinds, depending on the purpose for which the database is collected. The German PHONDAT and Verbmobil-PHONDAT corpora use the CRIL (Computer representation of individual languages) conventions formulated by a working group at the 1991 Kiel convention of the International Phonetic Association. These conventions propose three levels of representation: orthographic, phonetic and narrow phonetic. The orthographic level contains the orthographic representation of the spoken text. The phonetic level specifies the phonetic form of a word in citation form. The narrow phonetic level gives the phonetic labelling of the particular token of the word that was recorded.

A more detailed system of levels of labelling, which includes the above three levels and which grew out of the SAM project for the major European languages, has been proposed by Barry & Fourcin (1992). Each speech corpus will choose one or more of these levels; they are described in detail in the section on levels of segmentation and labelling below.

The format of label (transcription) files varies widely across research institutions. The WAVES format is becoming popular, and has the advantage of being human-readable. It is advisable to use a label file format that can easily be converted to a WAVES label file, for the sake of portability across different systems.
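
As an illustration, the following hedged sketch reads such a label file, assuming the common xlabel layout of a header terminated by a line containing only ``#'', followed by one line per label of the form ``end-time colour label'':

# Minimal sketch (hedged): read a WAVES-style (xlabel) label file, assuming a
# header terminated by a "#" line, then one line per label giving the end time
# in seconds, a colour number, and the label text.
def read_waves_labels(path):
    labels, in_header = [], True
    with open(path) as label_file:
        for line in label_file:
            line = line.strip()
            if in_header:
                in_header = (line != "#")   # the header ends at the "#" line
                continue
            if line:
                time, _colour, label = line.split(None, 2)
                labels.append((float(time), label))
    return labels   # e.g. [(0.2425, 'd'), (0.3112, 'I'), ...]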


A caveat

The treatment of speech as a sequence of segments to be delimited is to some extent a convenient fiction, made necessary by the requirements of speech technology. For example, it is notoriously difficult to define the boundaries between vowels and glides, or between a vowel and a following vowel. In addition, information about the place of articulation of a consonant is usually contained in its neighbouring vowels rather than in the consonant itself. In the case of place assimilation, electropalatographic studies have shown that there is often a residual gesture towards the underlying segment (Nolan 1987). Hence one cannot describe the speech signal as a simple string of discrete phones in absolute terms.

Notwithstanding the above, Roach et al. (1990) argue that the attempt to segment speech is valid, as many segments (especially some consonants) have very clear acoustic boundaries. Where clear acoustic boundaries do not exist in the speech signal, selecting a fairly arbitrary point is better than doing no segmenting at all, from the viewpoint of speech technology research. Since segmented corpora are required for training HMM-based recognisers, problems of this kind could be cancelled out by including a great deal of data of the problematic kind, so as to avoid skewing the statistical models with only one view of the boundary location.

Use in speech and language technology research

A segmented and labelled speech database is needed for training the HMMs used in many recognisers, as well as for testing them. In addition, such a database provides the raw data needed for deriving rules for text-to-speech synthesis (rules of duration, intonation, formant frequencies, etc.). In HMM-based speech recognition, vast amounts of training data are needed, especially for speaker-independent systems. Thus it is important for the segmentation and labelling to be accurate and reliable, as well as consistent across the entire speech database.

Use in linguistic research

A segmented and labelled speech database is also a primary resource in basic linguistic research, particularly in the case of little-researched languages. Such a database can yield fundamental information on the acoustic parameters of speech sounds of the language, as well as more detailed information on such things as patterns of duration variation according to linguistic context.

Levels of segmentation and labelling

The various levels of segmentation described below are based on those outlined in Barry & Fourcin (1992) and Autesserre, Pérennou & Rossi (1989), with some slight differences.

Recording script

In the case of speech read from a script, the simplest and quickest level of annotation is the orthographic form of the words in the recording script, as these are readily available. However, most researchers will need more detail than this, and so other levels must also be considered.

Orthographic

Even in the case of read speech, the orthographic level will be needed, since the speaker will not always follow the recording script exactly. The speaker may insert, delete or transpose material, and there may be false starts and self-corrections. The orthographic level will include the orthographic form of what the speaker actually said (this will be the initial level in the case of spontaneous speech). Decisions will have to be made on how to spell non-standard vocalisations, abbreviations, dialect words etc., and also on where to divide words in the case of languages with many compound word forms. For example, contractions (such as English ``I'm'', ``won't'') may need to be included in a small lexicon of anomalous forms, used together with the main lexicon when transcribing the speech orthographically. During the transcription of spontaneous monologues in the Dutch Speech Styles Corpus, this small list was found to be of a manageable size (about 30 items). The comparison of this level with the level of the recording script may be of interest to linguists studying speakers' language performance, but otherwise the level of the recording script is not of immediate interest.

Recommendation 1

When transcribing a corpus orthographically, it is advisable to generate a list of all unique word forms found in the transcription. This list will then form the input to a grapheme-to-phoneme conversion module (which may involve accessing a phonemic dictionary and/or running letter-to-sound rules). The output of this module will be the (citation) phonemic form of the speech, which can form a basis for later adaptation to various accents of the same language.
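
A minimal sketch of this first step, collecting the unique word forms from an orthographic transcription (corpus-specific annotation marks such as brackets or ``$'' spelling marks would need their own handling; here only simple punctuation is stripped):

# Minimal sketch: collect the unique word forms in an orthographic
# transcription, as input to a grapheme-to-phoneme conversion module.
def unique_word_forms(lines):
    forms = set()
    for line in lines:
        for token in line.lower().split():
            token = token.strip(".,;:?!\"'")
            if token:
                forms.add(token)
    return sorted(forms)

transcription = ["Please show flights from Dallas", "please show me the flights"]
print(unique_word_forms(transcription))
# ['dallas', 'flights', 'from', 'me', 'please', 'show', 'the']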

Morpho-syntactic

The database may also be labelled at the morpho-syntactic level. This is of particular interest to those wishing to study the relationship between prosody and syntax. The researcher will label such items as clause and phrase boundaries, compound word internal boundaries, and possibly affix boundaries also.

Citation-phonemic

The citation-phonemic level (also referred to as the `phonemic' level by Barry & Fourcin 1992) may contain the output phoneme string derived from the orthographic form (by lexical access, by letter-to-sound rules, or both). There are various possibilities for representing the phoneme symbols. Some platforms have the facility to display the full range of IPA symbols, such as the symbols provided by the LaTeX font wsuipa11 (see the table below).

[ insert latex source for original Table 2.1 here, and re-number it ]

However, many researchers will need to use some other means, such as an alphabetic or numeric representation of IPA symbols. The numeric equivalents of all IPA symbols are displayed in Esling & Gaylord (1993). An alphabetic equivalent system comprising all IPA symbols is used on the newsgroup sci.lang (see Appendix A).

If the requirement is narrowed to symbols only for the main European languages, then the SAMPA system (see Appendix A) will be sufficient. This system has the advantage that, for any given language, only one grapheme is used per phoneme, with no spaces in between. Other systems that have been proposed (principally for English) allow a string of two or more graphemes to represent a phoneme, but a space between each phoneme representation is then necessary.

In the case of English, there are still more alphabetic systems that have been used in the past, such as (for American English) ARPABET and KLATTBET, and (for British English RP) Edinburgh's Machine-Readable Phonetic Alphabet, all of which use short grapheme strings separated by a blank space. However, a language-specific set of alphabetic phoneme symbols has not yet been devised for all possible languages.

Recommendation 2

If the corpus is confined to one language, and if the labelling is to be alphabetic rather than true IPA symbols, then it is advisable to use a language-specific set of characters. This avoids the notational complexity necessary when all symbols must be kept distinct across all languages, as is needed in the study of general phonetics.

Broad phonetic (or phonotypical)

The citation-phonemic labelling will not show such phenomena of running speech as place assimilation, consonant deletion, French liaison, RP linking-r and vowel reduction, etc. Hence there is a need for this more detailed level, which may (at least initially) be generated entirely by phonological rules from the citation-phonemic representation. It uses only symbols that have the status of phonemes, marking the output of connected speech processes that either insert or delete phonemes, or transform one phoneme into another.
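
As a sketch of how such a level might be derived purely by rule, the following fragment applies two toy rewrite rules (hypothetical, with SAMPA-like symbols) to a citation-phonemic string:

import re

# Minimal sketch: derive a broad phonetic (phonotypical) form from a
# citation-phonemic string purely by rule. Real rule sets are context-
# sensitive; these two merely illustrate place assimilation and reduction.
RULES = [
    (re.compile(r"n (?=[pbm])"), "m "),   # /n/ assimilates before a bilabial
    (re.compile(r"@ "), ""),              # toy rule: delete a schwa
]

def phonotypical(citation_phonemic):
    form = citation_phonemic
    for pattern, replacement in RULES:
        form = pattern.sub(replacement, form)
    return form

print(phonotypical("t E n p aU n d z"))   # 't E m p aU n d z'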

It would be possible to segment and label a small sample of speech manually at this level of labelling, and then use this speech to train HMM models. The models would then be used to segment the main part of the database automatically. The resulting segmentations could be used as input to a process of manual post-editing, which would also be at the broad phonetic level, and ought not to require too much manual intervention. Alternatively, if the database is very large, there would be no manual intervention (in which case a certain loss of accuracy would have to be accepted).

This is the level referred to as `phonotypical' for French labelling by Autesserre, Pérennou & Rossi (1989). It may be derived initially from the citation-phonemic labelling by phonological rules which, for example, reduce or delete unstressed vowels in English, or which delete some cases of word-final schwa in French. At this stage, it does not require reference to the sounds actually produced by the speaker (though this may come later, in manual post-editing). The `broad phonetic' level of Barry & Fourcin (1992) is intended as a phonemic-level representation of the speaker's tokens, i.e. the labeller makes reference to the speech signal. The `phonotypical' level of Autesserre, Pérennou & Rossi (1989) is intended to be derived purely by rule. Both approaches are possible; however, in the context of speech recogniser training, there are certain advantages in deriving this level of labelling purely by rule, as this is relatively quick and easy. This level of labelling uses a limited inventory of symbols (and so it is a practical proposition for very large databases), while also offering more phonetic detail than the citation forms. So there is a balance between accuracy of representation and ease of derivation of the representation. However, there will still be some discrepancies between completely rule-derived labelling (where the corpus is automatically segmented) and the speaker's utterance. Hence there is some loss of accuracy, which may possibly be offset by a large volume of data.

Narrow phonetic

The narrow phonetic level of labelling is the first one where the labeller cannot avoid listening to the recording and/or inspecting the waveform and spectrogram. This is because it attempts to represent what the speaker actually said at the time of recording. This consideration immediately increases the time and effort necessary, as every part of the speech must be inspected manually.

The inventory of symbols is increased to include sounds that do not have phonemic status in the language (such as a glottal stop for English, or a symbol for aspiration after a voiceless plosive). It is at this level that different allophones may be represented, as well as devoicing or voicing, and secondary articulations such as nasalisation or labialisation. One segment at this level (e.g. a voiceless plosive) may correspond to more than one segment on the acoustic-phonetic level (e.g. the closure phase and burst of a plosive). There are potential problems in determining the boundaries of segments at this level, especially in the case of `desequencing' or transposition of two sounds (Barry and Fourcin, op. cit.). However, this level of representation will be far more accurate as a record of what was said, and hence will yield more reliable HMMs when training a speech recogniser. This implies that not so much data will be needed, provided it is accurately transcribed at this level.

Recommendation 3

It is better not to embark without good reason on a level of labelling that requires the researcher to inspect the speech itself, as this greatly increases the resources needed (in terms of time and effort). If the broad phonetic (i.e. phonotypical) level is considered sufficient, then labelling at the narrow phonetic level should not be undertaken.

Acoustic-phonetic

This level of labelling distinguishes every portion of speech that is recognisably a separate segment of the acoustic waveform or spectrogram. Hence stop closures, release bursts, and aspiration phases will all be segmented separately, as will the glottal onset of a vowel, or separate voiced and voiceless portions of a fricative, nasal or stop. The labelling will be done in terms of these recognised articulatory categories, which can be easily related to the labels at higher linguistic levels. The boundaries of some segment types (glides and vowels) will to some extent be arbitrary, and a criterion for homogeneity such as ``sameness of change'' will have to be employed (e.g. in the falling second formant for a palatal glide).

Physical

The most detailed level of labelling is the physical level. This does not need to relate only to an acoustic record, but could have separate tiers related to different types of input (e.g. nasal transmission detectors, palatography). However, acoustic parameters are likely to be the most frequent type of representation, and different types may be needed for particular application areas (e.g. filter bank output energies, formant frequencies, LPC and cepstral coefficients, fundamental frequency or electroglottograph output waveform). The physical events may be overlapping or discrete in time, with each parameter allotted a separate annotation row (e.g. nasal resonance, periodicity, high-frequency noise). This level of labelling has not been generally used in speech technology research to date, but it has the potential to serve as a resource for developing speech synthesisers with greater `naturalness', and speech recognisers which include more speech knowledge in their algorithms.

Non-linguistic phenomena

Another level of labelling may be used for non-speech phenomena. This includes speaker noises such as coughing, laughter, and lip smacking, as well as extraneous noises such as the barking of dogs and the slamming of doors. In addition, this level could also be used to label paralinguistic information such as disfluencies and filled pauses. The type of representation used for such annotations will depend on the purpose of the database. An annotation system such as that proposed by the Text Encoding Initiative is very elaborate and makes heavy demands on a transcriber, but also makes it possible to derive all relevant information from a transcription. While the TEI system makes use of SGML, which guarantees that existing software can be used, there is a steep initial learning curve for the transcriber, which increases the possibility of human error in the transcription. Other annotation systems (such as those used in ATIS and Switchboard) are less elaborate, but also easier for transcribers to learn. The conventions used in ATIS, Switchboard, Polyphone and the Groningen corpus consist of different types of brackets with possible additional glosses. Retrieval software referring to these particular annotations must be designed in a more or less ad hoc way, which is less convenient than the TEI system. However, it is possible to provide standard scripts for a speech corpus. It is important to find the correct balance between the sophistication of the annotation system and its practicality from the transcriber's point of view.
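
As an illustration of the bracket-based conventions, the following minimal sketch (the bracket syntax is a simplified assumption, not the exact ATIS or Switchboard format) separates the spoken words from the annotated events:

import re

# Minimal sketch: split a transcription line into spoken words and bracketed
# annotations such as noise or laughter markers.
ANNOTATION = re.compile(r"\[(.*?)\]")

def split_annotations(line):
    """Return (words, annotated events) for one transcription line."""
    events = ANNOTATION.findall(line)
    words = ANNOTATION.sub(" ", line).split()
    return words, events

print(split_annotations("Please show [door_slam] flights from Dallas [laughter]"))
# (['Please', 'show', 'flights', 'from', 'Dallas'], ['door_slam', 'laughter'])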

The following types of phenomena could conceivably be annotated on this level of representation:

Omissions (in read text)
: Words from the recording script which were omitted by the speaker may be indicated. In spontaneous speech, it is very difficult to know whether a speaker has omitted words which he or she actually intended to say, and so omission is only relevant in the case of read speech.
Verbal deletions and corrections (implicit or explicit)
: Words that are verbally deleted by the speaker may be indicated. These are words that are first uttered, then superseded by subsequent speech. This can be done explicitly, as in ``Can you give me some information about the price, I mean the place where I can find it''. Alternatively, it can be done implicitly, as in ``Can you give me some information about the price, place where I can find it''. Verbal deletions may be indicated in both read and spontaneous speech.
Word fragments
: Word fragments comprise one or more sounds belonging to one word. For example, in ATIS word fragments are indicated by a hyphen, as in ``Please show fli- flights from Dallas''.
Unintelligible words
: Sometimes only part of a word is unintelligible, in which case only the intelligible part is transcribed orthographically. If a word is completely unintelligible, this fact will be annotated on the level of `non-linguistic phenomena'.
Hesitations and filled pauses
: Filled pauses (such as `uh' and `mm') may be indicated. Some annotation conventions (e.g. Polyphone and Switchboard) annotate only one or two types of filled pause (`uh' and `mm', or purely `uh'). Other systems (e.g. ATIS and Speech Styles) annotate more than two types (e.g. `uh', `mm', `um', `er', `ah'). The types of filled pause vary across languages (for example, the English `er' is not used in Dutch). The recommendation is to use at least two types: one vowel-like type `uh', and one nasal type `mm'.
Non-speech acoustic events
: These can be made either by the speaker or by outside sources. The first category includes lip smacks, grunts, laughter, heavy breathing and coughing. The second category includes doors slamming, phones ringing, dogs barking, and all kinds of noises from other speakers. The Switchboard corpus uses a very extensive list of non-speech acoustic events, ranging from bird squawk to groaning and yawning. The recommendation is that these events are annotated at the correct location in the utterance, by first transcribing the words and then indicating which words are simultaneous with the acoustic events.
Simultaneous speech
: For dialogues and interviews, words spoken simultaneously by two or more speakers may be indicated.
Speaking turns
: Discourse analysis makes use of indications of different speaking turns and initiatives. While these are not generally used in speech technology, it would always be possible to transcribe them.

General recommendations for transcribing

The following are some general recommendations of a fairly practical kind.

Utterance information
: The recommendation is that transcribers annotate information about the process of transcribing, as some indication of the difficulty of transcribing may be useful in later analysis. The transcribers of the Switchboard telephone corpus annotated (on a scale of one to five) the following characteristics of a conversation: difficulty, topicality, naturalness, amount of echo from speaker B/speaker A, static on A/B, background noise on A/B.
Labelling procedure
: Where labels are annotated at more than one linguistic level, the recommendation is to listen to only one level at a time. This is because, in ordinary conversational listening, hearers mentally delete all hesitations, false starts etc. and do not analyse intonation explicitly. However, transcribers must learn to hear all these events in an analytical way. The easiest method is to listen to and label the words first, and then to label the prosodic phenomena and other annotations.
Transcriber features
: In the case of orthographic transcriptions, it is not necessary to use experienced transcribers. However, in the case of phonemic and phonetic transcriptions, the recommendation is that the transcribers should be trained phoneticians, or otherwise accustomed to listening to speech in a very analytical way.
Transcribing time
: An orthographic transcription of spontaneous speech will require about ten times real time (i.e. ten times the duration of the speech in question). An orthographic transcription of read sentences will require about three times real time, while an orthographic transcription of read continuous texts will require about five times real time (see the sketch after this list). A phonemic or phonetic transcription will require very many more times real time: the precise factor cannot be determined, as so many variables are involved.
Checking of transcriptions
: It is always necessary to check transcriptions, and this can be done in different ways. An independent transcriber can transcribe the whole or a sample portion of the corpus. Alternatively, a non-transcriber can scan the transcription while listening to the speech, which is less time-consuming. In the latter case, the recommendation is to perform the checking procedure over the database in the reverse order to that in which it was transcribed. This is because the transcriber will have become more consistent in his/her use of the transcription system by the end of the database, and so the final part of the database will train the checker as he/she begins to work with the particular system used.
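
As a simple illustration of these factors, the sketch below estimates transcriber effort from the real-time multiples given in the list above (the figures are those quoted; the function itself is hypothetical):

# Minimal sketch: estimate transcriber effort for an orthographic
# transcription as a multiple of the real-time duration of the recordings.
REAL_TIME_FACTORS = {
    "spontaneous": 10,      # spontaneous speech: about ten times real time
    "read_sentences": 3,    # read sentences: about three times real time
    "read_texts": 5,        # read continuous texts: about five times real time
}

def estimated_hours(speech_hours, speech_type):
    return speech_hours * REAL_TIME_FACTORS[speech_type]

print(estimated_hours(2.5, "spontaneous"))   # 25.0 transcriber hours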

Manual segmentation

Manual segmentation refers to the process whereby an expert transcriber segments and labels a speech file by hand, referring only to the spectrogram and/or waveform. There is no automatic assistance in segmenting. The manual method is believed to be more accurate. Also, the use of a human transcriber ensures that the segment boundaries and labels (at least at the narrow phonetic level) are perceptually valid. However, there is a need for explicit segmentation criteria to ensure both inter- and intra-transcriber consistency, together with (ideally) some form of checking procedure. Sets of guidelines for manual segmentation have been developed by various projects. One such is Hieronymus et al. (1990), which uses the four levels of underlying phonemic, broad phonetic, narrow phonetic and acoustic. It also retains the same base phonemic symbol even at the acoustic level, to facilitate the automatic determination of boundaries at the phonetic level once the boundaries at the acoustic level have been determined. Much speech data (particularly in English) has been segmented and labelled entirely manually.

One possible measure of accuracy for segmentation and labelling is consistency between transcribers. Barry & Fourcin (1992) quote Cosi & Omologo (1991) as saying that one should not expect more than 90% agreement between transcribers. Eisen (1993) investigates inter-transcriber consistency for the separate tasks of segmentation and labelling at three different levels of labelling, and concludes that consistency depends partly on the degree of abstraction of the labelling level, and partly on the type of sound involved. The best results in labelling were achieved at the broad phonetic level, for fricatives, nasals and laterals, which showed greater than 90% consistency. The best results for segmentation were achieved at the acoustic-phonetic level, for the acoustic features `fricative', `voiced' and `vowel-like'.
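
A minimal sketch of such a consistency measure, counting the percentage of segments that receive the same label from two transcribers (segment boundaries are taken as identical here, which real studies would not assume):

# Minimal sketch: inter-transcriber labelling consistency as the percentage
# of segments given the same label by two transcribers.
def label_agreement(labels_a, labels_b):
    matches = sum(a == b for a, b in zip(labels_a, labels_b))
    return 100.0 * matches / max(len(labels_a), len(labels_b))

print(label_agreement(list("tEmpaUndz"), list("tEnpaUndz")))   # about 88.9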

Recommendation 4

Any accuracy measure based on inter-transcriber consistency must control for the factors `level of transcription', `segment type', and `task type' (whether segmentation or labelling).

Automatic and semi-automatic segmentation

Automatic segmentation refers to the process whereby segment boundaries are assigned automatically by a program. This will probably be an HMM-based speech recogniser that has been given the correct symbol string as input. The output boundaries may not be entirely accurate, especially if the training data was sparse. Semi-automatic segmentation refers to the process whereby this automatic segmentation is followed by manual checking and editing of the segment boundaries.

This form of segmenting is motivated by the need to segment very large databases for the purpose of training ever more comprehensive recognisers. Manual segmentation is extremely costly in time and effort, and automatic segmentation, if sufficiently accurate, could provide a short cut. However, it is still necessary for the researcher to derive the correct symbol string to input to the autosegmenter. This may be derived automatically from an orthographic transcription, in which case it will not always correspond to the particular utterance unless manually checked and edited. The amount of inaccuracy that is acceptable will depend on the uses to which the database is to be put, and its overall size.

There will always be a need to verify the accuracy of an autosegmented database, and the obvious accuracy measure is the consistency between manual and automatic segmentation over a given subset of the database. Schmidt & Watson (1991) carried out this evaluation over nearly 6000 phoneme-sized segments, and found that the discrepancy between manual and automatic boundaries varied across segment types. The absolute mean discrepancy was greatest for diphthongs (5.4 ms) and least for nasals (0.37 ms). For 50% of all segmentations the discrepancy was less than 12 ms, while for 95% it was less than 40 ms. This falls within the range of just-noticeable differences in duration for sounds of the durational order of speech sounds (Lehiste 1970: 13), and so one could conclude that the discrepancies are not perceptually relevant.

This means that automatic segmentation for the given data, using the given autosegmenter, was probably sufficiently accurate.
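
A minimal sketch of such a comparison, computing the mean and 95th-percentile absolute boundary discrepancy in milliseconds for a handful of invented boundary times:

# Minimal sketch: compare manually and automatically placed segment
# boundaries (in seconds) and report discrepancies in milliseconds.
def boundary_discrepancy(manual, automatic):
    diffs = sorted(abs(m - a) * 1000.0 for m, a in zip(manual, automatic))
    mean = sum(diffs) / len(diffs)
    p95 = diffs[int(round(0.95 * (len(diffs) - 1)))]
    return mean, p95

manual    = [0.120, 0.245, 0.410, 0.598]
automatic = [0.118, 0.251, 0.409, 0.612]
print(boundary_discrepancy(manual, automatic))   # about (5.75, 14.0)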

Segmentation and labelling in the VERBMOBIL project

In the German VERBMOBIL project, segmentation and labelling of recorded speech data is a fundamental part of the research. The following procedures are adopted. The phonemic labels are based on the SAMPA symbols for German, augmented by extra labels for phonetic segments such as plosive release, aspiration after a plosive, creak, and nasalisation of a vowel as a reflex of a deleted nasal. Hence the labelling is carried out at the narrow phonetic level.

During the labelling process, a label will be aligned with the start of the portion of speech that is considered to represent its chief acoustic correlates. Labels are discrete and non-overlapping, except in the following cases:

  1. Labels for creak and nasalisation are always superimposed on other labels, which they modify.
  2. A special label (--MA) is used to indicate that the phonetic correlates of one or more deleted segments are present in the surrounding material. For example, where an unstressed rounded vowel has been deleted, labialisation may still be present in a neighbouring consonant, and will be marked in this way.

Inter-labeller consistency is maintained in three ways, as follows:

  1. The inventory of possible labels is restricted mainly to the list of German phonemes. This restriction minimises the possibility of error.
  2. The labeller works from a citation-phonemic form of the utterance that has been previously prepared. This eliminates gross errors.
  3. There are restrictions on the types of modification allowed. The labeller is permitted to mark the following: deletion (where the initially-provided label is marked with a following hyphen, to indicate deletion); insertion (where the new label is prefixed with a hyphen); and substitution (where the new label is inserted after the one initially provided, separated by a hyphen).
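
A hedged sketch of how these marking conventions might be interpreted by software (this is our reading of the conventions described above, not a Verbmobil tool):

# Minimal sketch: interpret a labeller's modification of a prepared
# citation-phonemic label, where a trailing hyphen marks deletion, a leading
# hyphen marks insertion, and "old-new" marks substitution.
def interpret_label(label):
    if label.endswith("-"):
        return ("deletion", label[:-1], None)
    if label.startswith("-"):
        return ("insertion", None, label[1:])
    if "-" in label:
        old, new = label.split("-", 1)
        return ("substitution", old, new)
    return ("unchanged", label, label)

for label in ["t-", "-Q", "n-m", "a:"]:
    print(label, interpret_label(label))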

The checking of segmented and labelled speech files is carried out partly by a program that detects invalid sequences of symbols, and partly by experienced labellers checking the work of less experienced transcribers. All segmenting and labelling is carried out manually. The initial citation-phonemic transcription is derived from the grapheme-to-phoneme converter of the Infovox TTS system (and then is checked for mistakes). A system of prosodic labelling is currently under development.

Prosodic labelling and annotation

Definition and motivation

Types of approach to prosodic labelling

The above discussion has been in terms of segmental labelling only. It is also possible to annotate a speech database at the prosodic (suprasegmental) level. This is less straightforward than segmental annotation, as there are far fewer clear acoustic cues to prosodic phenomena. The F0 curve will be the relevant acoustic display, possibly augmented by the intensity curve. The waveform is a useful guide to the current location in the speech and is usually displayed together with the F0 curve (as in the WAVES labelling software).

The units segmented will depend on the particular theoretical bias underlying the given research programme. A basic distinction may be drawn between a prosodic labelling system that annotates the boundaries of units (analogous to the method used in segmental annotation) and a system that annotates the occurrence of isolated prosodic events (such as F0 peaks) where the status of the units bounded by these annotations is of little interest.

The first type of method may possibly use the intonational categories proposed by Nespor & Vogel (1986), such as intonational phrase, phonological phrase, phonological word, foot, and syllable. Alternatively, it could mark the more traditional units of `minor tone-unit' and `major tone-unit', as in the MARSEC database (Roach et al. 1993). Or again, it could annotate the perceptual phonetic categories used in the `Dutch school' of intonation studies, such as rises and falls that are early, late or very late in their timing, fast or slow in their rate of change, and full or half sized ('t Hart, Collier & Cohen 1990). This type of annotation could be used in conjunction with annotation at the morphosyntactic level to yield information about the relationship between the syntactic and prosodic levels in terms of duration, pauses, etc.

The second type of method, though it may refer to the units mentioned above in its underlying theory, does not in fact annotate them but rather marks the occurrence of high and low tones of various kinds. The recently formulated ToBI transcription system (Silverman et al. 1992) is the best-known system of this kind for English.

Examples of the two types of approach

Prosodic annotation has only recently come into favour in the field of speech and language technology research. Now that a basic level of competence has been achieved as regards the synthesis and recognition of speech segments, researchers have come to realise that much more work is required on the prosodic aspects of speech technology. This is the motivation for the growth in popularity of speech database research, and for the formulation of the ToBI prosodic transcription system. In order for the prosodic transcriptions of various different speech databases to be comparable, and in order to make the best use of existing resources, the originators of ToBI (Silverman et al., op. cit.) proposed a simple system that would be easy to learn and that would lead to good inter-transcriber consistency. To date it has largely been used for English, especially American English, but at least in principle it could be extended to other languages as well. The system has certain severe limitations (e.g. it has no way of representing pitch range), but its minimalist formulation was dictated by the need for learnability and consistency in use. The `British school' type of system used in the MARSEC database of British English (Roach et al. 1993) contains more phonetic detail, but may require more effort in teaching to novice transcribers. The `IPO' classification of F0 patterns ('t Hart, Collier & Cohen 1990) has not yet been used systematically in the annotation of large-scale publicly available speech corpora, but has been used successfully in the development of speech synthesisers.

Prosodic transcription also has obvious uses in basic linguistic research, especially since research into the suprasegmental aspects of language is not nearly as advanced as research into the segmental aspects. As indicated above, a database annotated at the prosodic and morphosyntactic levels can provide information on the relationship between them with respect to duration and pauses. If the segmental level is also annotated, then many possibilities open up for the study of segmental duration in prosodic contexts. This is especially true in the case of languages other than English, where these aspects have received comparatively little attention to date.

The concept of levels of prosodic labelling applies differently to the two approaches to prosodic labelling outlined above. In the first case, the obvious categories would be those proposed by Nespor and Vogel (op. cit.), comprising levels of non-overlapping units, each of which corresponds to one or more units on the level immediately below (e.g. phonological phrase, foot, syllable). In the second case, the separate levels have no such intrinsic relationship to one another, but merely deal with different types of phenomena.

For example, in the ToBI system there are separate levels for tones and for inter-word `break indices'. The ToBI system can be described briefly in terms of its separate levels; this is done in what follows. The MARSEC system will be outlined after that. The `Dutch school' system of IPO will not be described in much detail, as it has not (yet) been used for the annotation of publicly available speech corpora; however, extensive references are available in 't Hart et al. (op. cit.).

The ToBI labelling system

A recent experiment (Pitrelli, Beckman and Hirschberg 1994) used several prosodic transcribers working independently on the same speech data, comprising both read and spontaneous American English speech. The ToBI system was used, and a high level of consistency across transcribers was found, even though the transcribers included both experts and newly-trained users of the system. This suggests that the system achieves its aim of being easy to learn and to apply consistently, at least in the case of American English.

The `orthographic' level of the ToBI system contains the orthographic words of the utterance (sometimes only partial words in the case of spontaneous speech). It is also possible to represent filled pauses (e.g. `um', `er') at this level.

The `miscellaneous' level may be used to mark the duration of such phenomena as silence, audible breaths, laughter and disfluencies. There is no exhaustive list of categories for this level, and different transcription projects may make their own decisions as to what to annotate.

The `break index' level is used to mark break indices, which are numbers representing the strength of the boundary between two orthographic words. The number 0 represents no boundary (with phonetic evidence of cliticisation, e.g. resyllabification of a consonant), and 4 represents a full intonation phrase boundary (usually `end of sentence' in read speech), defined by the occurrence of a final boundary tone after the last phrase tone. The number 3 represents an intermediate phrase boundary, defined by the occurrence of a phrase tone after the last pitch accent, while the number 1 represents most phrase-medial word boundaries. The number 2 represents either a strong disjuncture with a pause but no tonal discontinuity, or a disjuncture that is weaker than expected at a tonally-signalled full intonation or intermediate phrase boundary.

The `tone' level is used to mark the occurrence of phonological tones at appropriate points in the F0 contour. The basic tones are `L' or `H' (for `low' and `high'), but these may function as pitch accents, phrase accents or boundary tones, depending on their location in the prosodic unit. In the case of pitch accents (which occur on accented syllables), there may be one or two tones, and the H tone may or may not be `downstepped'.
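
To make the relationship between these levels concrete, the following is a minimal sketch (in Python, not part of any ToBI distribution) of how the word, break-index and tone tiers of a single utterance might be stored together; the words, times and labels are purely illustrative.

  # Sketch of a time-aligned ToBI-style transcription with three tiers.
  # Times are in seconds and invented for illustration.

  from dataclasses import dataclass
  from typing import List

  @dataclass
  class Word:
      text: str
      end_time: float        # words are marked at their end point

  @dataclass
  class BreakIndex:
      value: int             # 0-4, strength of the boundary after a word
      time: float

  @dataclass
  class Tone:
      label: str             # e.g. "H*", "L-", "L%"
      time: float

  @dataclass
  class ToBITranscription:
      words: List[Word]
      breaks: List[BreakIndex]
      tones: List[Tone]

  # Illustrative fragment: "Marianna made the marmalade"
  example = ToBITranscription(
      words=[Word("Marianna", 0.60), Word("made", 0.85),
             Word("the", 0.95), Word("marmalade", 1.70)],
      breaks=[BreakIndex(1, 0.60), BreakIndex(1, 0.85),
              BreakIndex(1, 0.95), BreakIndex(4, 1.70)],
      tones=[Tone("H*", 0.35), Tone("H*", 1.20),
             Tone("L-", 1.65), Tone("L%", 1.70)],
  )
  print(example.words[0].text, example.breaks[-1].value)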

Information about the ToBI system and guidelines for transcribing are available by anonymous ftp from kiwi.nmt.edu.

The MARSEC labelling system

The MARSEC project (Roach et al. 1993) is based on the Spoken English Corpus (Knowles, Taylor and Williams 1995), a corpus of British English that at the time was not time-aligned. The MARSEC project time-aligns the prosodic annotations, the orthographic words, the grammatical tag of each word, and individual segments. The type of prosodic annotation used is the `tonetic stress mark' type of system. Several types of accent are recognised: low fall, high fall, low rise, high rise, low fall-rise, high fall-rise, low rise-fall, high rise-fall, low level, and high level. These may occur either on nuclear or on non-nuclear accented syllables. In addition, there is a distinction between major and minor tone-unit boundaries, and there is provision for marking `markedly higher' or `markedly lower' perceived pitch. The tonetic stress mark type of system has been used for many years, and has been applied to many languages apart from English (the same is not true of the ToBI system). However, no extensive attempts have yet been made to apply it in the field of speech technology.

The Spoken English Corpus comprises over fifty thousand words of broadcast British English in various styles, mostly monologues. Two transcribers prosodically annotated it in an auditory fashion, with no access to the F0 curve. They each transcribed half the corpus, but each also independently transcribed certain passages known as `overlap' passages, the purpose of which was to check inter-transcriber consistency. Analysis of the overlap passages reveals that the consistency is fairly good, certainly for major aspects such as the location of accents and the direction of pitch movement (Knowles and Alderson 1995). This result is especially encouraging in view of the fact that the transcription system used contains far more phonetic detail than the ToBI system.

The IPO approach

The phonetically-based analysis of intonation used at IPO ('t Hart, Collier and Cohen 1990) has the advantage of having proved its usefulness for more than one language, and of having been successfully applied in the field of speech synthesis (neither of these considerations applies to the ToBI system). The analysis proceeds by modelling F0 curves in terms of straight lines that have been experimentally shown to be perceptually indistinguishable from the original (`close-copy stylisations'). This type of representation is then further simplified into `standardised stylisations' in terms of a small set of available contours for a given language. Such a representation has been experimentally shown to be distinguishable from the original on close listening, yet not functionally different from it (i.e. the standardised stylisation is linguistically equivalent).
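
The close-copy stylisations themselves are established and validated perceptually by listeners rather than computed automatically; still, the geometric idea of approximating an F0 track with a few straight lines can be sketched as follows. This recursive-split approximation is only an illustration under that caveat, and the tolerance value and example contour are invented.

  # Rough sketch of the geometric idea behind a close-copy stylisation:
  # approximate an F0 track with a small number of straight-line segments.
  # This is a purely automatic recursive split, not the IPO procedure.

  def stylise(times, f0, tol=5.0):
      """Return breakpoint indices of a piecewise-linear approximation
      whose maximum deviation from the F0 samples stays below tol (Hz)."""
      def split(lo, hi):
          t0, t1, y0, y1 = times[lo], times[hi], f0[lo], f0[hi]
          worst, worst_i = 0.0, None
          for i in range(lo + 1, hi):
              # deviation of each sample from the line between the endpoints
              y_line = y0 + (y1 - y0) * (times[i] - t0) / (t1 - t0)
              d = abs(f0[i] - y_line)
              if d > worst:
                  worst, worst_i = d, i
          if worst_i is None or worst <= tol:
              return [lo, hi]
          left = split(lo, worst_i)
          right = split(worst_i, hi)
          return left[:-1] + right        # do not repeat the shared breakpoint

      return split(0, len(times) - 1)

  # Invented example: a rise-fall contour sampled every 10 ms
  times = [i * 0.01 for i in range(30)]
  f0 = [120 + 4 * i for i in range(15)] + [176 - 6 * i for i in range(15)]
  print(stylise(times, f0))   # [0, 14, 15, 29] -> three line segments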

In the case of Dutch, there are ten basic pitch movements (the model has also been applied to British English, German and Russian). These comprise five falls and five rises, varying along the parameters of syllable position, rate of pitch change, and size of pitch excursion. These ten pitch movements are grouped into `pitch configurations' (of one or two pitch movements each). The pitch configurations are classified into prefixes, roots and suffixes. Sequences of pitch configurations are grouped into valid `pitch contours', which in turn are grouped into melodic families or `intonation patterns' (of which there are six in Dutch). These groupings are experimentally verified by listeners. The units of this analysis, at all levels, are based on speech corpora of spontaneous and semi-spontaneous utterances in Dutch. In contrast to the ToBI and MARSEC systems, comparatively little effort has been put into checking inter-transcriber consistency, possibly because the detection and labelling of this kind of phonetic unit is less problematic.

Provisional recommendations

It is reasonable to assume nowadays that a prosodic transcriber will have access to at least the waveform and the F0 curve for the speech to be transcribed. In that case, the recommendation is to use either the ToBI or the IPO system (and the MARSEC system if a purely auditory transcription is being carried out). If the language to be transcribed is not English, and especially if the projected application of the prosodic transcription is in the field of speech technology, then it is probably best to use the IPO system if possible (i.e. if the basic `grammar' of contours has already been researched for that language). However, these can only be provisional recommendations, as little work has been carried out on prosodic labelling in comparison with the great effort that has been expended on segmental labelling. In this situation, it may be that a different system entirely will prove more appropriate to the given language, and it is not possible to make absolute recommendations.


Annotation conventions

In addition to the symbolic representation of words (lexical transcriptions ), annotations  can be used to indicate events that occur during speaking. These events are listed below. How to represent them depends on the system one wants to use. A system like the one proposed by TEI  is very elaborate, and therefore demands quite a lot of a transcriber. Such an advanced system supplies the means to retrieve all possible information from the transcription ; TEI , for example, makes use of SGML, which means that existing SGML software can be used. A transcriber, however, has to learn a great deal, and this increases the chance that mistakes are made. Other systems, like the ones used in ATIS and Switchboard, are perhaps less advanced, but they are relatively easy for transcribers to learn. One must keep in mind that listening for all the events that occur during speaking is already a difficult task, and a large number of conventions makes transcribing even more difficult. The conventions used in ATIS, Switchboard, POLYPHONE  and the Groningen Corpus consist of different types of brackets with or without additional information. Retrieval software referring to these specific notations must be designed in a more or less ad hoc way, which is perhaps less convenient than the TEI  system. However, it is possible to provide a speech corpus with standard scripts. One must seek a balance between an advanced annotation  system and convenience for transcribers.
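
As an illustration of the kind of ad hoc retrieval that bracket conventions call for, here is a minimal Python sketch; the bracket syntax in the example line is invented and does not reproduce the actual ATIS, Switchboard or POLYPHONE conventions.

  # Ad hoc retrieval from a bracket-style transcription.  The bracket
  # syntax ([noise ...], (( )) etc.) is invented for illustration.

  import re

  line = "can you give me [noise phone_ringing] the price uh the (( )) please"

  # bracketed annotations such as [noise phone_ringing]
  annotations = re.findall(r"\[([a-z]+) ([^\]]*)\]", line)

  # the bare word string, with annotations and unintelligible stretches removed
  words = re.sub(r"\[[^\]]*\]|\(\( \)\)", " ", line).split()

  print(annotations)   # [('noise', 'phone_ringing')]
  print(words)         # ['can', 'you', 'give', 'me', 'the', 'price', 'uh', 'the', 'please']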

Notation conventions used in discourse analysis, such as those for speaking turns  and initiatives, are not discussed here. As far as we can see now, conventions can always be added to indicate this type of information.

In the following, some occurrences that must be annotated are discussed.

Deletions (read text)

Words of the written texts which are not said by the speaker must be indicated. For spontaneous speech  it is very difficult to know whether a speaker deletes  words which he actually wanted to say. Therefore, this type of deletion  is usually not indicated when spontaneous speech  is concerned.

Verbal deletions or corrections (implicitly or explicitly)

Words that are verbally deleted  by the speaker must be indicated. Verbal deletions  are words that are actually uttered by the speaker but which, according to the transcriber, are superseded by subsequent speech. The correction can be explicit, e.g. Can you give me some information about the price, I mean, the place where I can find..., or implicit, e.g. Can you give me some information about the price, place where I can find.... Verbal deletions  can be indicated in read as well as spontaneous speech .

Word fragments

Word fragments must also be indicated. Word fragments may be single sounds or larger parts of a word. In ATIS, for example, word fragments are indicated by means of a hyphen (please show fli- flights from dallas).

Unintelligible words

Words which cannot be understood must be indicated. Sometimes only part of a word is not understood. In this case the part that is understood can be written down. If a word is completely unintelligible this must be represented, e.g. by putting two round brackets before and after a space (( )).

Hesitations (filled pauses)

Filled pauses like uh and mm must be indicated. Some annotation  conventions (e.g. POLYPHONE  and Switchboard) indicate only one or two types of filled pauses (uh, or uh and mm). Other systems (e.g. ATIS and Speech Styles) allow more types of these sounds (uh, um, er, ah, mm). The types of filled pauses are not the same for all languages; for example, the English er is not used in Dutch. It is recommended to use at least two types, one vowel type (uh) and one nasal  type (mm).

Non-speech acoustic events

Here a distinction can be made between non-speech acoustic events produced by the speaker and non-speech acoustic events produced in another way. Smacks, grunts, laughter, loud breathing and coughing produced by the speaker belong to the first category. Noises  from doors, phones ringing, dogs barking, and all kinds of noise  produced by other speakers belong to the second category. The Switchboard corpus uses a very extensive, precise list of non-speech acoustic events, ranging from bird squawk to groaning, and from dishes to yawning. It is recommended that non-speech acoustic events are marked at the correct location in a transcribed utterance. This can be done by first transcribing the words and then indicating during which words the acoustic events occur.
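
One minimal way of tying events to word positions is sketched below in Python; the word list, event labels and spans are invented for illustration and do not follow any particular corpus convention.

  # Words are transcribed first; each non-speech event is then tied to a
  # span of word positions (first and last word index it overlaps).

  words = ["please", "show", "fli-", "flights", "from", "dallas"]
  events = [("laughter", 3, 4), ("door_slam", 5, 5)]

  def render(words, events):
      """Return a plain-text line with each event shown after the last word it overlaps."""
      out = []
      for i, w in enumerate(words):
          out.append(w)
          for label, first, last in events:
              if last == i:                 # event ends at this word
                  out.append("[%s over words %d-%d]" % (label, first, last))
      return " ".join(out)

  print(render(words, events))
  # please show fli- flights from [laughter over words 3-4] dallas [door_slam over words 5-5]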

Recommendations

For the transcription  of dialogues  between more than two speakers, use a ``music score'' notation, in which each speaker has a separate, time-aligned tier.
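
A minimal Python sketch of what such a layout might look like, with one tier per speaker and simultaneous speech showing up as overlapping time intervals; the speakers, times and words are invented.

  # "Music score" layout: one tier per speaker, turns as (start, end, words).
  # Times are in seconds and invented for illustration.

  score = {
      "A": [(0.0, 2.1, "well I think we should start"),
            (4.0, 5.2, "yes exactly")],
      "B": [(1.8, 4.3, "mm I was going to say the same")],
      "C": [(4.1, 4.6, "uh-huh")],
  }

  def overlapping_turns(score):
      """Return pairs of turns from different speakers that overlap in time."""
      turns = [(spk, start, end, words)
               for spk, tier in score.items()
               for (start, end, words) in tier]
      return [(a, b)
              for i, a in enumerate(turns)
              for b in turns[i + 1:]
              if a[0] != b[0] and a[1] < b[2] and b[1] < a[2]]

  for a, b in overlapping_turns(score):
      print("%s overlaps %s" % (a[0], b[0]))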

For orthographic transcriptions, use the standard spelling as much as possible.

Indicate reduced word forms in orthographic transcriptions  a) if these forms occur frequently and b) if they involve syllable deletion.

It is recommended to include a list of all non-standard spellings used in the orthographic transcription in the documentation of a speech corpus.

When orthographic transcription  is used in a corpus, it is recommended that a list of unique words and word forms is generated on the basis of the transcription . The graphemic forms of the words can then be converted to phonemes  by means of computerised grapheme-to-phoneme  conversion. The result is a list of citation  forms, also called canonical  forms, which indicate the pronunciation of words when spoken in isolation. Of course, the great variability in pronunciation cannot be indicated, but this procedure at least gives the standard pronunciation. This is especially relevant if a corpus will be used by people outside the language community concerned. On the basis of the canonical forms, phonetic transcriptions can be made semi-automatically using large-vocabulary speech recognisers; at least the attempts made so far are promising.
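
A minimal Python sketch of this procedure follows; the grapheme_to_phoneme function and its toy lexicon stand in for whatever grapheme-to-phoneme converter is available for the language, and the example sentences and phoneme symbols are invented.

  # Derive a unique word list from orthographic transcriptions and look up
  # (or generate) canonical phonemic forms.  TOY_LEXICON and the SAMPA-like
  # symbols are invented placeholders for a real converter.

  from collections import Counter

  transcriptions = [
      "please show flights from dallas",
      "show me the flights to dallas please",
  ]

  word_counts = Counter(w for line in transcriptions for w in line.split())
  unique_words = sorted(word_counts)

  TOY_LEXICON = {"please": "p l i: z", "show": "S @U", "flights": "f l aI t s"}

  def grapheme_to_phoneme(word):
      """Stand-in for a real grapheme-to-phoneme converter."""
      return TOY_LEXICON.get(word, "<not in toy lexicon>")

  canonical_forms = {w: grapheme_to_phoneme(w) for w in unique_words}
  for w in unique_words:
      print(w, canonical_forms[w], word_counts[w])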

Unless there are clear reasons to do so, do not transcribe a corpus phonetically, since the time spent on this will never be paid back. If very specific phonetic details are needed, it is recommended to look for these on the basis of orthographic  and/or phonemic  transcriptions .

It is recommended that transcribers give information about the process of transcribing and about the speech they transcribed. The speech of some speakers will be easier to transcribe than that of others, depending on the speech rate, the neatness of articulation, the number of hesitations, and the number of dialect words used. Some information about the difficulty of the transcription  is very useful for later queries. The transcribers of the Switchboard (telephone) Corpus had to rate the following characteristics of a conversation on a scale from 1 to 5: difficulty, topicality, naturalness , echo from B (in listening to A separately, B could hardly be heard (1) or was nearly as loud as A (5)), echo from A, static on A (no static noise  (1) or a great deal of it (5)), static on B, background A, and background B.
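
As a sketch of how such per-conversation ratings might be stored alongside a transcription, here is a small Python structure; the field names are paraphrases of the characteristics listed above, not Switchboard's own file format, and the example values are invented.

  # Per-conversation ratings, each on a 1-5 scale.

  from dataclasses import dataclass

  @dataclass
  class ConversationRatings:
      difficulty: int
      topicality: int
      naturalness: int
      echo_from_B: int    # 1 = B hardly audible on A's channel, 5 = nearly as loud as A
      echo_from_A: int
      static_on_A: int    # 1 = no static noise, 5 = a great deal of it
      static_on_B: int
      background_A: int
      background_B: int

  ratings = ConversationRatings(difficulty=2, topicality=3, naturalness=4,
                                echo_from_B=1, echo_from_A=1,
                                static_on_A=2, static_on_B=1,
                                background_A=3, background_B=2)
  print(ratings)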

In the case of transcriptions  at more than one level (e.g. orthographic transcription  with some prosodic  marks and indications of hesitations, etc.), it is recommended to listen for one level at a time. In everyday life, listeners are used to skipping hesitations, false starts, and other imperfections, and they do not pay explicit attention to prosody . Transcribers must learn to hear all these events. It seems easiest to listen to the words first and transcribe these, and then to place the prosodic  marks and other annotations .

For orthographic transcriptions  it is not necessary to find experienced transcribers. However, for phonemic  and phonetic  transcriptions  it is recommended to look for transcribers who are used to listening to speech in a very precise way.

To give some indication of the time needed to transcribe speech, here are some examples. The time necessary to make an orthographic transcription  of spontaneous speech  is about ten times the duration  of the speech itself. The time necessary for an orthographic transcription  of read sentences is about three times the duration  of the speech, and for an orthographic transcription  of read texts it is about five times the duration  of the speech.
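
A quick worked example of these rules of thumb in Python; the one hour of speech is an invented figure.

  # Estimated transcription time for one hour of speech of each type,
  # using the factors given above.

  factors = {"spontaneous speech": 10, "read sentences": 3, "read texts": 5}
  speech_hours = 1.0

  for style, factor in factors.items():
      print("%s: roughly %.0f hours of transcription time" % (style, factor * speech_hours))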

Checking of transcriptions  is always necessary, and it can be done in different ways. An independent transcriber can transcribe the whole corpus or a sample of it. Another possibility is to let someone check the transcription  by reading it while listening to the speech; this is less time-consuming. For the latter procedure, it is recommended to check the transcription  in the reverse order from that in which the first transcriber worked, since towards the end the first transcriber will have been more consistent than at the beginning. Differences can occur in the conventions used (spelling and annotation  conventions, brackets, etc.), as well as in what is heard by the two different persons.


