
SAMPA

Introduction

SAMPA (Speech Assessment Methods Phonetic Alphabet) is a machine-readable phonetic alphabet originally developed under the ESPRIT project 1541 (SAM) in 1987--89 by an international group of phoneticians and applied in the first instance to Danish, Dutch, English, French, German and Italian (SAM 1988, 1989). It has since been extended to other languages, including Norwegian, Swedish, Spanish, Portuguese and Greek.

Section A.1.2 covers the present status of SAMPA, and addresses the languages Danish, Dutch, English, French, German, Italian, Norwegian, Portuguese, Spanish, Swedish and Greek. Section A.1.3 discusses additional levels of annotation. A.1.4 addresses important issues to be considered in the relationship of SAMPA to other computer-coded phonetic transcription systems in use in the world. The IPA convention in Kiel, August 19--21, 1989, presented an opportunity to assess the situation.

Notation issues

As with any phonetic notation system, those who developed and applied SAMPA had to make decisions on issues of two types: transcription and coding (see discussion in Wells 1987). The first involves the selection of an appropriate phonetic symbol set; the second involves the allocation of an ASCII number to each symbol that we need, and therefore of a corresponding visual symbol chosen from the character set available on computers.

Transcription

Transcription involves many issues of principle that phoneticians and linguists have debated for decades. These issues may be new, though, to many engineers and speech technologists. Among such issues are (i) whether the notation should be phonemic, or to some extent allophonic; and if phonemic, how the phoneme set is to be established; (ii) to what extent phonetic symbols should be required to have the same meaning across different languages; and (iii) the relation between the basic, lexical pronunciation of a word and its actual pronunciation in context.

In principle, SAMPA provides for phonemic notation of languages. For example, the r-sounds of English rip, trip, and drip are all instances of the phoneme /r/, although they differ articulatorily and acoustically (in voicing and in presence/absence of friction). These different allophones are predictable from the phonetic context: we can unambiguously write them all as /r/. The arguments for preferring phonemic notation to allophonic are (i) it is simpler while still being unambiguous; (ii) correct identification of allophones may be difficult for those without phonetic training; and (iii) too few codes are available in the range 32--127 to provide for all allophones.

In syllable-initial position, English /t/ is alveolar and aspirated; French /t/, dental and unaspirated; Swedish /t/, dental and aspirated. We ignore these comparative differences in our notation, writing all as /t/. SAMPA does not need to adopt distinct symbols to reflect these differences. (However, if and when SAMPA is applied to Hindi, for example, where these differences are phonemic, it would become necessary to notate them explicitly.)

In continuous speech the actual sounds used in pronouncing a word may well differ from the word's citation form (dictionary entry). A phonotypical transcription is one in which citation forms are modified in accordance with known phonetic rules of connected speech. For example, in a phonotypical transcription of English, final linking /r/ would be shown before a following vowel (better ask) but not before a consonant (better go); the lexical entry would be invariant. In an actual utterance the speaker might or might not conform to phonotypical expectations; an impressionistic transcription reflects a human (or mechanical) auditory or acoustic analysis of what was actually said. In the case at issue, /r/ would be shown if phonetically present in a given instance, not otherwise.

In practice, colleagues working on the various languages to which SAMPA has been applied have chosen to deviate in various respects from these principles. English has plosive /d/ and fricative /ð/ (SAMPA /D/) as distinct phonemes (den, then). In Spanish, they are undoubtedly allophones of the same phoneme, and could unambiguously both be written /d/; but for speech technology work our Spanish colleagues prefer to notate them distinctively, as ``d'' and ``D'' respectively. The r-sounds in French rouge, lettre are different from all the English r-sounds, being respectively a voiced and voiceless uvular fricative. It would seem unambiguous and logical to write them, too, as /r/. But our French colleagues have preferred to use the distinct uvular-r symbol, also provided in SAMPA, namely /R/.

Nevertheless I believe we should as far as possible discourage allophonic and comparative notation. Bulgarian has a simple 6-vowel system, IPA /i e a o u ə/. A colleague in Bulgaria has proposed that they be represented in SAMPA as /I, E, a, O, U, @/. About /a/ and /@/ (= IPA /ə/) we can agree. But the other symbols he proposes are inappropriately comparative. The Bulgarian vowels should appear in SAMPA as /i, e, a, o, u, @/.

Coding

SAMPA's coding principles involve restricting the available ASCII codes to the range 32--127. At the time SAMPA was formulated, many computers used only the 7-bit ASCII character set. With the spread of PCs and compatibles, the ``extended ASCII'' (8-bit) set has become familiar, allowing codes in the range 128--255. Has the decision to restrict SAMPA to the range 32--127 proved wise? Or should we now relax it?

In the (American) English extended ASCII character set used by PCs running MS-DOS, the range 128--255 is used to provide for the screen and printer a number of accented alphabetic letters, currency symbols, graphic symbols, and Greek and mathematical symbols. Those that are not available on the keyboard can be accessed by entering their ASCII number on the keypad while pressing the Alt key. Unfortunately, from the point of view of non-English-speaking Europeans, this extended ASCII fails to provide all the accented Latin letters needed for such languages as Portuguese, Icelandic, Czech, Polish and Hungarian. To remedy this shortcoming, a number of different ``code pages'' are now available, each providing a different set of characters in the 128--255 range. In the USA and the UK most PCs use code page 437 (International English), in Western Europe 850 (Multilingual Latin I), and in much of Eastern Europe 852 (Slavic Latin II).

Applications running under the popular front-end Windows use yet another character set, one known as ``enhanced ANSI''. This is identical with the ASCII set for 33--127; for 128--255 it offers its own specific choice of accented alphabetic and other characters, with codes different from ASCII.

The consequence is that in PC-compatible computing the code numbers in the range 128--255 (the ``extended'' characters) may currently have several different interpretations. Conversely, a given character may be coded in several different ways.

Consider the IPA symbols /æ/ and /ð/, both needed for the phonetic transcription of English. For reasons that seemed valid at the time (cf. Wells (1987: 95)), SAMPA assigned the former the code 123, which now appears on all Latin-alphabet PC screens as ``{''; the latter was coded 68, ``D''.
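
The two assignments just described can be checked mechanically. The following is a minimal sketch in Python, not part of the SAMPA specification: the names `SAMPA_TO_IPA` and `is_seven_bit_sampa` are hypothetical, and the table holds only the two symbols discussed here.

```python
# Illustrative only: the two SAMPA assignments discussed in the text.
SAMPA_TO_IPA = {
    "{": "\u00e6",  # ash (ae ligature), SAMPA code 123
    "D": "\u00f0",  # eth, SAMPA code 68
}


def is_seven_bit_sampa(symbol: str) -> bool:
    """Return True if every character of `symbol` lies in the range 32-127."""
    return all(32 <= ord(ch) <= 127 for ch in symbol)


for sampa, ipa in SAMPA_TO_IPA.items():
    # The SAMPA character passes the range check; the IPA character does not.
    print(sampa, ord(sampa), ipa, is_seven_bit_sampa(sampa), is_seven_bit_sampa(ipa))
```

The check makes the trade-off concrete: the SAMPA characters survive on any 7-bit system, while the IPA characters they stand for fall outside the range 32--127.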

Both ``æ'' and ``ð'' are now available on-screen for PCs running Windows. While ``æ'' is an ASCII character, with the extended code 145 (for those using code page 437 or 850), ``ð'' is not. But both are in the enhanced ANSI set, with codes 230 and 240 respectively. (Hence under Windows they can be accessed, if not on the keyboard, by keying Alt+0230 and Alt+0240; ``æ'' can also be accessed as Alt+145.)

However, a PC using code page 852 (Slavic) will display code 145 as an upper-case Polish L-with-acute-accent (Ĺ), 230 as ``Š'', and 240 as a dieresis (¨). With code page 860 (Portugal), 145 is ``À'', 230 ``µ'' and 240 ``≡''.
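
The ambiguity of the codes above 127 can be reproduced with the code-page tables shipped with Python's standard codecs. This is a sketch only: the helper name `render` is hypothetical, and treating `cp1252` as a stand-in for the Windows ``enhanced ANSI'' set is an assumption. The tables in a current Python release may also differ in detail from the PC screens of the period.

```python
# Codes discussed in the text, and codecs for the code pages it mentions.
CODES = (145, 230, 240)
CODE_PAGES = ("cp437", "cp850", "cp852", "cp860", "cp1252")  # cp1252: assumed "ANSI"


def render(code: int, code_page: str) -> str:
    """Decode a single byte value under the given code page."""
    return bytes([code]).decode(code_page, errors="replace")


for page in CODE_PAGES:
    # One row per code page: the same three byte values, different characters.
    print(page, {code: render(code, page) for code in CODES})
```

Running the loop prints one row per code page, which makes it easy to see that a single byte value maps to several different characters and that a single character is reachable under several different codes.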

Recently a number of phonetic fonts have become available for use under Windows. These comprise only phonetic symbols (perhaps with a few punctuation signs). Unfortunately they disagree extensively on key assignment and coding. On my PC I now have three TrueType phonetic fonts provided by the Summer Institute of Linguistics and four others of whose origins, I regret to say, I am uncertain. These fonts agree with SAMPA (but not ANSI) in assigning ``ð'' to code 68/D; but for ``æ'' they assign codes and keystrokes 81/Q (SIL Doulos/Manuscript/Sophia IPA), 60/< (Times IPA New), 64/@ (Tech Phonetic), and 233 (IPA Roman 1, IPA Plus).

Further languages

A number of other EC languages have been examined in the light of the SAMPA recommendations, and a short summary of the possible solutions for their special features is given here. For more details, see J. Wells, ``Computer-coded phonetic transcription'', Journal of the International Phonetic Association 17, No. 2, pp. 94--114, and the SAM Definition Phase Final Report (ESPRIT project 1541), January 1988.

Most of the minority languages of Europe such as Basque, Breton, Catalan, and Frisian can be transcribed adequately at a phonemic level without the need to change the principles of the present recommendation. Irish and Scottish Gaelic, however, require a decision for coding the palatalised (or ``slender'') consonants and the ``double'' nasals and laterals. Scottish Gaelic also has a back unrounded vowel series which does not occur in other EC languages. Welsh requires a solution for the voiceless alveolar lateral, represented in the orthography as ``<ll>''.

We should like now to explore whether it would be suitable to extend SAMPA for application to other languages, including Chinese, and if so how.

The question of Chinese has arisen because of the prospect of a wider collaboration on speech research between University College London and the Chinese Academy of Sciences.

Chinese already has what appears to be a satisfactory machine-readable phonetic notation in the form of Pinyin, the romanisation that has for some years been standard in the People's Republic (though not in Taiwan). Pinyin is an ingenious quasi-phonemic notation. It includes a number of unconventional digraphs, together with unconventional uses of individual Latin letters. Thus sh, ch, and zh represent retroflex/postalveolar consonants of a type that would normally be written in SAMPA as [S, tS, dZ]. Pinyin x, q, j represent a corresponding series of alveolopalatal consonants, IPA [ɕ, tɕ, dʑ], for which SAMPA does not currently cater. Pinyin c represents [ts], y [j], and ng [ŋ]. The close front rounded vowel [y] is written u where there would be no confusion with [u], but ü where this confusion might arise. (This last Pinyin character is not actually machine-readable in our sense.)

Continuing to use Pinyin for Chinese but SAMPA for other languages would mean that characters such as ``x, j'' would have different meanings in different languages (``x'' = alveolopalatal fricative, or velar fricative; ``j'' = alveolopalatal affricate, or palatal approximant). But this is perhaps no worse than the ``comparative'' differences already present in the interpretation of some symbols (see above). The Pinyin notation ``i'' already covers a remarkable range of allophonic possibilities (including an r-coloured back vowel in shi and a slightly fricative central vowel in si). Are Chinese speech technologists happy with this degree of phonemic abstraction?

Tone is shown in Pinyin (if indeed it is shown) by superscript accent marks, thus mā, má, mǎ, mà. These are not machine-readable in the SAMPA sense. The corresponding SAMPA tone-marks would be /''ma, 'ma, ` 'ma, ` ma/. However, these SAMPA signs have not proved popular, and perhaps ought to be changed. For Chinese, we could perhaps consider instead the use of numerals, thus ``ma1, ma2, ma3, ma4''.
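
The numeral proposal is straightforward to mechanise. The sketch below is one possible conversion from accented Pinyin to the numbered form, assuming the four tones are written with the macron, acute, caron and grave accents respectively. The function name `pinyin_to_numbered` is hypothetical, and the letter ü is left untouched, so the output is not guaranteed to be pure 7-bit text.

```python
import unicodedata

# Combining marks assumed to encode the four Pinyin tones.
TONE_MARKS = {
    "\u0304": "1",  # macron
    "\u0301": "2",  # acute
    "\u030c": "3",  # caron
    "\u0300": "4",  # grave
}


def pinyin_to_numbered(syllable: str) -> str:
    """Replace a Pinyin tone accent with a trailing tone numeral."""
    tone = ""
    kept = []
    # NFD splits each accented letter into base letter + combining marks.
    for ch in unicodedata.normalize("NFD", syllable):
        if ch in TONE_MARKS:
            tone = TONE_MARKS[ch]
        else:
            kept.append(ch)
    # NFC recombines anything that is not a tone mark (e.g. the dieresis of u).
    return unicodedata.normalize("NFC", "".join(kept)) + tone


print([pinyin_to_numbered(s) for s in ("m\u0101", "m\u00e1", "m\u01ce", "m\u00e0")])
```

A syllable written without a tone accent is returned unchanged, which matches the remark that tone is not always shown in Pinyin.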

SAMPA: Present status

The following table presents the system agreed among the representatives of eight European countries engaged in European collaborative speech technology assessment research (SAM). It is currently being tested in the transcription and labelling of European multi-lingual databases.

SAMPA computer readable phoneme alphabet for European languages, with ASCII and IPA definitions (1990):

Table 1: Consonants

Table 2: Boundary and prosodic features

Table 3: Vowels

Table 4: Two character symbols

Table 5: Currently under discussion

Table 6: Currently used in French work

The phonemic notation of individual languages

This section provides a brief outline of the phonemic distinctions in the languages of the eight countries engaged in the initial phase of the SAM project by providing example words for the use of each phonemic symbol. Information is also provided for Spanish, Portuguese and Greek, which were considered additionally.

Danish

Consonants

The plosives are /p, b, t, d, k, g/:

The fricatives are /f, s/:

The approximants are /v, D, j, h/:

The nasals are /m, n, N/:

The liquids are /l, R/:

Stød is symbolised by ``?'' and may be found in syllables containing a long stressed vowel, or a short stressed vowel, or a short stressed vowel followed by a voiced consonant, e.g. pæu --- /pE:?u/, peu --- /pEu?/

Vowels

The vowel system chosen for broad phonetic transcription is: /i, e, E, a, A, y, 2, 9, u, o, O, @/, with all vowels except @ occurring with a length distinction: /i:, e:, E:, a:, A:, y:, 2:, 9:, u:, o:, O:/.

The unrounded front vowels are exemplified in the following:

The central vowels are:

The rounded front vowels are:

The back vowels are:

Diphthongs.

The falling diphthongs may be most economically analysed phonemically as vowel plus /j/, /v/, or /r/, but for the broad phonetic representation within SAMPA they are analysed as vowel plus /i/, /u/ or /Q/, for example:

Dutch

Consonants

The plosives are /p, b, t, d, k/, (/g/):

The fricatives are /f, v, s, z, x, h/, (/G/):

The sonorants (nasals, liquids and glides) are /m, n, N, l, r, w, j/:

Vowels

The Dutch vowels fall into two classes, ``checked'' (not occurring in a stressed syllable without a following consonant) and ``free''.

The checked vowels are /I, E, A, O, Y, @/:

The free vowels comprise four monophthongs /i, y, u, a:/, three ``potential diphthongs'' /e:, 2:, o:/, and three ``essential diphthongs'', /Ei, 9y, Au/, exemplified as follows:

There are also six vowel sequences which are sometimes described as diphthongs:

Several marginal vowel phonemes are only found in loanwords:

English

Consonants

There are six plosives /p, b, t, d, k, g/:

There are two phonemic affricates /tS/ and /dZ/:

There are nine fricatives /f, v, T, D, s, z, S, Z, h/:

The sonorants are three nasals /m, n, N/, two liquids /r, l/ and two sonorant glides /w, j/:

Vowels

The English vowels fall into two classes, traditionally known as ``short'' and ``long'' but better described as ``checked'' (not occurring in a stressed syllable without a following consonant) and ``free''.

The checked vowels are /I, e, {, Q, V, U/:

There is a short central vowel, normally unstressed:

The free vowels comprise monophthongs and diphthongs, although no hard and fast line can be drawn between these categories. They can be placed in three groups according to their final quality /i:, eI, aI, OI/, /u:, @U, aU/, /3:, A:, O:, I@, e@, U@/. They are exemplified as follows:

The vowels /i:/ and /u:/ in unstressed syllables vary in their pronunciation between a close [i] and a more open [I] (and likewise between a close [u] and a more open [U]). Therefore, it is suggested that /i/ and /u/ be used as indeterminacy symbols.

French

Consonants

There are six plosives /p, b, t, d, k, g/:

There are seven fricatives /f, v, s, z, S, Z, j/. /j/ can be realised as a fricative or an approximant.

There are four nasals /m, n, J, N/, the last of which is only found in loanwords:

There are two liquids /l, R/ and two vowel glides /w, H/ (besides /j/):

Vowels

The vowel system comprises 12 oral vowels /i, e, E, a, A, O, o, u, y, 2, 9, @/, and 4 nasal vowels /e~, a~, o~, 9~/, exemplified as follows:

When they are functional, the load of the oppositions /a/--/A/, /e~/--/9~/, /e/--/E/, /o/--/O/, /2/--/9/ may be very low for certain speakers, and there is a tendency towards neutralisation. When they are not functional there is a strong tendency in unstressed syllables towards indetermination. ``Indeterminacy'' symbols have been agreed to cover occurrences of these phonemes or sounds.

German

Consonants

There are six plosives /p, b, t, d, k, g/:

There are four phonemic affricates, /pf, ts, tS/ and /dZ/, the last of which occurs only in a few loanwords:

There are ten fricatives /f, v, s, z, S, Z, C, j, x, h/. /j/ is often realised as a vowel glide.

The sonorants are three nasals /m, n, N/, and two liquids /l, r/:

Orthographic <r> is realised phonetically in a number of different ways:

  1. As a dorso-uvular consonant --- a voiced or voiceless fricative, approximant, trill or flap. This should be represented as `R'.
  2. As an apico-alveolar consonant --- a trill, tap, or flap. This may be represented as `r' (e.g. <rein> --- raIn).
  3. As a vowel post-vocalically. This may be represented as `6' (see below).

Vowels

The vowels fall into three groups, ``checked'' (short), ``free'' (long), and two short vowels that only occur in unstressed position. The checked vowels are /I, E, a, O, U, Y, 9/:

There are 8 pure free vowels /i:, e:, E:, a:, o:, u:, y:, 2:/ and three free diphthongs /aI, aU, OY/:

The unstressed ``schwa'' vowel is:

The vowel realisation of <r>, represented as 6, fuses with schwa, but it also follows stressed vowels, resulting in additional centring diphthongs:

Greek

Consonants:

plosives

affricates

fricatives

nasals

liquids

semivowel

(palatals)

connected speech phenomena

Italian

Consonants

There are six single and six geminate plosives /p, b, t, d, k, g/, /pp, bb, tt, dd, kk, gg/ as follows:

There are four single and four geminate affricates /ts, dz, tS, dZ/, /tts, ddz, ttS, ddZ/:

There are five single and four geminate fricatives /f, v, s, z, S/, /ff, vv, ss, SS/:

There are three single and three geminate nasals /m, n, J/, /mm, nn, JJ/, three single and three geminate liquids /r, l, L/, /rr, ll, LL/ and two semi-vowels /j, w/:

Vowels

The vowel system comprises seven vowels /i, e, E, a, O, o, u/:

Norwegian

There are six plosives:

There are six fricatives:

There are five sonorant consonants (nasals, liquids, trills):

Vowels

There are 9 long vowels:

and nine short vowels:

There are seven diphthongs:

In addition there are important allophonic variants for which the transcription has been agreed:

In cases where the dental consonants do not change into retroflexes, they are transcribed using the separator sign (ASCII 45), e.g. /r-t/, /r-d/:

Portuguese

Consonants:

plosives

fricatives

nasals

liquids

Vowels and diphthongs

Spanish

Consonants:

plosives

affricates

fricatives

nasals

liquids

semivowels

Vowels:

Swedish

Consonants

There are six plosives:

There are six fricatives:

There are six sonorant consonants (nasals, liquids and semi-vowels):

Vowels

There are nine long and nine short vowels.

Long vowels (followed by a short consonant):

Short vowels (followed by a long consonant):

There are also two pre-r-allophones (long and short) of /E/ and /2/ (see below).

The following important allophonic variants occur in Swedish which require separate symbolic representation:

Levels of annotation and extension of SAMPA

SAMPA as a phonemic system

The present SAMPA system, which was provisionally agreed at the end of the Extension Phase, is defined as a system for phonemic transcription and annotation. This means that the symbols are used according to the analysis of distinctive sound oppositions within each language. Thus, although their relation to sound category symbols of the International Phonetic Alphabet (IPA) is given, they are symbols of intra-language convention, and do not have an exact language-independent phonetic (auditory or acoustic) equivalence, nor do they represent a single sound within a language.

For example, the symbol /t/, used in the transcription of all 8 partner languages, could represent an unaspirated sound in French or Italian, a strongly aspirated sound in German or English, and an affricated sound in Danish. In English the /t/ can also stand for an unaspirated sound (following /s/) or the more usual aspirated sound. Vowel symbols often represent widely diverging sounds from one language to another; /{/ in Danish is very different from /{/ in English, for example.

This basically phonemic, or sound-system-orientated (systematic) function of SAMPA means that a general extension of the SAMPA coding system to allow fine phonetic differentiation of speech sounds is not possible. There are, however, examples in the SAMPA list of symbols which can be used to represent non-distinctive differences within a language, e.g. `r' and `R' for regionally dependent free variants, and some important allophonic variants are allowed for (e.g. in Swedish and Norwegian). Also, auditory transcription (French ``notation'') is meant to be a ``broad phonetic'' representation of the actual utterance, including elisions and assimilations (insofar as these can be represented with the phonemically orientated SAMPA inventory) rather than the strictly phonemic string of the citation form.

One area in which an extension of SAMPA is possible, indeed probable, is prosody. Certain ``Boundary and Prosodic Features'' have been agreed preliminarily, but their use has only been illustrated in the English EUROM.0 transcriptions. The considerations of prosodic description in a multi-lingual context may well reveal the need to modify and extend SAMPA. The work on prosodic description may also conclude that a separate prosodic annotation tier is necessary.

Detailed phonetic or acoustic annotation

For finer segmental annotation of speech recordings, three basically different approaches are offered for discussion. All three approaches require a separate annotation tier, but the labels are temporally defined by the location of the phonemic segment boundaries (phonemic markers in the case of centre labelling).

  1. The SAMPA symbols are given language-independent sound values (IPA equivalent values) and modified by means of agreed diacritic codes to reflect fine phonetic detail.

    Advantages: No new segmentation or marker placements would be required.

    Disadvantages:

    1. Different symbols would sometimes be required at the phonemic and phonetic levels, particularly for vowels. E.g. Danish /{/ might have to be represented by phonetic [E]; English /{/ might have to be represented by phonetic [a] or even [A], depending on regional accent.

    2. Diacritic symbols would have to be agreed for all partner languages.

    3. ASCII coding on one keyboard would possibly not be sufficient for the necessary IPA symbols and diacritics, and there would be little or no mnemonic value in the choice of many symbols.

  2. The SAMPA phonemic values are retained for each language, and the phonemic segment is subdivided into acoustically quasi-homogeneous elements. E.g. /k/ may contain a partially voiced closure, a clear burst, and a period of aspiration prior to the vowel onset. Note that this approach is an acoustic-event labelling and is used in a similar way at CERFIA, IES and UCL. The following characterisation retains the primary symbol as ``pointer'' to the phonemic identity of the utterance:

    Advantages: The acoustic realisation of each phonemic segment is defined in greater detail than is possible even in narrow phonetic transcription, where, for example, a partially voiced closure cannot be easily represented.

    Disadvantages:

    1. New segment markers have to be set.
    2. The system can only apply to approaches that recognise the need to define segment boundaries (however arbitrary they may be theoretically).

    Note: It must be pointed out that the two-symbol representation given above is redundant, in that the acoustic-event categories are common to phoneme classes rather than individual phonemes; i.e. pc, tc, and kc would all be a period of voiceless closure and therefore not require the place specification. Also, if the phonemic category is specified in a different tier of annotation, it is recoverable, and may be used for a database search, e.g. with a view to developing a set of rules covering the possible ``internal'' structures of stretches of signal associated with a particular phoneme. At present, some partners need to retain the ``phonemic pointer'' in order to derive the phonemic label file from the lower level acoustic-event file.

  3. A third approach, favoured by the linguistic group at ICP (Grenoble) recognises transitional phases between areas marked as optimally representative of a particular phoneme category. The finer labelling requires the delimitation of the (centre-marked) optimal area, thus also delimiting the area of co-articulation.

    Advantages: The theoretically doubtful ``changeover point'' from one ``phoneme'' segment to another is avoided, and areas of indeterminacy are identified.

    Disadvantages: New markers have to be set.

    Each of these approaches would provide an annotation which is closer to the (acoustic-) phonetic realisation of the utterance than the phonemic SAMPA labels. For the development of speech knowledge in general, and for the definition of rules describing the structure of continuous speech in particular, the use of a more detailed annotation is essential. It is the symbolic bridge between measurable acoustic parameters and abstract phonological categories. Which approach is selected for more detailed annotation within the SAM project depends on the use to which it will be put. Essentially, the closer a symbolic representation comes to significant acoustic events (whereby ``significant'' is an application-dependent term), the more useful it will be in speech-knowledge acquisition and rule development. Both synthesis and recognition assessment can only gain.
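
The derivation of a phonemic label file from a lower-level acoustic-event file, mentioned in the note on the second approach, can be sketched as follows. Everything in this example is an assumption made for illustration: the tuple layout `(start, end, label)`, the times, and the event suffixes `b` (burst) and `h` (aspiration) are hypothetical. Only the closure labels of the form `kc` are taken from the text.

```python
from itertools import groupby

# Hypothetical acoustic-event tier: (start_time_s, end_time_s, label).
# A label is a phonemic pointer, optionally followed by a one-letter event code.
events = [
    (0.00, 0.06, "kc"),  # closure of /k/ (form taken from the text)
    (0.06, 0.08, "kb"),  # burst (suffix "b" is an assumption)
    (0.08, 0.12, "kh"),  # aspiration (suffix "h" is an assumption)
    (0.12, 0.25, "A"),   # vowel, not subdivided
]
EVENT_SUFFIXES = ("c", "b", "h")  # assumed event codes


def pointer(label: str) -> str:
    """Strip the event code, leaving the phonemic pointer."""
    if len(label) > 1 and label.endswith(EVENT_SUFFIXES):
        return label[:-1]
    return label


def to_phonemic(tier):
    """Merge adjacent acoustic events that share a phonemic pointer."""
    segments = []
    for phoneme, group in groupby(tier, key=lambda e: pointer(e[2])):
        group = list(group)
        # The phonemic segment spans from the first event's start to the last one's end.
        segments.append((group[0][0], group[-1][1], phoneme))
    return segments


print(to_phonemic(events))
```

Because adjacent events are merged purely on the pointer, two successive instances of the same phoneme would collapse into one segment. A real label file would need an explicit boundary convention to avoid this.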

References

SAM 1988. ESPRIT Project 1541: Definition Phase Final Report ``Multilingual Speech Input/Output Assessment Methodology and Standardisation''. London: University College London.

SAM 1989. ESPRIT Project 1541: Extension Phase Final Report. London: University College London. VI.2: First appraisal of SAMPA.

Wells, J.C., 1987. ``Computer-coded phonetic transcription'', Journal of the International Phonetic Association 17:2, pp. 94--114.




WWW Administrator
Fri May 19 11:53:36 MET DST 1995