Next: Influencing factors Up: Assessment of speaker Previous: Presentation

A taxonomy of speaker recognition systems

Any speaker classification or recognition system operates in three modes:

A training mode, during which speaker or speaker class models are built and stored in the system as reference patterns. Alternative terms: learning mode, registration, enrolment, subscription, ...
A test mode, during which the system performs the recognition of an utterance to be identified (or verified). Alternative terms: recognition mode, trial mode, operating mode, ...
An untraining mode, during which a speaker or speaker class model is removed from the list of reference patterns. Alternative terms: unlearning mode, unregistration, subscription cancellation, ...

Naturally, a given speaker cannot be identified or verified in the test mode before he has been registered during a training phase. However, training is usually incremental in several ways. Firstly, a speaker recognition system is usually operational before all possible speakers are referenced, and new users can be added while the system is already functioning in test mode for those who are registered. Secondly, some system implementations use speech material obtained during a test phase to update the training references (when the recognition decision is judged reliable enough).

Unlike training and test, untraining usually does not require the active participation of the speaker being unregistered.
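
The three modes can be pictured with a minimal sketch. The class name SpeakerRegistry, its methods and the representation of reference patterns as plain lists of feature vectors are all assumptions made for illustration; no particular system is described here.

```python
class SpeakerRegistry:
    """Toy store of reference patterns, keyed by speaker identity (hypothetical)."""

    def __init__(self):
        # identity -> list of feature vectors collected so far
        self.models = {}

    def train(self, identity, features):
        # Training mode: create the reference, or extend it incrementally.
        self.models.setdefault(identity, []).append(features)

    def can_be_tested(self, identity):
        # Test mode is only meaningful for speakers who are already registered.
        return identity in self.models

    def untrain(self, identity):
        # Untraining mode: no speech from the removed speaker is needed.
        self.models.pop(identity, None)


registry = SpeakerRegistry()
registry.train("alice", [0.10, 0.20])
registry.train("alice", [0.15, 0.25])  # incremental update of an existing reference
registry.train("bob", [0.90, 0.80])    # new user added while the system is running
registry.untrain("bob")
```

In this sketch, untraining only touches the stored references, which reflects the remark that the speaker being unregistered does not have to take part.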

Task typology

Speaker identification versus speaker verification

As briefly mentioned in the introduction, speaker recognition covers 2 different areas: speaker identification on the one hand, and speaker verification on the other. As Doddington describes it [Doddington 1985], the goal of a speaker identification task is to classify an unlabeled voice token as belonging to one of a set of n reference speakers, whereas the speaker verification task is to decide whether or not the unlabeled voice belongs to a specific reference speaker.

In the case of closed-set identification, speaker identification is therefore a 1 out of n decision, the result of which is an identity assignment to an applicant speaker. However, in practical applications, open-set speaker identification requires an additional outcome of rejection, corresponding to the possibility that the unlabeled speech token does not belong to any of the registered speakers. In such circumstances, the applicant speaker is called an impostor.

Speaker verification can be viewed as a particular case of open-set speaker identification, corresponding to n = 1. The speaker verification system takes a test voice sample and a claimed identity as input, and returns a binary decision: acceptance if the applicant speaker is considered to be the genuine speaker, or rejection if he is considered to be an impostor (as regards the claimed identity).

Conversely, open-set speaker identification can be understood as a step of closed-set speaker identification, followed by a step of speaker verification, the latter using the identity assigned by the former, as the claimed identity.
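
The decomposition of open-set identification into a closed-set step followed by a verification step can be sketched as follows. The function names, the use of a dictionary of similarity scores (higher meaning more similar) and the single fixed threshold are assumptions made for illustration only.

```python
def closed_set_identify(scores):
    """1 out of n decision: return the registered identity with the best score."""
    return max(scores, key=scores.get)


def verify(scores, claimed_identity, threshold):
    """Binary decision: accept if the claimed identity scores at or above the threshold."""
    return scores[claimed_identity] >= threshold


def open_set_identify(scores, threshold):
    """Closed-set identification, then verification of the assigned identity.

    Returns the assigned identity, or None when the outcome is rejection.
    """
    candidate = closed_set_identify(scores)
    return candidate if verify(scores, candidate, threshold) else None


# Hypothetical scores of one test utterance against three registered speakers.
scores = {"anna": 0.42, "ben": 0.81, "carl": 0.30}
print(open_set_identify(scores, threshold=0.5))  # identity assignment
print(open_set_identify(scores, threshold=0.9))  # rejection
```

With n = 1 registered speaker, open_set_identify reduces to a single call to verify, which matches the view of verification as a particular case of open-set identification.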

Related tasks

Beyond this major distinction between identification and verification, other related tasks can be mentioned.

Speaker matching, that is, choosing the speaker in a closed set of references who is most similar to the current speaker, even though it is known in advance that the applicant speaker is not a registered speaker. This appears to be a particular case of speaker cluster selection, where each cluster consists of one speaker only.
Speaker labeling, when the identities of the speakers taking part in a conversation are registered, and the goal is to locate when their successive interventions begin and end.
Speaker alignment, when the identity and order of the speakers taking part in a conversation are known, and the goal is to locate when each of their interventions begins and ends.
Speaker change detection, when the goal is to detect a change of speaker along a speech stream.

Types of errors

For closed-set speaker identification, a misclassification error occurs when a registered speaker is mistaken for another registered speaker (the mistaken speaker).

For speaker verification, two types of errors must be distinguished: false rejection, when a genuine speaker is rejected, and false acceptance, when an impostor is accepted as the speaker he claimed to be (the violated speaker).

For open-set identification, all three types of errors can occur. Usually, misclassifications and false acceptances are considered equally harmful, and are therefore merged together. However, these 2 types of errors may not have the same consequences in some practical applications.
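
As an illustration, the three error types in open-set identification can be told apart from the true identity of the applicant and the system's decision. The function below is a hypothetical sketch; its name, arguments and the labels used for correct outcomes are assumptions.

```python
def classify_outcome(true_identity, decision, registered):
    """Label one open-set identification trial.

    true_identity: who actually spoke
    decision: identity assigned by the system, or None for rejection
    registered: set of registered identities
    """
    if true_identity in registered:
        if decision is None:
            return "false rejection"
        if decision != true_identity:
            return "misclassification"  # 'decision' is the mistaken speaker
        return "correct identity assignment"
    # From here on, the applicant speaker is an impostor.
    if decision is None:
        return "correct rejection"
    return "false acceptance"  # 'decision' is the violated speaker


registered = {"anna", "ben"}
print(classify_outcome("anna", "ben", registered))
print(classify_outcome("zoe", "ben", registered))
```

Merging misclassifications and false acceptances, as is usually done, would amount to mapping both labels to a single error class after this step.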

Levels of text-dependence

Another feature classically used to specify a speaker recognition system is its level of text-dependence, i.e., the constraints imposed on the linguistic material of a test utterance. A main distinction is conventionally drawn between text-dependent systems and text-independent systems. Though this basic distinction is not accurate enough to cover the range of practical possibilities, we give below a definition of these 2 terms, according to the usage found in the literature. To simplify, in text-dependent systems, the linguistic content of the training and test material is identical, while in text-independent systems test utterances vary across trials (at least in terms of word order).

However, a deeper study of the various strategies used in practice shows that at least 5 levels of text-dependence should be distinguished. Two of them are text-dependent approaches, which differ in their use of either a personal password or a common password. The other 3 can be viewed as variants of text-independent approaches, using either fixed words in a random order (fixed-vocabulary systems), a specific linguistic event, wherever it occurs (event-dependent systems), or a completely unrestricted text (unrestricted text-independent systems).

Interaction mode with the user

For text-dependent systems, the speech material that the user must pronounce in front of the system is determined a priori during registration. While common-password systems have absolutely no flexibility in the choice of the linguistic material, personal-password systems can allow some text customization, in particular the possibility for the registered user to change his voice password.

For text-independent systems, at least 3 modes of interaction with the user can be distinguished: text-prompted and voice-prompted systems impose the (unpredictable) linguistic material on the user, whereas unprompted systems use totally spontaneous speech. In parallel with speaker recognition, prompted systems explicitly or implicitly perform some kind of speech recognition, in order to check that the applicant speaker has really uttered what he was asked to say.

Definitions

Applicant speaker : The speaker using a speaker recognition system at a given instant. Alternative terms : current speaker, test speaker, unknown speaker, customer, user,...

Registered speaker : A speaker who belongs to the list of registered users for a given speaker recognition system. For speaker classification systems, we propose the term conform speaker to qualify a speaker who belongs to one of the classes of speakers for a given speaker classification system. Alternative terms : reference speaker, valid speaker, authorised speaker, subscriber, client,...

Genuine speaker : A speaker whose real identity is in accordance with the claimed identity. By extension : a speaker whose actual character and claimed class are in accordance. Alternative terms : authentic speaker, true speaker, correct speaker,...

Impostor (speaker) : In the context of speaker identification, an impostor is an applicant speaker who does not belong to the set of registered speakers. In the context of speaker verification, an impostor is a speaker whose real identity is different from his claimed identity. Alternative terms : impersonator, usurper,... For speaker classification tasks, this concept is better rendered by the term : discordant speaker.

Identity assignment : Decision outcome which consists in attributing an identity to an applicant speaker, in the context of speaker identification. For speaker classification, the term class assignment should be used instead.

Acceptance : Decision outcome which consists in responding positively to a speaker (or speaker class) verification task.

Rejection : Decision outcome which consists in refusing to assign a registered identity (or class) in the context of open-set speaker identification (resp. classification), or which consists in responding negatively to a speaker (class) verification trial.

(Speaker) misclassification : Erroneous identity assignment to a registered speaker in speaker identification.

False (speaker) rejection : Erroneous rejection of a registered speaker or of a genuine speaker in open-set speaker identification or speaker verification respectively.

False (speaker) acceptance : Erroneous acceptance of an impostor in open-set speaker identification or in speaker verification.

Mistaken speaker : The registered speaker owning the identity assigned erroneously to another registered speaker by a speaker identification system.

Violated speaker : The registered speaker owning the identity assigned erroneously to an impostor in open-set speaker identification. The registered speaker owning the identity claimed by a successful impostor, in speaker verification.

Text-dependent speaker recognition system : A speaker recognition system for which the training and test speech utterances are composed of the exact same linguistic material, in the same order (typically, a password).

Text-independent speaker recognition system : A speaker recognition system for which the linguistic content of test speech utterances varies across trials.

Personal-password speaker recognition system : A text-dependent speaker recognition system for which each registered speaker has his own voice password.

Common-password speaker recognition system : A text-dependent speaker recognition system for which all registered speakers have the same voice password.

Fixed-vocabulary speaker recognition system : A text-independent speaker recognition system for which test utterances are composed of words, the order of which varies across speakers and sessions, but for which all the words were pronounced at least once by the speaker when he registered to the system.

Event-dependent speaker recognition system : A text-independent speaker recognition system for which test utterances must contain a certain linguistic event (or class of events) while the rest of the acoustic material is discarded. This approach requires a preliminary step for spotting and localising the relevant event.

Unrestricted text-independent speaker recognition system : A text-independent speaker recognition system for which no constraints apply to the linguistic content of the speech material.

Text-prompted speaker recognition system : A speaker recognition system for which, during the test phase, a written text is prompted (through an appropriate device) to the user, who has to read it aloud.

Voice-prompted speaker recognition system : A speaker recognition system for which, during the test phase, the user has to repeat a speech utterance, which he listens to through an audio device.

Unprompted speaker recognition system : A speaker recognition system using totally spontaneous speech, i.e., for which the user is totally free to utter what he wants, or for which the system has no control over the speaker.

Examples

In this section, we give examples of well-known speaker recognition systems which can be found in the literature, in order to illustrate the taxonomy described above.

Text-dependent systems

Among examples of text-dependent systems, the Bell Labs system, reported by Rosenberg [Rosenberg 1976] and improved by Furui [Furui 1981], is tested by the latter under the following protocol:

Several [six] kinds of utterance sets were used to evaluate [the] system. Two all-voiced sentences were used in the recordings. The males used the sentence, "We were away a year ago" and the females used the sentence, "I know when my lawyer is due."

The first 5 utterance sets are composed of male speakers, while the last one is composed of female speakers. Performances are reported for speaker verification experiments on each set. Following our terminology, these experiments simulate a common-password text-dependent speaker verification system. Here, the password is an entire sentence. As Rosenberg notes, to justify the use of a text-dependent system in practical applications:

For many applications, the speakers are expected to be cooperative so that a prescribed text is perfectly feasible.

The use of a prescribed text also has the advantage that it does not need any prompting, but the drawback that it may be forgotten by the user. As discussed in a later example, a second drawback of text-dependent systems is the possibility for impostors to use pre-recorded speech.

As an example of a personal-password text-dependent speaker verification system, one can mention a new service offered by the American telephone operator SPRINT. For this service, the user must speak his telephone card number over the phone, in order to have the call he wishes to make charged directly to his home bill. The system identifies the claimed customer by recognising the sequence of digits, and then verifies, on that same sequence of digits, the match between the actual user and the assumed customer. Here, the sequence of digits has a double function: a means of customer identification and a personal voice password for speaker verification.

Fixed-vocabulary systems

Another very popular speaker verification system was developed by Doddington, at Texas Instruments, in the early 70s. Here follow excerpts of the description given by the author [Doddington 1985]:

To use the system an entrant first opens the door to the entry booth and walks in, then he identifies himself by entering a user ID into a keypad, and then he repeats the verification phrase(s) that the system prompts him to say. If he is verified, the system [ ... ] unlocks the inside door of the booth so that he may enter into the computer center. If he is not verified, the system notifies him by saying "not verified, call for assistance".
Verification utterances are constructed randomly to avoid the possibility of being able to defeat the system with a tape recording of a valid user. An innocuous four-word fixed phrase structure is used, with one of sixteen word alternatives filling each of the four word positions (see Table).

Table: Verification Phrase Construction for the TI Operational Voice Verification System (after Doddington)

An example verification utterance might be "Proud Ben served hard". These utterances are prompted by voice. This is thought to improve verification performance by stabilizing the pronunciation of the user's utterance.

Therefore, the TI system turns out to be a voice-prompted fixed-vocabulary speaker verification system, the claimed identity being input as a personal identification number on a keypad. Doddington's excerpt illustrates well the motivations behind the voice-prompted fixed-vocabulary approach: the relative randomness of the verification utterances protects against impostors using pre-recorded speech, while the use of voice prompts tends to control the reproducibility of the user's pronunciation. However, it must be noted that voice-prompting may also neutralise some of the speaker specificity (in particular prosodic factors), owing to an unconscious mimicry of the prompt. At the same time, text-prompting has the drawback of requiring a specific device, such as a screen, which is not always possible to implement.
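
The random construction of a four-word phrase from sixteen alternatives per position can be sketched as follows. The actual TI word table is not reproduced here, so the vocabulary is filled with placeholder tokens; the function names and the use of Python's random.SystemRandom are likewise assumptions made for illustration.

```python
import random

POSITIONS = 4       # four word positions, as in Doddington's description
ALTERNATIVES = 16   # sixteen word alternatives per position

# Placeholder vocabulary (hypothetical): one list of tokens per word position.
VOCABULARY = [
    [f"word{position}_{alternative}" for alternative in range(ALTERNATIVES)]
    for position in range(POSITIONS)
]


def number_of_phrases(vocabulary):
    """Count the distinct phrases the fixed structure can produce."""
    count = 1
    for slot in vocabulary:
        count *= len(slot)
    return count


def draw_phrase(vocabulary, rng=None):
    """Pick one alternative per position to form a verification phrase."""
    if rng is None:
        rng = random.SystemRandom()  # OS-provided randomness, hard to predict
    return [rng.choice(slot) for slot in vocabulary]


print(number_of_phrases(VOCABULARY))  # 16 ** 4 = 65536
print(" ".join(draw_phrase(VOCABULARY)))
```

With 16 alternatives in each of 4 positions there are 16^4 = 65536 possible phrases, so a single tape recording of a valid user is unlikely to match the phrase prompted at a given trial.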

The experiments reported by Soong and Rosenberg [Soong 1987], in which sequences of digits are used for speaker verification, are another example of a fixed-vocabulary system.

Unrestricted text-independent systems

Unrestricted text-independent speaker recognition is usually considered desirable for several reasons. Even if the user does not choose the text to utter, prompted systems are less likely to be defeated by a recorded voice, as the linguistic material is virtually unpredictable. For unprompted systems, identification or verification can take place unobtrusively, during a telephone transaction, for instance. Moreover, unprompted approaches do not require the speaker to be actively cooperative.

Here is the general structure of a text- (or voice-) prompted unrestricted text-independent system, as described by Furui [Furui 1994]:

The recognition system prompts each user with a new key sentence every time the system is used, and accepts the input utterance only when it decides that the registered speaker has uttered the prompted sentence. [ ... ]. This method not only can accurately recognize speakers but also can reject utterances whose text differs from the prompted text, even if it is uttered by the registered speaker.
[During registration], since the text of training utterances is known, these utterances can be modeled as the concatenation of [speaker-independent] phoneme models, and these models can be automatically adapted [to the new registered speaker]. In the recognition stage, the system concatenates phoneme models according to the prompted text [i.e a speaker-specific model and a speaker-independent model]. If the likelihood of both speaker and text is high enough, the speaker is accepted as the claimed speaker.

Note here that the fundamental difference between the system described above and a fixed-vocabulary system is the use of subword speech units (here, phonemes), which make it possible to build speaker-specific models of test words (or sentences) that were not pronounced during the registration phase. Note also the use of an explicit speech recognition step.
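
The double check performed by a prompted system, on the speaker and on the text, can be pictured with the simplified sketch below. It assumes that two separate scores and two thresholds are available, which is a simplification introduced here for illustration and not a description of how the likelihoods are computed or combined in the system quoted above; all names are hypothetical.

```python
def prompted_decision(speaker_score, text_score, speaker_threshold, text_threshold):
    """Accept only if the prompted text was uttered AND the voice matches the claim.

    speaker_score: how well the utterance fits the claimed speaker's models
    text_score: how well the utterance fits the prompted text
    """
    if text_score < text_threshold:
        # Rejected even when the registered speaker is the one talking.
        return "reject: text differs from prompt"
    if speaker_score < speaker_threshold:
        return "reject: speaker not verified"
    return "accept"


# Hypothetical scores for three trials, with both thresholds set to 0.5.
print(prompted_decision(0.9, 0.8, 0.5, 0.5))  # right speaker, right text
print(prompted_decision(0.9, 0.2, 0.5, 0.5))  # right speaker, wrong text
print(prompted_decision(0.3, 0.8, 0.5, 0.5))  # wrong speaker, right text
```

The second trial corresponds to the case highlighted in the excerpt: an utterance whose text differs from the prompt is rejected even if it comes from the registered speaker, which is what defeats a replayed recording.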

In contrast to prompted systems, here is an example of an experiment in unprompted speaker recognition, as reported by Gish [Gish 1986], concerning the ISIS system from BBN:

[ ... ] We wish to identify an unknown speaker, from an utterance, [ ... ] knowing that the utterance was made by one of a set of M possible speakers. We have available training data for each of the M speakers that consists of speech from one or more telephone calls, all distinct from the test telephone call. The text of all utterances is assumed to be unknown.

Here, the protocol described is unprompted unrestricted text-independent closed-set speaker identification. Note also the multi-session character of the experiment, i.e., the training and test material have been recorded through different channels, probably on different days.


WWW Administrator
Fri May 19 11:53:36 MET DST 1995