
Influencing factors

 

Many factors influence the performance of a speaker recognition system. In this section, we list the most obvious ones and stress their role in the overall performance of speaker recognition systems. In particular, we discuss which aspects should be explicitly taken into account and reported on when defining and describing an evaluation protocol, and how they should be handled and expressed.

Speech quality

 

A major factor influencing performance in speaker recognition is speech quality, in particular the bandwidth of the recording, the location of the speech acquisition system (anechoic chamber, studio, office, street, etc.), the type of microphone used, and the transmission channel (direct recording, telephone transmission, radio transmission, type of signal coding, etc.).

In practice, it is essential that the description of speech quality accurately reports the general characteristics, but also underlines which factors vary and which remain constant between several trials of the same user. For instance, speaker recognition over the telephone is generally more difficult than speaker recognition using studio recordings. But two evaluations using telephone speech may give different results if, in the first one, each speaker always calls from the same place using the same telephone, while in the second one he is asked to call from different locations. Naturally, the former is likely to give better results than the latter.

Temporal drift

 

The characteristics of a voice vary over time, depending on how tired the speaker is at different points of the day, how stressed he is, what his mood is, whether or not he has a cold, etc. Moreover, it has often been noted that the behaviour of users changes as they become accustomed to a system. These trends can be gathered under the term temporal drift.

Temporal drift usually has a significant effect on the performance of a speaker recognition system. Intra-speaker variability within a single recording session is usually much smaller than inter-session variability. In practice, performance levels deteriorate significantly a few days, or even a few hours, after registration, as compared to those obtained with contemporaneous speech, i.e. when test utterances are pronounced immediately after the training phase is terminated. A partial solution to temporal drift consists in using training material gathered over several sessions: as the collected data are more representative of the intra-speaker variability over time, more robust speaker models can be built. However, this approach makes the registration process heavier.

When the targeted application is intended to operate over a long period of time, it is necessary to design an evaluation experiment for which test material was recorded in several sessions, separated from each other by at least one day, and covering a reasonable time-span (at least a month). When multi-session recordings are available, the training material should be chosen so that it corresponds to the first recording session (or sessions, for multi-session training). Conversely, the material of a given session should never be split between training and test, as this would lead to an unrealistic protocol.
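
As an illustration, this constraint can be enforced mechanically when preparing the data. The sketch below is a minimal Python example, assuming each recording is labelled with a speaker identifier and a chronologically ordered session index; all names are illustrative.

    from collections import defaultdict

    def split_by_session(recordings, n_train_sessions=1):
        """Partition recordings on session boundaries: the earliest
        session(s) of each speaker go to training, all later sessions
        to test, so that no session is ever split between the two sets.

        `recordings` is an iterable of (speaker_id, session_id, utterance)
        tuples, where session_id orders a speaker's sessions chronologically.
        """
        recordings = list(recordings)
        sessions = defaultdict(set)
        for speaker, session, _ in recordings:
            sessions[speaker].add(session)
        # The first session(s) of each speaker form the training set.
        train_ids = {spk: set(sorted(ids)[:n_train_sessions])
                     for spk, ids in sessions.items()}
        train = [r for r in recordings if r[1] in train_ids[r[0]]]
        test = [r for r in recordings if r[1] not in train_ids[r[0]]]
        return train, test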

When these constraints are fulfilled, the number of training and test sessions, and the time-span covered by both phases, should be stated explicitly. Note that the number and time-span of training sessions have an influence on performance levels and on user acceptability, whereas the number and time-span of test sessions only have an impact on the statistical validity of the evaluation results.

Speech quantity and variety

 

Another fundamental factor having an effect on the performance level is the speech quantity, i.e. the amount of speech material used during the registration phase and during each user trial phase. Usually, the level of performance increases with the speech quantity, but stabilises after a certain amount. As user acceptability generally drops when the training phase (and a fortiori the test phase) lasts too long, a compromise has to be found when the performance improvement is judged marginal.

When registration consists of several training sessions, it is the total training speech quantity, i.e. the speech quantity per training session multiplied by the number of training sessions, which has the clearest impact on performance. When tests are carried out with speech material from several sessions, the relevant factor is the speech quantity per test session.

It may also be important to distinguish the quantity of speech uttered by the speaker from the quantity of speech actually used for training and testing. This distinction is mostly appropriate for event-dependent systems and mainly for diagnostic analysis. In this case, the (average) proportion of speech actually used can be given for information, for the training mode and for the test mode. However, to avoid ambiguities, the absolute quantity of speech used should not be reported.

A factor related to speech quantity is speech variety: for a given speech quantity, it is usually more efficient to cover a wider range of linguistic phenomena. In the absence of a universal quantitative measure of linguistic coverage, a qualitative description of the linguistic material is the only way of specifying this aspect.

Speaker population size and typology

 

In this section, we indicate in what way, and to what extent, the composition of the population, in terms of size and typology, can affect the performance of a speaker recognition system, and how it should be taken into account when designing an evaluation experiment.

When the goal is closed-set speaker identification, it is clear that the complexity of the task increases with n, the size of the registered speaker population. However, the proportion of men and women in the population also has a direct influence, as same-sex confusions are usually much more likely than cross-sex errors. If additional geographical, physiological, or even psychological and sociological information seems particularly relevant or clearly specific to the tested population, the experimenter should be aware of it and make it explicit.

As far as speaker verification is concerned, the level of performance does not depend on the size of the registered speaker population, as for each trial the complexity of the task corresponds to open-set speaker identification with n = 1. A large, representative population of registered speakers will only guarantee a higher statistical validity of the evaluation results, whereas general conclusions will be less reliable with a small, specific population.
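
To make the point about statistical validity concrete, a confidence interval can be attached to any measured error rate. The following Python sketch uses the standard normal approximation to the binomial distribution; it is an illustrative calculation, not part of any specific protocol.

    import math

    def error_rate_confidence_interval(n_errors, n_trials, z=1.96):
        """Approximate 95% confidence interval (normal approximation to
        the binomial) for an error rate estimated over n_trials
        independent trials. The half-width shrinks as 1/sqrt(n_trials),
        which is why more test trials give statistically more reliable
        results."""
        p = n_errors / n_trials
        half_width = z * math.sqrt(p * (1.0 - p) / n_trials)
        return max(0.0, p - half_width), min(1.0, p + half_width)

    # A 5% error rate measured over 1000 trials is known to within
    # about +/- 1.4%; over 100 trials, only to within about +/- 4.3%.
    print(error_rate_confidence_interval(50, 1000))
    print(error_rate_confidence_interval(5, 100))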

However, a relevant issue for speaker verification (and open-set identification) is the number and typology of pseudo-impostors, i.e. speakers used to model impostors during the registration phase. With more pseudo-impostors, the modeling of imposture is usually more accurate. The way pseudo-impostors are selected, and how they differ from authorised users, is also essential.

In general, each registered speaker has a corresponding impostor model, which represents real impostors who could claim his identity. The impostor model can be common to all registered speakers, or specific to each authorised user, if the pseudo-impostor population varies across subscribers. Pseudo-impostors can be chosen within the population of registered speakers, or originate from an external population. We will use the term pseudo-impostor bundle to refer to the group of speakers who have been used to build the impostor model of a given registered speaker.

From a practical point of view, when impostor models are built from other registered speakers, the recording burden is lighter, but the impostor models may be less representative of imposture in general. If an additional population of external speakers is used, the number of additional pseudo-impostors, their population typology, as well as the speech quantity and number of sessions required from each of them, should be specified.

Incidentally, for the evaluation of a speaker verification system, a test impostor should not be part of the pseudo-impostor bundle of the speaker he is claiming to be, as the real rejection abilities of the system may otherwise be over-estimated. Conversely, there is no objection to having a registered speaker belong to his own pseudo-impostor bundle, as is the case when the whole registered population is used to build a common impostor model.
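
This constraint is straightforward to enforce when generating impostor trials. The sketch below assumes pseudo-impostor bundles are stored as a mapping from each registered speaker to the set of speakers used to build his impostor model; all identifiers are invented for illustration.

    def valid_impostor_trial(impostor_id, claimed_id, bundles):
        """Keep an impostor trial only if the test impostor was not used
        to build the impostor model of the claimed identity. A speaker
        may, however, appear in his own bundle."""
        return impostor_id not in bundles[claimed_id]

    # Common impostor model built from the whole registered population:
    # every registered speaker belongs to every bundle, so no registered
    # speaker may serve as a test impostor against another one.
    registered = {"spk01", "spk02", "spk03"}
    bundles = {spk: set(registered) for spk in registered}
    assert not valid_impostor_trial("spk02", "spk01", bundles)
    assert valid_impostor_trial("ext17", "spk01", bundles)  # external impostor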

Speaker purpose and other human factors

 

The motive for which a speaker is using a system also considerably influences its performance profile. We first describe a possible typology of applicant speakers with regard to their objective. We then mention other relevant human factors.

When the user's goal conforms to the system's purpose, a cooperative (registered) speaker can be defined as an authorised applicant who is willing to be identified, or as a genuine speaker who intends to be verified positively. Their counterpart in the impostor population would be a well-intentioned impostor, i.e. an impostor whose goal is to be rejected.

When the user's goal and the system's purpose are opposed, an uncooperative (registered) speaker knows that he is being verified but wants the system to reject him. For instance, an uncooperative speaker is likely to use natural or artificial voice masking in order to remain anonymous. Conversely, an intentional impostor has the clear goal of being identified or verified even though he is not registered (violation), or of being identified as somebody else (usurpation).

Here, a distinction must be made among intentional impostors, depending on whether or not they have previously been in contact with the voice of the authentic user whose identity they are claiming. We propose the term acquainted impostor to qualify an intentional impostor who has some knowledge of the voice of the authorised speaker, as opposed to an unacquainted impostor, who has never been in contact with the genuine user. The degree of success of an acquainted intentional impostor will ultimately depend on his imitation skills.

The term casual impostor is often used to qualify speakers who are used as impostors in an evaluation, but who were not recorded with the explicit instruction of trying to defeat the system. In the same way, the term casual registered speaker can be used to refer to a population of registered speakers who have not received an explicit instruction to succeed in being identified or verified positively.

Here again, variants appear depending on the way the experimenter chooses the claimed identity of a casual impostor in a verification experiment. A casual impostor can be tested against all registered users systematically, against all other registered speakers of the same sex, against all other registered speakers of the opposite sex, against k registered speakers chosen at random, against the k nearest neighbours in the registered population, etc.

Whereas, to a first approximation, a population of casual registered speakers may be relatively representative of a population of cooperative registered speakers, no test protocol using casual impostors can accurately approximate the behaviour of intentional impostors. In practice, a real impostor could try to vary his voice characteristics for a fixed identity over successive trials, until he succeeds in defeating the system, gives up, or the system blacklists the genuine user. Or he may try as many registered identities as he can, with his natural voice or a disguised voice, until he succeeds, gives up, or the police arrive!

However, most laboratory evaluations use speech databases which have usually not been recorded in a real-world situation. Therefore, they accurately model neither cooperativeness nor intentional imposture, and the impostor speakers are casual impostors. A frequent practice is to use an exhaustive attempt test configuration, in which each impostor is successively tested against each registered speaker. We suggest adopting a slightly different approach, in which two distinct experiments are carried out: one in which each casual impostor utterance is tested against all registered identities of the same sex, and a second in which each casual impostor utterance is tested against all registered identities of the opposite sex. The first experiment makes it possible to estimate the rejection ability of a system towards unacquainted intentional impostors who would know the sex of the genuine speaker, even though casual impostors are almost well-intentioned impostors. The second experiment makes it possible to check that the system is really robust to cross-sex imposture. We will refer to these configurations as a selective attempt against all same-sex speakers and a selective attempt against all cross-sex speakers, respectively. To a first approximation, the proportion of successful violations does not depend on the number of registered speakers.
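
The two selective attempt configurations can be written down as trial-list generators. The following sketch assumes that each impostor utterance and each registered speaker carries a sex label; all names are illustrative.

    def selective_attempts(impostor_utterances, registered, same_sex=True):
        """Generate (impostor_utterance, claimed_identity) pairs.

        `impostor_utterances` is a list of (utterance_id, impostor_sex)
        pairs and `registered` a list of (speaker_id, sex) pairs. With
        same_sex=True, this yields the selective attempt against all
        same-sex speakers; with same_sex=False, the selective attempt
        against all cross-sex speakers. The union of the two lists is
        the exhaustive attempt configuration.
        """
        for utt_id, imp_sex in impostor_utterances:
            for spk_id, spk_sex in registered:
                if (spk_sex == imp_sex) == same_sex:
                    yield utt_id, spk_id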

In addition, testing each impostor utterance against its nearest neighbour in the registered population can give a hint of the system's robustness against intentional imposture. However, the result will be directly influenced by the size of the registered speaker population. Therefore, this approach is only meaningful in the framework of a comparative evaluation on a common database. It can be qualified as a selective attempt towards the nearest registered neighbour. Other selective attempts are possible, such as towards speakers of the same age class, for instance.

To summarize, registered speakers should be qualified as cooperative, casual or uncooperative, whereas a distinction should be made between well-intentioned, casual and (acquainted or unacquainted) intentional impostors. Only field data can provide realistic instances of user behaviour.

Additionally, the general motivation and behaviour of the users can have an impact on the performance of a system: for instance, the stakes of a successful identification or verification, the benefits of an imposture, or the feeling of users towards voice technology in general. In an evaluation, all these aspects influence the motivation of the user, and therefore the interpretation of the results.

Recommendations

 

The description of an evaluation experiment or an assessment protocol concerning a speaker classification or recognition system should explicitly report on the following items:

- the speech quality: bandwidth, acquisition environment, microphone and transmission channel, together with which of these factors vary and which remain constant between trials of a same user;
- the temporal dimension: number of training and test sessions, and the time-span covered by each phase;
- the speech quantity used for registration and for each trial, and the variety of the linguistic material;
- the size and typology of the registered speaker population;
- the number, typology and selection of pseudo-impostors, and the composition of the pseudo-impostor bundles;
- the purpose and behaviour of registered speakers and impostors (cooperative, casual or uncooperative; well-intentioned, casual or intentional), and the attempt configuration used for impostor tests.

Example

Here is what the description of an experimental evaluation protocol could look like.

The following protocol was designed to estimate the performance of a speaker verification system for the protection of personal portable telephones. The principle of the targeted security system is personal-password text-dependent speaker verification. Before a user can place a call on his portable phone, he is asked to utter his identity, i.e. his name and surname. The compatibility between the speaker and the authorised owner is checked locally, and in case of acceptance, the speaker is allowed to dial his number.

To simulate this application, the following experimental protocol was set up. A group of portable phone owners was provided with a (slightly modified) miniature tape-recorder (the size of a dictaphone), and asked to record their name and surname before placing a call on their phone, except if they had already done so during the previous 3 hours. To make sure that users could not record all their utterances consecutively, a temporisation was implemented in the tape-recorder, so that a time interval of 3 hours had to elapse between 2 activations of the record function. In return for a user's participation, his subscription to the portable phone service was paid for 6 months. In practice, the 6 month subscription was refunded to a user when he brought back a recorded tape containing 100 recordings. This number corresponds approximately to 1 session every other day over 6 months. In reality, the average time after which a tape was returned was 4.2 months.
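
The temporisation amounts to a simple timestamp check. Although the actual device was a modified tape-recorder, the lockout rule it implements can be sketched as follows; the function and its arguments are invented for illustration.

    from datetime import datetime, timedelta

    LOCKOUT = timedelta(hours=3)

    def may_record(last_activation, now):
        """The record function is enabled only if at least 3 hours have
        elapsed since its last activation, preventing a user from
        recording all 100 utterances in a single sitting."""
        return last_activation is None or now - last_activation >= LOCKOUT

    assert may_record(None, datetime(1995, 5, 19, 9, 0))
    assert not may_record(datetime(1995, 5, 19, 9, 0),
                          datetime(1995, 5, 19, 11, 0))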

Once a tape and a tape-recorder were returned, the tape's content was digitized at a sampling frequency of 16 kHz, and the data were segmented automatically (a beep had been internally recorded on the tape each time the "stop" button was pressed). The speech material was not verified exhaustively, but a speech activity detector was used to discard utterances composed of silence only. On average, 97% of the utterances were kept. Silent signal portions lasting longer than 0.2 seconds were removed automatically. The typical bandwidth of the tape-recorder's microphone is 150-6000 Hz, which is within the tape's bandwidth. All tapes were of the same brand, and their noise level was judged negligible. Even though, for a given speaker, the microphone and tape characteristics remained constant across all recording sessions, the data collection protocol can be considered realistic for the targeted application.
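
Neither the speech activity detector nor the pause-removal step is specified further. A crude energy-based version of both operations could look like the sketch below, assuming 16 kHz samples held in a NumPy array and an empirically chosen energy threshold; all details are assumptions for illustration.

    import numpy as np

    RATE = 16000                        # sampling frequency (Hz)
    FRAME = int(0.01 * RATE)            # 10 ms analysis frames
    MAX_PAUSE_FRAMES = 20               # 20 frames = 0.2 seconds

    def frame_energies(signal):
        """Mean energy of consecutive 10 ms frames (trailing samples
        are ignored)."""
        n = len(signal) // FRAME
        frames = signal[: n * FRAME].astype(float).reshape(n, FRAME)
        return (frames ** 2).mean(axis=1)

    def is_silence_only(signal, threshold):
        """True if no frame exceeds the energy threshold, i.e. the
        utterance should be discarded as silence-only."""
        return bool((frame_energies(signal) < threshold).all())

    def trim_long_pauses(signal, threshold):
        """Remove silent stretches longer than 0.2 seconds; shorter
        pauses are kept, preserving the natural utterance rhythm."""
        silent = frame_energies(signal) < threshold
        kept, run = [], []
        for i, is_silent in enumerate(silent):
            frame = signal[i * FRAME : (i + 1) * FRAME]
            if is_silent:
                run.append(frame)
                continue
            if len(run) <= MAX_PAUSE_FRAMES:   # short pause: keep it
                kept.extend(run)
            kept.append(frame)
            run = []
        if run and len(run) <= MAX_PAUSE_FRAMES:
            kept.extend(run)
        return np.concatenate(kept) if kept else signal[:0]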

The first 5 recordings of each speaker were used as training material, whereas the remaining ones (92, on average) were used as test material. The average registration time-span was estimated to be (5 / 97) × 4.2 months ≈ 6.5 days, which may be an over-estimate of the actual time-span, as users were probably more inclined to record their voice frequently at the beginning of the experiment. In the same way, the average operation time-span was considered to cover approximately 4 months.

An initial population of 188 persons agreed to take part in the experiment, but 19 of them never returned the recording device, either because they lost it or because they lost interest in the experiment. Additionally, 7 tape-recorders and 3 tapes were damaged during the 6 month time-span. Altogether, only 159 different speakers were thus taken into consideration as registered speakers, among which 92 (i.e. 58%) were male. All of them were adults over 18. Nothing else about their profile was studied, but they are likely to correspond to a relatively wealthy fraction of the population, since they can afford a portable phone.

In this database, a speaker utters his name and surname in 0.8 seconds on average, but this figure varies significantly from one person to another. The linguistic content of the speech material cannot be specified other than exhaustively.

For impostor modeling, we used all recordings corresponding to the registration phase for all speakers, which we pooled together to form a speaker-independent, text-independent model. We then derived an impostor model for each registered speaker as the representation of the user's training pronunciations according to the speaker-independent model. In other words, all registered speakers were part of the pseudo-impostor bundle of a given speaker, including this very speaker.
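
The underlying modeling technique is not specified here, so the sketch below illustrates the principle with deliberately simple single-Gaussian models: a pooled speaker-independent (world) model stands in for the pseudo-impostor bundle, and the verification score is the log-likelihood ratio between the claimed speaker's model and the world model. Every modeling choice in this sketch is an assumption made for illustration.

    import numpy as np

    class DiagonalGaussian:
        """A deliberately simple stand-in for a speaker or world model:
        a single Gaussian with diagonal covariance over feature vectors."""
        def __init__(self, features):
            self.mean = features.mean(axis=0)
            self.var = features.var(axis=0) + 1e-6   # variance floor

        def avg_log_likelihood(self, features):
            """Per-frame average log-likelihood of a feature matrix."""
            ll = -0.5 * (np.log(2 * np.pi * self.var)
                         + (features - self.mean) ** 2 / self.var)
            return float(ll.sum(axis=1).mean())

    def verification_score(test_features, speaker_model, world_model):
        """Log-likelihood ratio of the claimed speaker's model against
        the pooled pseudo-impostor (world) model; the claimed identity
        is accepted when the score exceeds a decision threshold."""
        return (speaker_model.avg_log_likelihood(test_features)
                - world_model.avg_log_likelihood(test_features))

    # world model: training data of all registered speakers pooled;
    # speaker model: the claimed user's own registration recordings.
    world = DiagonalGaussian(np.random.randn(500, 12))
    speaker = DiagonalGaussian(np.random.randn(50, 12) + 0.3)
    print(verification_score(np.random.randn(80, 12) + 0.3, speaker, world))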

A set of 6 professional imitators (4 male, 2 female) was then asked to simulate acquainted intentional test impostors. For each registered speaker of the same sex, they were provided with the tape-recorder of the genuine user, and could listen as much as they wanted to the training material of this user. They were then asked to produce 5 imitated utterances of the speaker saying his name and surname. These imitations were recorded on the user's tape-recorder, at the end of the user's tape. Given the experimental protocol, it was not possible to provide the imitators with any feedback concerning their success or failure in breaking the system. Altogether, each male imitator recorded approximately 5 × 92 = 460 impostor tests against registered male speakers, and each female imitator produced about 5 × 67 = 335 impostor tests against registered female speakers. All imitators were paid for their work. The imitated speech followed the same processing as the genuine speech.

For the evaluation of the system's performance, each authentic test utterance was tried against the genuine identity (159 × 92 = 14628 authentic trials), and each imitated utterance was tried against the targeted identity (4 × 460 + 2 × 335 = 2510 impostor trials).

We leave the reader the pleasure of tracking down the unavoidable experimental biases that remain in this imaginary experimental protocol, and of considering how they could be circumvented.


