The design and implementation of an experimental evaluation protocol is usually intended to provide an estimate of system performance. Besides overall error rates, other relevant quantities can be derived from the analysis of the test results, for diagnostic purposes.
In this section, we detail what pieces of information can be derived from evaluation results, and how to score them.
We will consider a population of m registered speakers.
For scoring purposes, we will consider a set of test utterances. We will use the term genuine test utterances for those which correspond to registered speakers, and the term impostor test utterances for those which are treated as belonging to impostor speakers. Note that the same speech segment can be used as a genuine utterance and as an impostor utterance, in some test configurations. Therefore, different notations can correspond to the same speech utterance.
Each registered speaker is supposed to have produced a certain number of genuine test utterances; superscript k will be used to index the k-th genuine test utterance of a given speaker.
In the rest of this chapter, we will denote as c the total number of genuine test utterances, and, for each registered speaker, we will consider the proportion of genuine test utterances in the test set which belong to that speaker. We will also make use of the number of registered speakers for which there is at least one genuine test utterance.
Finally, we will denote as M the set of male registered speakers and as F the set of female registered speakers, and we will consider the respective numbers of male and female registered speakers for which there is at least one genuine test utterance.
In the most general case, the whole set of impostor test utterances can be divided into subsets, each corresponding to one among n impostors using the system with a given claimed identity. Each such subset gathers the impostor test utterances produced by impostor j while claiming to be registered speaker i.
Similarly to genuine test utterances, we will denote as d the total number of impostor utterances, and we will consider the proportion of impostor tests carried out by impostor j against registered speaker i.
When the identity of the impostors is not a relevant factor (or is unknown), subscript j can be dropped, and an impostor utterance is simply indexed as the k-th impostor attempt against registered speaker i. Conversely, in open-set speaker identification, impostors do not claim a particular identity; they just try to be identified as one of the registered speakers, whoever this speaker may be. In this case, subscript i can be dropped, and an impostor utterance is indexed as the k-th impostor attempt from impostor j. If, moreover, the impostor's identity does not matter, subscript j can also be dropped, and an impostor utterance is simply the k-th impostor attempt.
We will further consider the following counts: the number of impostors for which there is at least one test utterance against registered speaker i; the number of registered speakers against which there is at least one test utterance from impostor j; the number of impostors from whom there is at least one impostor test utterance; the number of registered speakers against whom there is at least one impostor test utterance; and the total number of couples (registered speaker, impostor) for which there is at least one impostor test utterance (from that impostor against that registered speaker).
Finally, we will also distinguish the set of male impostor speakers from the set of female impostor speakers, together with the corresponding counts.
A closed-set identification system can be viewed as a function which assigns, to any test utterance z, an estimated speaker index corresponding to the identified speaker in the set of registered speakers.
In closed-set identification, all test utterances belong to one of the registered speakers. Therefore, a misclassification error occurs for the k-th test utterance produced by speaker i whenever the estimated speaker index differs from i.
The most natural figure that indicates the performance of a speaker identification system is the relative number of times the system fails to identify an applicant speaker correctly; in other words, how often a test utterance is assigned an erroneous identity. Whereas it is straightforward to calculate a performance figure on a speaker-by-speaker basis, care should be taken when deriving a global score.
With our notation, and assuming that speaker i has at least one genuine test utterance, we define the misclassification rate for speaker i as the proportion of his genuine test utterances for which a misclassification error occurs.
This rate provides an estimate of the probability that the system under test identifies another speaker when speaker i is the actual speaker, whereas its complement to 1 provides an estimate of the corresponding probability of correct identification. However, it is preferable to report error scores rather than success scores, and performance improvements should be measured as relative error rate reductions. If speaker i has no genuine test utterance, his misclassification rate is undefined.
We suggest using the term dependable speaker to qualify a registered speaker with a low misclassification rate, and the term unreliable speaker for a speaker with a high misclassification rate.
From speaker-by-speaker figures, the average misclassification rate can be derived by averaging the individual misclassification rates over all registered speakers for which they are defined, and by computing this average separately over the male and the female registered speakers, the gender-balanced misclassification rate can be obtained as the mean of the two resulting values.
The previous scores are different from the test set misclassification rate, calculated as the overall proportion of misclassified utterances in the test set.
The average and test set misclassification rates are formally identical if and only if the number of genuine test utterances does not depend on the speaker, i.e. when the test set contains an identical number of test utterances for each speaker. As it is usually observed that speaker recognition performance may vary with the speaker's gender, the comparison of the average and gender-balanced rates can show significant differences if the registered population is not gender-balanced. Therefore, we believe that an accurate description of the identification performance requires all 3 numbers (average, gender-balanced and test set misclassification rates) to be provided.
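As an illustration, the short sketch below (in Python, with function names, data layout and gender flags of our own choosing) computes the speaker-by-speaker, average, gender-balanced and test set misclassification rates from a confusion matrix whose entry C[i, j] counts the test utterances of speaker i identified as speaker j.

```python
import numpy as np

def misclassification_rates(conf, is_male):
    """conf[i, j]: number of test utterances of speaker i identified as j.
    is_male[i]: True if registered speaker i is male.
    Returns per-speaker, average, gender-balanced and test set rates."""
    conf = np.asarray(conf, dtype=float)
    n_i = conf.sum(axis=1)                      # genuine utterances per speaker
    errors_i = n_i - np.diag(conf)              # misclassified utterances per speaker
    tested = n_i > 0                            # speakers with at least one test

    per_speaker = np.full(len(conf), np.nan)
    per_speaker[tested] = errors_i[tested] / n_i[tested]

    average = per_speaker[tested].mean()
    male = tested & is_male
    female = tested & ~is_male
    gender_balanced = 0.5 * (per_speaker[male].mean() + per_speaker[female].mean())
    test_set = errors_i.sum() / n_i.sum()
    return per_speaker, average, gender_balanced, test_set

# Toy example (3 male speakers, 1 female speaker), illustrative numbers only.
C = np.array([[4, 1, 0, 0],
              [0, 3, 2, 1],
              [1, 0, 5, 0],
              [0, 0, 1, 3]])
male = np.array([True, True, True, False])
print(misclassification_rates(C, male))
```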
Taking another point of view, performance scores can be designed to measure how reliable the decision of the system is when it has assigned a given identity; in other words, to provide an estimate of the probability that the applicant speaker is not really speaker i when the system under test has output i as the most likely identity.
To define the mistrust rate, we consider, for each registered speaker i, the number and the proportion of test utterances identified as i over the whole test set, as well as the number of registered speakers whose identity was assigned at least once to a test utterance.
The mistrust rate for speaker i can then be computed, whenever at least one test utterance was identified as i, as the proportion of those utterances which were actually produced by another speaker. Here again, if no test utterance was identified as speaker i, his mistrust rate is undefined.
We suggest that the term resistant speaker could be used to qualify a registered speaker with a low mistrust rate, and the term vulnerable speaker for a speaker with a high mistrust rate.
From this speaker-by-speaker score, the average mistrust rate and the gender-balanced mistrust rate can be derived exactly as for misclassification rates. By noticing that every misclassified test utterance is counted both as one misclassification error and as one wrongly assigned identity, it appears that there is no need to define a test set mistrust rate. In other words, the test set mistrust rate is equal to the test set misclassification rate.
From a practical point of view, misclassification rates and mistrust rates can be obtained by the exact same scoring programs, operating successively on the confusion matrix and on its transpose.
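Under the same conventions as the previous sketch (and reusing its confusion matrix C, gender flags and misclassification_rates helper), this amounts to:

```python
# Mistrust rates: same scoring routine, applied to the transposed confusion matrix.
per_speaker_mistrust, avg_mistrust, gb_mistrust, ts_mistrust = \
    misclassification_rates(C.T, male)
# ts_mistrust equals the test set misclassification rate computed above.
```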
Most speaker identification systems use a similarity measure between a test utterance and all training patterns to decide, by a nearest neighbour decision rule, the identity of the test speaker. In this case, for a test utterance x, an ordered list of registered speakers can be produced, in which each speaker is judged closer to the test utterance than the speakers ranked after him.
The identification rank of the genuine speaker of an utterance is then the position at which the correct speaker appears in this ordered list of neighbours of the test utterance. Note that a correct identification corresponds to an identification rank of 1.
Assuming that speaker i has at least one genuine test utterance, the confidence rank for speaker i, at a given confidence level, can be defined as the smallest integer R such that the required proportion of the test utterances belonging to speaker i have their correct speaker among the R nearest neighbours in the ordered list of candidates.
Then, the average confidence rank can be computed as the average of the individual confidence ranks over all registered speakers for which they are defined.
Though a gender-balanced confidence rank could be defined analogously to gender-balanced misclassification and mistrust rates, the relevance of this figure is not clear.
Finally, the test set confidence rank is defined as the smallest integer R such that the required proportion of all genuine test utterances have their correct speaker among the R best candidates.
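A possible implementation, assuming that the identification rank of the true speaker has already been computed for every genuine test utterance (the function and variable names are ours):

```python
import numpy as np

def confidence_rank(ranks, level=0.95):
    """Smallest R such that at least `level` of the given identification
    ranks are <= R (ranks start at 1 for a correct identification)."""
    ranks = np.sort(np.asarray(ranks))
    # index of the order statistic where the cumulative proportion reaches `level`
    idx = int(np.ceil(level * len(ranks))) - 1
    return int(ranks[idx])

# Per-speaker confidence ranks and their average (toy identification ranks).
ranks_by_speaker = {1: [1, 1, 2, 1], 2: [3, 1, 2, 4], 3: [1, 1, 1, 2]}
per_speaker = {i: confidence_rank(r) for i, r in ranks_by_speaker.items()}
average_rank = sum(per_speaker.values()) / len(per_speaker)
# Test set confidence rank: pool all genuine test utterances together.
test_set_rank = confidence_rank([r for rs in ranks_by_speaker.values() for r in rs])
print(per_speaker, average_rank, test_set_rank)
```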
Average scores, gender-balanced scores and test set scores all fall under the same formalism. If we assign to each speaker a certain quantity which we will call his relative representativity, chosen to be non-negative and to sum to 1 over the speakers, each of these global scores can be written as the linear combination of the speaker-by-speaker scores weighted by the relative representativities.
The average, gender-balanced and test set scores therefore correspond to different estimates of a global score, under various assumptions on the relative representativity of each test speaker. For average scores, it is assumed that each speaker is equally representative, irrespective of his or her sex group; if the test population is strongly unbalanced, this hypothesis may not be relevant (unless there is a reason for it). For gender-balanced scores, each test speaker is supposed to be representative of his or her sex group, and each sex group is supposed to be equiprobable. Test set scores make the assumption that each test speaker has a representativity proportional to his or her number of test utterances, which is certainly not always a meaningful hypothesis.
Test set scores can therefore be used as an overall performance measure if the test data accurately represent the profile and behaviour of the user population, i.e. both in terms of population composition and of individual frequency of use. If only the composition of the test set population is representative of the general user population, average scores make it possible to neutralise possible discrepancies in the number of utterances per speaker. If, finally, the composition of the test set speaker population is not representative, gender-balanced scores provide a general-purpose estimate.
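The common formalism can be made explicit as in the sketch below, where the three weight vectors, with names and toy values of our own choosing, simply encode the three representativity assumptions discussed above:

```python
import numpy as np

def weighted_score(per_speaker_scores, weights):
    """Global score as a representativity-weighted combination of speaker scores."""
    w = np.asarray(weights, dtype=float)
    s = np.asarray(per_speaker_scores, dtype=float)
    return float(np.nansum(w / w.sum() * s))

# Toy per-speaker error rates, genuine-utterance counts and genders (illustrative).
scores  = np.array([0.20, 0.50, 0.17, 0.25])
n_utts  = np.array([5, 6, 6, 4])
is_male = np.array([True, True, True, False])

equal_w    = np.ones_like(scores)                          # average score
gender_w   = np.where(is_male, 0.5 / is_male.sum(), 0.5 / (~is_male).sum())
test_set_w = n_utts.astype(float)                          # proportional to usage

print(weighted_score(scores, equal_w),
      weighted_score(scores, gender_w),
      weighted_score(scores, test_set_w))
```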
If there is a way to estimate separately the relative representativity of each registered test speaker, a representative misclassification rate can be computed as the corresponding weighted combination. Conversely, techniques such as those used in public opinion polls can be resorted to in order to select a representative test population when setting up an evaluation experiment.
The figures below give examples of misclassification rates, mistrust rates and confidence ranks. However, it must be kept in mind that the number of tests used to build these examples is too small to guarantee any statistical validity of the figures.
Figure: An example illustrating how to score misclassification rates and mistrust rates from a confusion matrix, in speaker identification; the example exhibits a dependable speaker, an unreliable speaker, several resistant speakers, and one speaker, already singled out, who would seem to be vulnerable.
Figure: An example illustrating how to score confidence ranks in closed-set speaker identification. Only for speaker i = 2 is the right speaker in first position less than half of the time. On the test set, the list of the 3 best candidates would have to be kept in order to be sure that the right speaker is in the list 95 % of the time.
A verification system can be viewed as a function which assigns, to a test utterance z and a claimed identity, a boolean value which is equal to 1 if the utterance is accepted, and 0 if it is rejected.
Two types of errors can then occur: either a genuine speaker is rejected, or an impostor is accepted. The first case corresponds to a false rejection and the second to a false acceptance.
In the rest of this section, we will reason in terms of the corresponding events: acceptance or rejection of the attempt on the one hand, genuine or impostor attempt on the other hand.
We first address aspects of static evaluation, that is, what meaningful figures can be computed to measure the performance of a system over which the experimenter has absolutely no control. Then, after discussing the role of decision thresholds, we review several approaches that make it possible to obtain a dynamic evaluation of the system, i.e. in a relatively threshold-independent manner.
If speaker i has at least one genuine test utterance, the false rejection rate for speaker i is defined quite naturally as the proportion of his genuine attempts that are rejected. This rate provides an estimate of the probability that the system makes a diagnostic of rejection, given that the applicant speaker is the authorised speaker i (claiming his own identity). If speaker i has no genuine test utterance, his false rejection rate is undefined.
As for closed-set identification, the terms dependable speaker and unreliable speaker can be used to qualify speakers with a low or (respectively) high false rejection rate.
From speaker-based figures, the average false rejection rate can be obtained by averaging the speaker-by-speaker rates over all registered speakers for which they are defined, while the gender-balanced false rejection rate is the mean of the averages computed separately over the male and the female registered speakers. The test set false rejection rate is calculated as the overall proportion of rejected genuine attempts in the test set.
The average, gender-balanced and test set rates provide three different estimates of the false rejection probability. The test set rate is influenced by the distribution of genuine attempts in the test set, which may be purely artefactual. The issue of the representativity of a test speaker, raised in the previous section, is also relevant here, and for false acceptance rates as well.
As opposed to false rejection, there are several ways to score false acceptance, depending on whether it is the vulnerability of the registered speakers or the skill of the impostors which is considered. Moreover, the way to evaluate false acceptance rates and imposture rates depends on whether the identity of each impostor is known or not.
Known impostors
If the impostor identities are known, the false acceptance rate in favour of impostor j against registered speaker i can be defined, whenever impostor j has made at least one attempt against speaker i, as the proportion of those attempts that are accepted. This rate can be viewed as an estimate of the probability that the system makes a diagnostic of acceptance, given that the applicant speaker is impostor j claiming identity i.
Then, the average false acceptance rate against speaker i can be obtained, provided at least one impostor attempt against him was made, by averaging the false acceptance rates over all impostors for which they are defined; similarly, the average imposture rate in favour of impostor j can be calculated, provided he has made at least one attempt, by averaging over the identities he claimed.
These two rates provide estimates, respectively, of the probability of acceptance of an impostor attempt against speaker i and of the probability of acceptance of an attempt by impostor j, under the assumption that all impostors and all claimed identities are equiprobable. The first number indicates the false acceptance rate obtained on average by an impostor claiming identity i, while the second indicates the success rate of impostor j averaged over the identities he claimed. A registered speaker can be more or less resistant (low average false acceptance rate against him) or vulnerable (high average false acceptance rate against him), whereas impostors with a high average imposture rate can be viewed as skilled impostors, as opposed to poor impostors for those with a low one.
The average false acceptance rate, which is equal to the average imposture rate, is obtained as the average of the false acceptances over all couples (registered speaker, impostor); it provides an estimate of the false acceptance probability under the assumption that all such couples are equally likely.
Here, separate estimates of the average false acceptance rate on the male and the female registered populations can be obtained by averaging over the corresponding couples, and a gender-balanced false acceptance rate is provided by the mean of these two values.
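A sketch of these computations, under the assumption that the raw counts are available as two matrices indexed by (registered speaker, impostor), one holding the numbers of accepted impostor attempts and one the numbers of attempts (all names, and the gender encoding, are illustrative):

```python
import numpy as np

def false_acceptance_scores(accepted, trials, spk_is_male):
    """accepted[i, j], trials[i, j]: accepted / total attempts of impostor j
    against registered speaker i. Returns the main aggregate rates."""
    accepted = np.asarray(accepted, float)
    trials = np.asarray(trials, float)
    with np.errstate(invalid="ignore", divide="ignore"):
        fa_ij = np.where(trials > 0, accepted / trials, np.nan)

    fa_per_speaker = np.nanmean(fa_ij, axis=1)       # average FA against each speaker
    imposture_per_imp = np.nanmean(fa_ij, axis=0)    # average imposture rate per impostor
    average_fa = np.nanmean(fa_ij)                   # over all tested couples
    gb_fa = 0.5 * (np.nanmean(fa_ij[spk_is_male]) + np.nanmean(fa_ij[~spk_is_male]))
    test_set_fa = accepted.sum() / trials.sum()      # regardless of identities
    return fa_per_speaker, imposture_per_imp, average_fa, gb_fa, test_set_fa

# Toy example: 3 registered speakers (2 male, 1 female), 2 impostors.
acc = np.array([[1, 0], [0, 0], [2, 1]])
tri = np.array([[3, 2], [4, 0], [3, 2]])
print(false_acceptance_scores(acc, tri, np.array([True, True, False])))
```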
The question could be raised of whether it is desirable to compute a score which would provide an estimate of the false acceptance rate for a gender-balanced impostor population. We propose not to go that far, as it would clearly lead to a duplication of scoring figures, but the influence of the impostors' gender could be partly neutralised by the experimental design.
It may also be interesting to calculate imposture rates regardless of the claimed identities. In this case, we define the imposture rate in favour of impostor j regardless of the claimed identity as the proportion of all his attempts that are accepted, and the average imposture rate regardless of the claimed identity as the average of this quantity over all impostors.
However, none of the rates defined above that involve individual impostor identities can be evaluated when the identities of the impostors are not known. In this case, false acceptance rates and imposture rates can be calculated under the assumption that all impostor test utterances are produced by distinct impostors.
Unknown impostors
The false acceptance rate against speaker i assuming distinct impostors can be obtained, provided there is at least one impostor attempt against him, as the proportion of the impostor attempts against speaker i that are accepted, and the average false acceptance rate assuming distinct impostors is defined as the average of these rates over the registered speakers for which they are defined.
Here again, separate estimates of the average false acceptance rate assuming distinct impostors can be obtained on the male and the female registered populations, the gender-balanced false acceptance rate assuming distinct impostors being the mean of these two values.
The speaker-by-speaker rate provides a speaker-dependent estimate of the false acceptance probability, assuming distinct impostors. The average rate can be viewed as an estimate of this probability under the assumptions that impostors are distinct and that all claimed identities are equally likely, while the gender-balanced rate can be understood as another estimate under the assumptions that impostors are distinct, that attempts against male speakers and against female speakers are equiprobable, and that within a gender class all claimed identities are equally likely.
Test set scores
If, finally, false acceptances are scored globally, regardless both of the impostor identity and of the claimed identity, we obtain the test set false acceptance rate, which is identical to the test set imposture rate. This score provides a test set estimate of the false acceptance probability which is biased by the composition of the registered population and by a possible unevenness of the number of impostor trials for each speaker.
Summary
For scoring false acceptance rates, we believe that, besides the test set false acceptance rate, it is necessary to report the average and gender-balanced false acceptance rates (when impostors are known) or the average and gender-balanced false acceptance rates assuming distinct impostors (when they are not known), as the test set score may be significantly influenced by the test data distribution. The other scores described in this section are mainly useful for diagnostic analysis.
It can also be of major interest to estimate the contribution of a given registered speaker to the overall false rejection rate, i.e. the probability that the identity of the speaker was i, given that a (false) rejection diagnostic was made on a genuine speaker (claiming his own identity).
We can thus define the average relative unreliability of speaker i as his share of the overall false rejection rate when all speakers are weighted equally, and his test set relative unreliability as his share of the total number of false rejections observed on the test set. By construction, each of these relative unreliabilities sums to 1 over the registered speakers.
From a different angle, the relative vulnerability of a given registered speaker, i.e. his contribution to the overall false acceptance rate, can be measured in a similar way. Thus, the average relative vulnerability of speaker i, his relative vulnerability assuming distinct impostors, and his test set relative vulnerability can be defined as his share of the false acceptances under the corresponding weighting conventions. Here again, each of these quantities sums to 1 over the registered speakers.
Finally, by considering the relative success of impostor j, i.e. his contribution to the overall imposture rate, we define in a dual way the average imitation ability of impostor j, his imitation ability regardless of the claimed identity, and his test set relative imitation ability. Naturally, each of these quantities sums to 1 over the impostors.
The relative unreliability and vulnerability can also be calculated relative to the male or female population.
As for misclassification rates, the gender-balanced, average and test set false rejection rates, as well as the gender-balanced and average false acceptance rates assuming distinct impostors and the test set false acceptance rate, correspond to different estimates of a global score, under various assumptions on the relative representativity of each genuine test speaker. The earlier discussion of representativity can be readily generalised.
As far as gender-balanced and average false acceptance rates with known impostors are concerned, a relative representativity can be defined for each couple of registered speaker and impostor, with these representativities summing to 1 over all couples; the corresponding global score is then the weighted combination of the couple-by-couple false acceptance rates.
In the case of casual impostors, choosing a selective attempt configuration towards same-sex speakers is equivalent to assuming that the representativity of a cross-sex attempt is zero.
Studies that would help define more precisely the representativity of impostor attempts against registered speakers would greatly increase the relevance of evaluation scores.
The figures below give examples of false acceptance rates, false rejection rates and imposture rates, as well as of unreliability, vulnerability and imitation ability. As for the closed-set identification examples, the number of tests used to design these examples is too small to guarantee any statistical validity.
Figure: Out of 18 genuine attempts, 6 false rejections are observed, hence a test set false rejection rate of 1/3. Nevertheless, 3 false rejections out of 9 trials for one speaker do not have the same impact on the average false rejection rate as 3 false rejections out of 7 trials for another speaker. In fact, whereas a third speaker seems to be the most reliable, the second speaker appears more unreliable than the first on average, as reflected by their relative unreliability scores.
Figure: One out of three impostor trials from a first impostor against a given registered speaker was successful, while none from a second impostor was. But if the identities of the impostors are not known, it can only be measured that, out of the 8 impostor attempts against one of the registered speakers, 2 were successful. As no impostor attempt from the second impostor against another registered speaker was recorded, the average false acceptance rate against that other speaker can only be averaged over 1 impostor. The 3 ways of computing false acceptance rates, namely the average false acceptance rate, the average false acceptance rate assuming distinct impostors and the test set false acceptance rate, provide significantly different scores, as the number of test utterances is not balanced across all possible couples (registered speaker, impostor). In this example, the relative vulnerability scores designate one speaker as the most resistant and another as the most vulnerable.
Figure: Out of 6 trials from a given impostor against a first speaker, 2 turned out to be successful, while out of 6 other trials against a second speaker, 5 led to a (false) acceptance. As no attempts from this impostor against the third speaker were recorded, his average imposture rate is estimated over these two claimed identities only. If we now ignore the actual identities of the violated speakers and summarise globally the success of this impostor, we obtain a rate which turns out to be equal to his average imposture rate, since both claimed identities were attempted the same number of times. The average imposture rate regardless of the claimed identity indicates that the "average" impostor is successful almost 2 times out of 5 in his attempts. All estimates of the relative imitation ability agree that this impostor is much more skilled than the other one, who seems to be quite poor.
From now on, we will denote as a and b the false rejection and false acceptance rates, whichever exact estimates are actually chosen.
If it is possible to get estimates of the prior probabilities of genuine and impostor attempts, and of the costs and benefits attached to correct and erroneous decisions, an expected benefit of operating the system can be derived from the false rejection and false acceptance rates. In particular, when the two types of attempts are equally probable and the two types of errors are equally costly, an equal-risk equal-cost expected benefit is obtained.
The expected benefit is usually a meaningful static evaluation figure for the potential clients of the technology. It must, however, be understood only as the average expected benefit per user attempt. It does not take into account external factors such as the psychological impact of the system, its maintenance costs, etc.
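As a purely illustrative sketch (the linear cost/benefit model, the default priors and the cost values are our assumptions, not the chapter's actual formula), an expected benefit per attempt could be computed as follows:

```python
def expected_benefit(fr, fa, p_genuine=0.5,
                     benefit_correct=1.0, cost_fr=1.0, cost_fa=1.0):
    """Average benefit per attempt under an assumed linear cost/benefit model:
    correct decisions earn `benefit_correct`, errors cost `cost_fr` / `cost_fa`."""
    p_impostor = 1.0 - p_genuine
    return (p_genuine * ((1 - fr) * benefit_correct - fr * cost_fr)
            + p_impostor * ((1 - fa) * benefit_correct - fa * cost_fa))

# Equal-risk (p_genuine = 0.5), equal-cost case, with fr = fa = 5 %:
print(expected_benefit(fr=0.05, fa=0.05))
```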
Speaker verification systems usually proceed in two steps. First, a matching score is computed between the test utterance z and the reference model corresponding to the claimed identity. Then, the value of the matching score is compared to a threshold, and the accept/reject decision is taken accordingly.
In other words, verification is positive only if the match between the test utterance and the reference model (for the claimed identity) is close enough.
A distinction can be made depending on whether each registered speaker has his own individual threshold or whether a single threshold, common to all speakers, is used. In other words, if the threshold depends on the claimed identity i, the system uses speaker-dependent thresholds, whereas if it does not depend on i, the system uses a speaker-independent threshold. We will call threshold vector the set of individual thresholds, and consider the false rejection and false acceptance rates corresponding to a given threshold vector.
The value of each threshold has an inverse impact on the false rejection rate and on the false acceptance rate. Thus, with a low threshold for speaker i, fewer genuine attempts from speaker i will be rejected, but more impostors will be erroneously accepted as i. Conversely, if the threshold is increased, the false acceptance rate will generally decrease, at the expense of an increasing false rejection rate. The goal of dynamic evaluation is to provide a description of the system performance which is as independent as possible of the threshold values.
The setting of thresholds is conditioned on the specification of an operating constraint which expresses the compromise that has to be reached between the two types of errors; among many possibilities, the most popular ones include the equal error rate constraint, which requires the false rejection and false acceptance rates to be equal. In most practical applications, however, the equal error rate does not correspond to an interesting operating constraint.
Two procedures are classically used to set the thresholds : the a priori threshold setting procedure and the a posteriori threshold setting procedure.
When the a priori threshold setting procedure is implemented, the threshold vector is estimated from a set of tuning data, which can be either the training data themselves or a new set of unseen data. Then, the false rejection and acceptance rates are estimated on a disjoint test set. Naturally, there must be no intersection between the tuning data set and the test data set: not only must the speech material of genuine attempts and impostor attempts differ between the two sets, but the bundle of pseudo-impostors used to tune the threshold of a registered speaker should also not contain any of the impostors which will be tested against this very speaker within the test set. Of course, the volume of additional speech data used for threshold setting must be counted as training material when reporting on the training speech quantity.
When the a posteriori threshold setting procedure is adopted, the threshold vector is set on the test data themselves. In this case, the false rejection and acceptance rates must be understood as the performance of the system with ideal thresholds. Though this procedure does not lead to a fair measure of the system performance, it can be interesting, for diagnostic evaluation, to compare the a posteriori scores with the a priori ones.
Whichever operating constraint is chosen to tune the thresholds, it is only one among an infinite number of possible trade-offs, and it is generally not possible to predict, from the false rejection and false acceptance rates obtained at a particular operating point, what the error rates would be at another operating point. In order to be able to estimate the performance of the system under any conditions, its behaviour has to be modelled so that its performance can be characterised independently of any threshold setting.
Speaker-independent threshold
In the case of a speaker-independent threshold, the false rejection and false acceptance rates can both be written as functions of a single parameter, the threshold. A more compact way of summarising the system's behaviour then consists in expressing the false acceptance rate b directly as a function of the false rejection rate a (or the opposite), i.e. b = f(a).
Using a terminology derived from Communication Theory, function f is sometimes called the Receiver Operating Characteristic, and the corresponding curve the ROC curve. Generally, function f is monotonically decreasing and satisfies the limit conditions f(0) = 1 and f(1) = 0, which give the ROC curve its typical shape.
The point-by-point knowledge of function f provides a threshold-independent description of all possible operating conditions of the system. In particular, each operating constraint singles out one point of the curve. Graphically, the corresponding operating point is obtained by sliding, from the origin, a straight line with the appropriate slope until it becomes tangent to the ROC curve; the point of contact then indicates the corresponding false rejection and false acceptance rates.
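The sketch below shows how such an operating point can be located numerically on a sampled ROC curve; the weighted-error criterion is our reading of the tangent construction, and the toy curve is invented:

```python
import numpy as np

def operating_point(fr, fa, cost_fr=1.0, cost_fa=1.0):
    """Pick, among sampled ROC points (fr[k], fa[k]), the one minimising the
    weighted error cost_fr * fr + cost_fa * fa; for a convex ROC curve this is
    the point where a line of slope -cost_fr/cost_fa is tangent to the curve."""
    fr, fa = np.asarray(fr), np.asarray(fa)
    k = np.argmin(cost_fr * fr + cost_fa * fa)
    return fr[k], fa[k]

# Sampled (false rejection, false acceptance) pairs along a toy ROC curve.
fr = np.linspace(0.001, 0.999, 200)
fa = np.clip(0.05 ** 2 / fr, 0.0, 1.0)      # hyperbolic-shaped toy curve, EER = 0.05
print(operating_point(fr, fa, cost_fr=1.0, cost_fa=10.0))
```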
In practice, there are several ROC curves, depending on what type of false rejection and false acceptance scores are used (average, gender-balanced or test set).
Speaker-dependent thresholds
In the case of speaker-dependent thresholds, the false rejection and false acceptance rates of each speaker depend on a different threshold parameter. Therefore, each speaker has his own ROC curve.
In this case, there is no trivial way of deriving an "average" ROC curve that would represent the general behaviour of the system. Current practice consists in characterising each individual ROC curve by its equal error rate, and in summarising the performance of the system by the average equal error rate, computed as the mean of the individual equal error rates over all registered speakers.
Note here that a gender-balanced equal error rate can be defined as the mean of the average equal error rates computed separately over the male and the female registered speakers, and a test set equal error rate as the combination of the individual equal error rates weighted by each speaker's proportion of test utterances.
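A possible sketch, assuming per-speaker genuine and impostor score lists are available; the threshold sweep and the interpolation-free EER approximation are our own choices:

```python
import numpy as np

def speaker_eer(genuine_scores, impostor_scores):
    """Approximate the equal error rate for one speaker by sweeping a threshold
    over all observed scores (higher score = better match, accept if >= threshold)."""
    g = np.asarray(genuine_scores, float)
    i = np.asarray(impostor_scores, float)
    thresholds = np.sort(np.concatenate([g, i]))
    fr = np.array([(g < t).mean() for t in thresholds])   # false rejection rate
    fa = np.array([(i >= t).mean() for t in thresholds])  # false acceptance rate
    k = np.argmin(np.abs(fr - fa))
    return 0.5 * (fr[k] + fa[k])

# Toy scores for 2 male and 1 female registered speakers.
speakers = {
    "m1": ([2.1, 2.4, 1.9, 2.2], [1.2, 0.8, 2.0]),
    "m2": ([1.5, 1.8, 2.0],      [1.6, 0.9, 1.1]),
    "f1": ([2.6, 2.9, 2.7, 2.4], [1.0, 2.5, 1.3]),
}
eer = {s: speaker_eer(g, i) for s, (g, i) in speakers.items()}
average_eer = np.mean(list(eer.values()))
gender_balanced_eer = 0.5 * (np.mean([eer["m1"], eer["m2"]]) + eer["f1"])
print(eer, average_eer, gender_balanced_eer)
```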
Though we use the same terminology to denote equal error rates with speaker-dependent and speaker-independent thresholds, it must be stressed that the scores are not comparable. Therefore, it should always be specified in which framework they are computed.
Equal error rates can be interpreted as a very local property of the ROC curve. In fact, as the ROC curve usually has its concavity turned towards the origin of the axes, the EER gives an idea of how close the ROC curve is to the axes. However, this is a very incomplete picture of the general system performance level, as it is virtually impossible to predict from it the performance of the system under a significantly different operating condition.
Recent work by Oglesby [Oglesby 95] has addressed the question of how to encapsulate the entire system characteristic into a single number. Oglesby's suggestions, which we develop now, consist in finding a simple 1-parameter model which describes the ROC curve as accurately as possible over most of its definition domain. If the approximation is good enough, reasonable error rate estimates can be derived for any operating point. As in the previous section, we first discuss the case of a system with a speaker-independent threshold, and then extend the approach to speaker-dependent thresholds.
For modelling the relation between the false rejection rate a and the false acceptance rate b, the simplest approach is to assume a linear operating characteristic, i.e. a linear relation between a and b governed by a single constant which can be understood as the linear-model EER. However, typical ROC curves do not have a linear shape at all, and this model is too poor to be effective over a large domain.
A second possibility is to assume that the ROC curve has the approximate shape of the positive branch of a hyperbola, which amounts to the relation a · b = E², where E is another constant which can be interpreted as the hyperbolic-model EER. The hyperbolic model is equivalent to a linear model in the log-error domain, and it usually fits the ROC curve much better. However, it has the drawback of not fulfilling the limit conditions, as the modelled false acceptance rate tends to infinity when the false rejection rate tends to 0, and does not reach 0 when the false rejection rate reaches 1.
A third possibility, proposed by Oglesby, is to use a specific one-parameter model whose parameter will be referred to as Oglesby's model EER. Oglesby reports a good fit of the model with experimental data, and underlines the fact that, contrary to the hyperbolic model, the limit conditions are satisfied.
The parametric approach is certainly a very relevant way to give a broader characterisation of a system. Nevertheless, several issues remain open.
First, it is clear that none of the models proposed above accounts for a possible skewness of the ROC curve. As Oglesby notes, addressing skewed characteristics would require introducing an additional variable, which would give rise to a second, non-intuitive figure.
A second question is what criterion should be minimised to fit the model curve to the true ROC curve. If we call optimisation domain the interval over which the best fit is to be found, the most natural criterion would be to minimise the mean square error between the model and the actual curve over this interval. However, an absolute error difference does not have the same meaning when the error rate changes order of magnitude, and an alternative could be to minimise the mean square error between the curves in a log-log representation.
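For instance, for the hyperbolic model (written here as a · b = E², our reading of the "linear model in the log-error domain"), the log-log least-squares fit has a simple closed form; the sketch below is ours:

```python
import numpy as np

def fit_hyperbolic_eer(fr, fa):
    """Fit a * b = E**2 by least squares in the log-log domain:
    log a + log b = 2 log E, so log E is the mean of (log a + log b) / 2."""
    fr, fa = np.asarray(fr, float), np.asarray(fa, float)
    keep = (fr > 0) & (fa > 0)                 # logarithm undefined on the axes
    log_e = 0.5 * np.mean(np.log(fr[keep]) + np.log(fa[keep]))
    return float(np.exp(log_e))

# Toy sampled ROC points (false rejection, false acceptance).
fr = np.array([0.01, 0.02, 0.05, 0.10, 0.20])
fa = np.array([0.30, 0.14, 0.05, 0.026, 0.012])
print(fit_hyperbolic_eer(fr, fa))             # hyperbolic-model EER estimate
```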
A third and most crucial question is how should the unavoidable deviations between the model and the actual ROC curve be quantified and reported.
Here is a possible answer to these questions. Though the approach we are going to present has not been extensively tested so far, we believe it is worth exploring in the near future, as it may prove useful to summarise the performance of a speaker verification system concisely, in a relatively meaningful and exploitable manner.
The solution proposed starts by fixing an accuracy ε for the ROC curve modelling. Two constraints are then defined: one bounding the relative difference between the modelled and the exact false rejection rates, and one bounding the relative difference between the modelled and the exact false acceptance rates. Hence, when both constraints are satisfied, both relative differences between the modelled and exact false rejection and acceptance rates are below ε.
Then, a model of the ROC curve must be chosen, for instance Oglesby's model. If another model fits the curve better, it can be used instead, but it should preferably depend on a single parameter, and the link between the value of this parameter and the model equal error rate should be specified.
For a given value of the model parameter, the lower and upper bounds of the ε-accuracy false rejection rate validity domain are obtained by decreasing (respectively increasing) the false rejection rate, starting from the model EER value, until one of the two constraints is no longer satisfied. This process can be repeated for several values of the parameter, varying for instance in small steps around the initial estimate. Finally, the value of the parameter yielding the widest validity domain can be chosen as the system performance measure, within the validity domain of the approximation. Note that the model EER itself does not need to lie inside the validity domain for its value to be meaningful.
If the validity domain turns out to be too small, the process can be repeated after setting the accuracy to a higher value. Another possibility is to report several model equal error rates, corresponding to several adjacent validity domains (with the same accuracy ε), i.e. a piecewise representation of the ROC curve.
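A sketch of the validity-domain search, using the hyperbolic model as a stand-in for the chosen one-parameter model (the sampling, the outward search strategy and the accuracy value are our own choices):

```python
import numpy as np

def validity_domain(fr, fa, model_eer, eps=0.1):
    """Largest contiguous range of sampled false rejection rates around the EER
    for which both relative modelling errors stay below eps, assuming the
    hyperbolic model fa = model_eer**2 / fr."""
    fr, fa = np.asarray(fr, float), np.asarray(fa, float)
    model_fa = model_eer ** 2 / fr
    model_fr = model_eer ** 2 / fa
    ok = (np.abs(model_fa - fa) <= eps * fa) & (np.abs(model_fr - fr) <= eps * fr)
    start = int(np.argmin(np.abs(fr - model_eer)))   # grow outwards from the EER
    lo = hi = start
    while lo > 0 and ok[lo - 1]:
        lo -= 1
    while hi < len(fr) - 1 and ok[hi + 1]:
        hi += 1
    return (fr[lo], fr[hi]) if ok[start] else None

fr = np.linspace(0.005, 0.6, 400)
fa = 0.05 ** 2 / fr * (1 + 0.05 * np.sin(20 * fr))   # toy "true" ROC with wiggles
print(validity_domain(fr, fa, model_eer=0.05, eps=0.1))
```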
A first advantage of the parametric description is that it makes it possible to predict the behaviour of a speaker verification system over a more or less extended range of operating conditions. It then becomes possible to give clear answers to a potential client of the technology, provided this client is able to specify his constraints.
The second advantage is that the model EER is a number which relates well to the conventional EER. Therefore, the new description would not require the scientific community to change radically its way of apprehending the performance of a speaker verification system. The main drawback of the proposed approach is that it lacks experimental validation for the time being. Therefore, we suggest adopting it as an experimental evaluation methodology, until it has proven effective.
When dealing with a system using speaker-dependent thresholds, we are brought back to the difficulty of averaging ROC curve models across speakers. The ROC curve of each speaker can be summarised by a model equal error rate and an ε-accuracy false acceptance rate validity domain. In the absence of a more relevant solution, we suggest characterising the average system performance by averaging across speakers both the model EERs and the bounds of the validity intervals. The global system performance can thus be given as an average model EER together with an average ε-accuracy false acceptance rate validity domain.
The same approach can be implemented, with different weights, to compute a gender-balanced model EER and a test set model EER, together with the corresponding validity domains.
Another possibility would be to fix a speaker-independent validity domain for each ROC curve, and then to compute the corresponding individual accuracy for each speaker. To obtain a global score, all individual accuracies could then be averaged (using weights depending on the type of estimate), and the performance would be reported as a global model equal error rate together with a false acceptance rate domain common to all speakers, but at an average accuracy.
For example, consider a verification system with a speaker-independent threshold that has a gender-balanced Oglesby's equal error rate of 0.047, together with an ε-accuracy false rejection rate validity domain. Here, the ROC curve under consideration is the gender-balanced one, and for simplicity we will simply write a and b for the gender-balanced false rejection and false acceptance rates.
For any false rejection rate a inside the validity domain, the relative difference between the actual false acceptance rate b and the false acceptance rate predicted by Oglesby's model with parameter 0.047 is at most ε. A corresponding ε-accuracy false acceptance rate validity domain can then be computed, and it is guaranteed that, for any value of b in this interval, the relative difference between the actual false rejection rate a and the false rejection rate predicted by Oglesby's model with EER 0.047 is also at most ε. In particular, the exact (gender-balanced) EER of the system is equal to 0.047, within a relative accuracy of ε.
An open-set identification system can be viewed as a function which assigns, to any test utterance z, either an estimated speaker index corresponding to the identified speaker in the set of registered speakers, or the index 0 if the applicant speaker is considered an impostor.
In open-set identification, three types of error can be distinguished: a genuine speaker can be rejected (false rejection), a genuine speaker can be accepted but identified as another registered speaker (misclassification), and an impostor can be accepted (false acceptance).
Here, two points of view can be adopted.
Either a misclassification error is considered as a false acceptance (while a correct identification is treated as a true acceptance). In this case, open-set identification can be scored in the same way as verification, namely by evaluating a false rejection rate and a false acceptance rate. The concept of ROC curve can be extended to this family of systems, and in particular an equal error rate can be computed. However, the false acceptance rate is now bounded by a value when the threshold tends to 0, this bound being the closed-set misclassification rate of the system, i.e. the error rate that the open-set identification system would provide if it was functioning in a closed-set mode. Therefore, a parametric approach for dynamic evaluation would require a specific class of ROC curve models (with at least 2 parameters). Moreover, merging classification errors with false acceptances may not be appropriate if the two types of errors are not equally harmful.
An alternative solution is to keep the three types of errors distinct, and to measure them by three rates: a false rejection rate, a misclassification rate and a false acceptance rate. The ROC curve is now a curve in a 3-dimensional space, whose two extremities correspond to the most permissive and the most restrictive threshold settings. This curve can be projected onto two planar curves: one expressing the misclassification rate as a function of the false rejection rate, and one expressing the false acceptance rate as a function of the false rejection rate. Both projections are monotonically decreasing and satisfy their own limit conditions. A minimal description of the 3-dimensional curve could then be the equal error rate of the false acceptance projection, together with the closed-set identification score derived from the misclassification projection. Parametric models with 2 degrees of freedom could be thought of, but to our knowledge this remains an unexplored research topic.
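As a sketch of how the three open-set error rates could be tallied from trial-level results (the data layout and field names are ours):

```python
from dataclasses import dataclass

@dataclass
class Trial:
    is_genuine: bool      # True for a registered speaker, False for an impostor
    true_id: int          # registered speaker index (ignored for impostors)
    decision: int         # identified speaker index, or 0 for "rejected as impostor"

def open_set_rates(trials):
    """False rejection, misclassification and false acceptance rates."""
    genuine = [t for t in trials if t.is_genuine]
    impostor = [t for t in trials if not t.is_genuine]
    fr = sum(t.decision == 0 for t in genuine) / len(genuine)
    mc = sum(t.decision not in (0, t.true_id) for t in genuine) / len(genuine)
    fa = sum(t.decision != 0 for t in impostor) / len(impostor)
    return fr, mc, fa

trials = [Trial(True, 1, 1), Trial(True, 2, 0), Trial(True, 3, 1),
          Trial(False, 0, 0), Trial(False, 0, 2)]
print(open_set_rates(trials))
```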
Of the two possibilities, we believe that the second one is to be preferred, though it is slightly more complex.
These recommendations indicate how the performance of a speaker recognition system should be scored.
In practice, gender-balanced, average and test set scores are obtained very easily as various linear combinations of individual speaker scores.