The design and implementation of an experimental evaluation protocol is usually intended to provide an estimate of system performance. Besides overall error rates, other relevant quantities can be derived from the analysis of the test results, for diagnostic purposes.
In this section, we detail what pieces of information can be derived from evaluation results, and how to score them.
We will consider a population of m registered speakers.
For scoring purposes, we will consider a set of test utterances. We will use the term genuine test utterances for those which correspond to registered speakers, and the term impostor test utterances for those which are treated as belonging to impostor speakers. Note that the same speech segment can be used as a genuine utterance and as an impostor utterance, in some test configurations. Therefore, different notations can correspond to the same speech utterance.
Each registered speaker is supposed to have produced a certain number of genuine test utterances; superscript k will be used to index the k-th genuine test utterance of a given speaker.
In the rest of this chapter, we will denote as c the total number of genuine test utterances, and, for each registered speaker, we will consider the proportion of genuine test utterances in the test set which belong to that speaker. We will also make use of the number of registered speakers for which there is at least one genuine test utterance.
Finally, we will denote as M the set of male registered speakers and as F the set of female registered speakers, and we will consider the respective numbers of male and female registered speakers for which there is at least one genuine test utterance.
In the most general case, the whole set of impostor test utterances can be divided into subsets, each corresponding to one among n impostors using the system with a given claimed identity. Each such subset gathers the impostor test utterances produced by impostor j while claiming to be registered speaker i.
Similarly to genuine test utterances, we will denote as d the total number of impostor utterances, and we will consider the proportion of impostor tests carried out by impostor j against registered speaker i.
When the identity of the impostors is not a relevant factor (or is unknown), subscript j can be dropped, and an impostor utterance is simply indexed as the k-th impostor attempt against registered speaker i. Conversely, in open-set speaker identification, impostors do not claim a particular identity; they just try to be identified as one of the registered speakers, whoever this speaker may be. In this case, subscript i can be dropped, and an impostor utterance is indexed as the k-th impostor attempt from impostor j. If, moreover, the impostor's identity does not matter, subscript j can also be dropped, and an impostor utterance is simply the k-th impostor attempt.
We will further consider the following counts: the number of impostors for which there is at least one test utterance against registered speaker i; the number of registered speakers against which there is at least one test utterance from impostor j; the number of impostors from whom there is at least one impostor test utterance; the number of registered speakers against whom there is at least one impostor test utterance; and the total number of couples (registered speaker, impostor) for which there is at least one impostor test utterance (from that impostor against that registered speaker).
Finally, we will also distinguish the set of male impostor speakers from the set of female impostor speakers, together with the corresponding counts.
A closed-set identification system can be viewed as a function which assigns, to any test utterance z, an estimated speaker index corresponding to the identified speaker in the set of registered speakers.
In closed-set identification, all test utterances belong to one of the registered speakers. Therefore, a misclassification error occurs for the k-th test utterance produced by speaker i whenever the estimated speaker index differs from i.
The most natural figure that indicates the performance of a speaker identification system is the relative number of times the system fails to identify an applicant speaker correctly; in other words, how often a test utterance is assigned an erroneous identity. Whereas it is straightforward to calculate a performance figure on a speaker-by-speaker basis, care should be taken when deriving a global score.
With our notation, and assuming that speaker i has at least one genuine test utterance, we define the misclassification rate for speaker i as the proportion of his genuine test utterances for which a misclassification error occurs.
This rate provides an estimate of the probability that the system under test identifies another speaker when speaker i is the actual speaker, whereas its complement to 1 provides an estimate of the corresponding probability of correct identification. However, it is preferable to report error scores rather than success scores, and performance improvements should be measured as relative error rate reductions. If speaker i has no genuine test utterance, his misclassification rate is undefined.
We suggest using the term dependable speaker to qualify a registered speaker with a low misclassification rate, and the term unreliable speaker for a speaker with a high misclassification rate.
From speaker-by-speaker figures, the average misclassification rate can be derived by averaging the individual misclassification rates over all registered speakers for which they are defined, and by computing this average separately over the male and the female registered speakers, the gender-balanced misclassification rate can be obtained as the mean of the two resulting values.
The previous scores are different from the test set misclassification rate, calculated as the overall proportion of misclassified utterances in the test set.
The average and test set misclassification rates are formally identical if and only if the number of genuine test utterances does not depend on the speaker, i.e. when the test set contains an identical number of test utterances for each speaker. As it is usually observed that speaker recognition performance may vary with the speaker's gender, the comparison of the average and gender-balanced rates can show significant differences if the registered population is not gender-balanced. Therefore, we believe that an accurate description of the identification performance requires all 3 numbers (average, gender-balanced and test set misclassification rates) to be provided.
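As an illustration, the short sketch below (in Python, with function names, data layout and gender flags of our own choosing) computes the speaker-by-speaker, average, gender-balanced and test set misclassification rates from a confusion matrix whose entry C[i, j] counts the test utterances of speaker i identified as speaker j.

```python
import numpy as np

def misclassification_rates(conf, is_male):
    """conf[i, j]: number of test utterances of speaker i identified as j.
    is_male[i]: True if registered speaker i is male.
    Returns per-speaker, average, gender-balanced and test set rates."""
    conf = np.asarray(conf, dtype=float)
    n_i = conf.sum(axis=1)                      # genuine utterances per speaker
    errors_i = n_i - np.diag(conf)              # misclassified utterances per speaker
    tested = n_i > 0                            # speakers with at least one test

    per_speaker = np.full(len(conf), np.nan)
    per_speaker[tested] = errors_i[tested] / n_i[tested]

    average = per_speaker[tested].mean()
    male = tested & is_male
    female = tested & ~is_male
    gender_balanced = 0.5 * (per_speaker[male].mean() + per_speaker[female].mean())
    test_set = errors_i.sum() / n_i.sum()
    return per_speaker, average, gender_balanced, test_set

# Toy example (3 male speakers, 1 female speaker), illustrative numbers only.
C = np.array([[4, 1, 0, 0],
              [0, 3, 2, 1],
              [1, 0, 5, 0],
              [0, 0, 1, 3]])
male = np.array([True, True, True, False])
print(misclassification_rates(C, male))
```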
Taking another point of view, performance scores can be designed to measure how reliable the decision of the system is when it has assigned a given identity; in other words, to provide an estimate of the probability that the applicant speaker is not really speaker i when the system under test has output i as the most likely identity.
To define the mistrust rate, we consider, for each registered speaker i, the number and the proportion of test utterances identified as i over the whole test set, as well as the number of registered speakers whose identity was assigned at least once to a test utterance.
The mistrust rate for speaker i can then be computed, whenever at least one test utterance was identified as i, as the proportion of those utterances which were actually produced by another speaker. Here again, if no test utterance was identified as speaker i, his mistrust rate is undefined.
We suggest that the term resistant speaker could be used to qualify a registered speaker with a low mistrust rate, and the term vulnerable speaker for a speaker with a high mistrust rate.
From this speaker-by-speaker score, the average mistrust rate and the gender-balanced mistrust rate can be derived exactly as for misclassification rates. By noticing that every misclassified test utterance is counted both as one misclassification error and as one wrongly assigned identity, it appears that there is no need to define a test set mistrust rate. In other words, the test set mistrust rate is equal to the test set misclassification rate.
From a practical point of view, misclassification rates and mistrust rates can be obtained by the exact same scoring programs, operating successively on the confusion matrix and on its transpose.
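Under the same conventions as the previous sketch (and reusing its confusion matrix C, gender flags and misclassification_rates helper), this amounts to:

```python
# Mistrust rates: same scoring routine, applied to the transposed confusion matrix.
per_speaker_mistrust, avg_mistrust, gb_mistrust, ts_mistrust = \
    misclassification_rates(C.T, male)
# ts_mistrust equals the test set misclassification rate computed above.
```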
Most speaker identification systems use a similarity measure between a test utterance and all training patterns to decide, by a nearest neighbour decision rule, the identity of the test speaker. In this case, for a test utterance x, an ordered list of registered speakers can be produced, in which each speaker is judged closer to the test utterance than the speakers ranked after him.
The identification rank of the genuine speaker of an utterance is then the position at which the correct speaker appears in this ordered list of neighbours of the test utterance. Note that a correct identification corresponds to an identification rank of 1.
Assuming that speaker i has at least one genuine test utterance, the confidence rank for speaker i, at a given confidence level, can be defined as the smallest integer R such that the required proportion of the test utterances belonging to speaker i have their correct speaker among the R nearest neighbours in the ordered list of candidates.
Then, the average confidence rank can be computed as the average of the individual confidence ranks over all registered speakers for which they are defined.
Though a gender-balanced confidence rank could be defined analogously to gender-balanced misclassification and mistrust rates, the relevance of this figure is not clear.
Finally, the test set confidence rank is defined as the smallest integer R such that the required proportion of all genuine test utterances have their correct speaker among the R best candidates.
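A possible implementation, assuming that the identification rank of the true speaker has already been computed for every genuine test utterance (the function and variable names are ours):

```python
import numpy as np

def confidence_rank(ranks, level=0.95):
    """Smallest R such that at least `level` of the given identification
    ranks are <= R (ranks start at 1 for a correct identification)."""
    ranks = np.sort(np.asarray(ranks))
    # index of the order statistic where the cumulative proportion reaches `level`
    idx = int(np.ceil(level * len(ranks))) - 1
    return int(ranks[idx])

# Per-speaker confidence ranks and their average (toy identification ranks).
ranks_by_speaker = {1: [1, 1, 2, 1], 2: [3, 1, 2, 4], 3: [1, 1, 1, 2]}
per_speaker = {i: confidence_rank(r) for i, r in ranks_by_speaker.items()}
average_rank = sum(per_speaker.values()) / len(per_speaker)
# Test set confidence rank: pool all genuine test utterances together.
test_set_rank = confidence_rank([r for rs in ranks_by_speaker.values() for r in rs])
print(per_speaker, average_rank, test_set_rank)
```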
Average scores, gender-balanced scores and test set scores all fall under the same formalism. If we assign to each speaker a certain quantity which we will call his relative representativity, chosen to be non-negative and to sum to 1 over the speakers, each of these global scores can be written as the linear combination of the speaker-by-speaker scores weighted by the relative representativities.
The average, gender-balanced and test set scores therefore correspond to different estimates of a global score, under various assumptions on the relative representativity of each test speaker. For average scores, it is assumed that each speaker is equally representative, irrespective of his or her sex group; if the test population is strongly unbalanced, this hypothesis may not be relevant (unless there is a reason for it). For gender-balanced scores, each test speaker is supposed to be representative of his or her sex group, and each sex group is supposed to be equiprobable. Test set scores make the assumption that each test speaker has a representativity proportional to his or her number of test utterances, which is certainly not always a meaningful hypothesis.
Test set scores can therefore be used as an overall performance measure if the test data accurately represent the profile and behaviour of the user population, i.e. both in terms of population composition and of individual frequency of use. If only the composition of the test set population is representative of the general user population, average scores make it possible to neutralise possible discrepancies in the number of utterances per speaker. If, finally, the composition of the test set speaker population is not representative, gender-balanced scores provide a general-purpose estimate.
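The common formalism can be made explicit as in the sketch below, where the three weight vectors, with names and toy values of our own choosing, simply encode the three representativity assumptions discussed above:

```python
import numpy as np

def weighted_score(per_speaker_scores, weights):
    """Global score as a representativity-weighted combination of speaker scores."""
    w = np.asarray(weights, dtype=float)
    s = np.asarray(per_speaker_scores, dtype=float)
    return float(np.nansum(w / w.sum() * s))

# Toy per-speaker error rates, genuine-utterance counts and genders (illustrative).
scores  = np.array([0.20, 0.50, 0.17, 0.25])
n_utts  = np.array([5, 6, 6, 4])
is_male = np.array([True, True, True, False])

equal_w    = np.ones_like(scores)                          # average score
gender_w   = np.where(is_male, 0.5 / is_male.sum(), 0.5 / (~is_male).sum())
test_set_w = n_utts.astype(float)                          # proportional to usage

print(weighted_score(scores, equal_w),
      weighted_score(scores, gender_w),
      weighted_score(scores, test_set_w))
```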
If there is a way to estimate separately the relative representativity of each registered test speaker, a representative misclassification rate can be computed as the corresponding weighted combination. Conversely, techniques such as those used in public opinion polls can be resorted to in order to select a representative test population when setting up an evaluation experiment.
The figures below give examples of misclassification rates, mistrust rates and confidence ranks. However, it must be kept in mind that the number of tests used to build these examples is too small to guarantee any statistical validity of the figures.
Figure: An example illustrating how to score misclassification rates and mistrust rates from a confusion matrix, in speaker identification; the example exhibits a dependable speaker, an unreliable speaker, several resistant speakers, and one speaker, already singled out, who would seem to be vulnerable.
Figure: An example illustrating how to score confidence ranks in closed-set speaker identification. Only for speaker i = 2 is the right speaker in first position less than half of the time. On the test set, the list of the 3 best candidates would have to be kept in order to be sure that the right speaker is in the list 95 % of the time.
A verification system can be viewed as a function which assigns, to a test utterance z and a claimed identity, a boolean value which is equal to 1 if the utterance is accepted, and 0 if it is rejected.
Two types of errors can then occur: either a genuine speaker is rejected, or an impostor is accepted. The first case corresponds to a false rejection and the second to a false acceptance.
In the rest of this section, we will reason in terms of the corresponding events: acceptance or rejection of the attempt on the one hand, genuine or impostor attempt on the other hand.
We first address aspects of static evaluation, that is, what meaningful figures can be computed to measure the performance of a system over which the experimenter has absolutely no control. Then, after discussing the role of decision thresholds, we review several approaches that make it possible to obtain a dynamic evaluation of the system, i.e. in a relatively threshold-independent manner.
If speaker i has at least one genuine test utterance, the false rejection rate for speaker i is defined quite naturally as the proportion of his genuine attempts that are rejected. This rate provides an estimate of the probability that the system makes a diagnostic of rejection, given that the applicant speaker is the authorised speaker i (claiming his own identity). If speaker i has no genuine test utterance, his false rejection rate is undefined.
As for closed-set identification, the terms dependable speaker and unreliable speaker can be used to qualify speakers with a low or (respectively) high false rejection rate.
From speaker-based figures, the average false rejection rate can be obtained by averaging the speaker-by-speaker rates over all registered speakers for which they are defined, while the gender-balanced false rejection rate is the mean of the averages computed separately over the male and the female registered speakers. The test set false rejection rate is calculated as the overall proportion of rejected genuine attempts in the test set.
The average, gender-balanced and test set rates provide three different estimates of the false rejection probability. The test set rate is influenced by the distribution of genuine attempts in the test set, which may be purely artefactual. The issue of the representativity of a test speaker, raised in the previous section, is also relevant here, and for false acceptance rates as well.
As opposed to false rejection, there are several ways to score false acceptance, depending on whether it is the vulnerability of the registered speakers or the skill of the impostors which is considered. Moreover, the way to evaluate false acceptance rates and imposture rates depends on whether the identity of each impostor is known or not.
Known impostors
If the impostor identities are known, the false acceptance rate in favour of impostor j against registered speaker i can be defined, whenever impostor j has made at least one attempt against speaker i, as the proportion of those attempts that are accepted. This rate can be viewed as an estimate of the probability that the system makes a diagnostic of acceptance, given that the applicant speaker is impostor j claiming identity i.
Then, the average false acceptance rate against speaker i can be obtained, provided at least one impostor attempt against him was made, by averaging the false acceptance rates over all impostors for which they are defined; similarly, the average imposture rate in favour of impostor j can be calculated, provided he has made at least one attempt, by averaging over the identities he claimed.
These two rates provide estimates, respectively, of the probability of acceptance of an impostor attempt against speaker i and of the probability of acceptance of an attempt by impostor j, under the assumption that all impostors and all claimed identities are equiprobable. The first number indicates the false acceptance rate obtained on average by an impostor claiming identity i, while the second indicates the success rate of impostor j averaged over the identities he claimed. A registered speaker can be more or less resistant (low average false acceptance rate against him) or vulnerable (high average false acceptance rate against him), whereas impostors with a high average imposture rate can be viewed as skilled impostors, as opposed to poor impostors for those with a low one.
The average false acceptance rate, which is equal to the average imposture rate, is obtained as the average of the false acceptances over all couples (registered speaker, impostor); it provides an estimate of the false acceptance probability under the assumption that all such couples are equally likely.
Here, separate estimates of the average false acceptance rate on the male and the female registered populations can be obtained by averaging over the corresponding couples, and a gender-balanced false acceptance rate is provided by the mean of these two values.
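A sketch of these computations, under the assumption that the raw counts are available as two matrices indexed by (registered speaker, impostor), one holding the numbers of accepted impostor attempts and one the numbers of attempts (all names, and the gender encoding, are illustrative):

```python
import numpy as np

def false_acceptance_scores(accepted, trials, spk_is_male):
    """accepted[i, j], trials[i, j]: accepted / total attempts of impostor j
    against registered speaker i. Returns the main aggregate rates."""
    accepted = np.asarray(accepted, float)
    trials = np.asarray(trials, float)
    with np.errstate(invalid="ignore", divide="ignore"):
        fa_ij = np.where(trials > 0, accepted / trials, np.nan)

    fa_per_speaker = np.nanmean(fa_ij, axis=1)       # average FA against each speaker
    imposture_per_imp = np.nanmean(fa_ij, axis=0)    # average imposture rate per impostor
    average_fa = np.nanmean(fa_ij)                   # over all tested couples
    gb_fa = 0.5 * (np.nanmean(fa_ij[spk_is_male]) + np.nanmean(fa_ij[~spk_is_male]))
    test_set_fa = accepted.sum() / trials.sum()      # regardless of identities
    return fa_per_speaker, imposture_per_imp, average_fa, gb_fa, test_set_fa

# Toy example: 3 registered speakers (2 male, 1 female), 2 impostors.
acc = np.array([[1, 0], [0, 0], [2, 1]])
tri = np.array([[3, 2], [4, 0], [3, 2]])
print(false_acceptance_scores(acc, tri, np.array([True, True, False])))
```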
The question could be raised of whether it is desirable to compute a score which would provide an estimate of the false acceptance rate for a gender-balanced impostor population. We propose not to go that far, as it would clearly lead to a duplication of scoring figures, but the influence of the impostors' gender could be partly neutralised by the experimental design.
It may also be interesting to calculate imposture rates regardless of the claimed identities. In this case, we define the imposture rate in favour of impostor j regardless of the claimed identity as the proportion of all his attempts that are accepted, and the average imposture rate regardless of the claimed identity as the average of this quantity over all impostors.
However, none of the rates defined above that involve individual impostor identities can be evaluated when the identities of the impostors are not known. In this case, false acceptance rates and imposture rates can be calculated under the assumption that all impostor test utterances are produced by distinct impostors.
Unknown impostors
The false acceptance rate against speaker i assuming distinct impostors can be obtained, provided there is at least one impostor attempt against him, as the proportion of the impostor attempts against speaker i that are accepted, and the average false acceptance rate assuming distinct impostors is defined as the average of these rates over the registered speakers for which they are defined.
Here again, separate estimates of the average false acceptance rate assuming distinct impostors can be obtained on the male and the female registered populations, the gender-balanced false acceptance rate assuming distinct impostors being the mean of these two values.
The speaker-by-speaker rate provides a speaker-dependent estimate of the false acceptance probability, assuming distinct impostors. The average rate can be viewed as an estimate of this probability under the assumptions that impostors are distinct and that all claimed identities are equally likely, while the gender-balanced rate can be understood as another estimate under the assumptions that impostors are distinct, that attempts against male speakers and against female speakers are equiprobable, and that within a gender class all claimed identities are equally likely.
Test set scores
If, finally, false acceptances are scored globally, regardless both of the impostor identity and of the claimed identity, we obtain the test set false acceptance rate, which is identical to the test set imposture rate. This score provides a test set estimate of the false acceptance probability which is biased by the composition of the registered population and by a possible unevenness of the number of impostor trials for each speaker.
Summary
For scoring false acceptance rates, we believe that, besides the test set false acceptance rate, it is necessary to report the average and gender-balanced false acceptance rates (when impostors are known) or the average and gender-balanced false acceptance rates assuming distinct impostors (when they are not known), as the test set score may be significantly influenced by the test data distribution. The other scores described in this section are mainly useful for diagnostic analysis.
It can also be of major interest to estimate the contribution of a given registered speaker to the overall false rejection rate, i.e. the probability that the identity of the speaker was i, given that a (false) rejection diagnostic was made on a genuine speaker (claiming his own identity).
We can thus define the average relative unreliability of speaker i as his share of the overall false rejection rate when all speakers are weighted equally, and his test set relative unreliability as his share of the total number of false rejections observed on the test set. By construction, each of these relative unreliabilities sums to 1 over the registered speakers.
From a different angle, the relative vulnerability of a given registered speaker, i.e. his contribution to the overall false acceptance rate, can be measured in a similar way. Thus, the average relative vulnerability of speaker i, his relative vulnerability assuming distinct impostors, and his test set relative vulnerability can be defined as his share of the false acceptances under the corresponding weighting conventions. Here again, each of these quantities sums to 1 over the registered speakers.
Finally, by considering the relative success of impostor j, i.e. his contribution to the overall imposture rate, we define in a dual way the average imitation ability of impostor j, his imitation ability regardless of the claimed identity, and his test set relative imitation ability. Naturally, each of these quantities sums to 1 over the impostors.
The relative unreliability and vulnerability can also be calculated relative to the male or female population.
As for misclassification rates, the gender-balanced, average and test set false rejection rates, as well as the gender-balanced and average false acceptance rates assuming distinct impostors and the test set false acceptance rate, correspond to different estimates of a global score, under various assumptions on the relative representativity of each genuine test speaker. The earlier discussion of representativity can be readily generalised.
As far as gender-balanced and average false acceptance rates with known impostors are concerned, a relative representativity can be defined for each couple of registered speaker and impostor, with these representativities summing to 1 over all couples; the corresponding global score is then the weighted combination of the couple-by-couple false acceptance rates.
In the case of casual impostors, choosing a selective attempt configuration towards same-sex speakers is equivalent to assuming that the representativity of a cross-sex attempt is zero.
Studies that would help define more precisely the representativity of impostor attempts against registered speakers would greatly increase the relevance of evaluation scores.
The figures below give examples of false acceptance rates, false rejection rates and imposture rates, as well as of unreliability, vulnerability and imitation ability. As for the closed-set identification examples, the number of tests used to design these examples is too small to guarantee any statistical validity.
Figure: Out of 18 genuine attempts, 6 false rejections are observed, hence a test set false rejection rate of 1/3. Nevertheless, 3 false rejections out of 9 trials for one speaker do not have the same impact on the average false rejection rate as 3 false rejections out of 7 trials for another speaker. In fact, whereas a third speaker seems to be the most reliable, the second speaker appears more unreliable than the first on average, as reflected by their relative unreliability scores.
Figure: One out of three impostor trials from a first impostor against a given registered speaker was successful, while none from a second impostor was. But if the identities of the impostors are not known, it can only be measured that, out of the 8 impostor attempts against one of the registered speakers, 2 were successful. As no impostor attempt from the second impostor against another registered speaker was recorded, the average false acceptance rate against that other speaker can only be averaged over 1 impostor. The 3 ways of computing false acceptance rates, namely the average false acceptance rate, the average false acceptance rate assuming distinct impostors and the test set false acceptance rate, provide significantly different scores, as the number of test utterances is not balanced across all possible couples (registered speaker, impostor). In this example, the relative vulnerability scores designate one speaker as the most resistant and another as the most vulnerable.
Figure: Out of 6 trials from a given impostor against a first speaker, 2 turned out to be successful, while out of 6 other trials against a second speaker, 5 led to a (false) acceptance. As no attempts from this impostor against the third speaker were recorded, his average imposture rate is estimated over these two claimed identities only. If we now ignore the actual identities of the violated speakers and summarise globally the success of this impostor, we obtain a rate which turns out to be equal to his average imposture rate, since both claimed identities were attempted the same number of times. The average imposture rate regardless of the claimed identity indicates that the "average" impostor is successful almost 2 times out of 5 in his attempts. All estimates of the relative imitation ability agree that this impostor is much more skilled than the other one, who seems to be quite poor.
From now on, we will denote as a and b the false rejection and false acceptance rates, whichever exact estimates are actually chosen.
If it is possible to get estimates of the prior probabilities of genuine and impostor attempts, and of the costs and benefits attached to correct and erroneous decisions, an expected benefit of operating the system can be derived from the false rejection and false acceptance rates. In particular, when the two types of attempts are equally probable and the two types of errors are equally costly, an equal-risk equal-cost expected benefit is obtained.
The expected benefit is usually a meaningful static evaluation figure for the potential clients of the technology. It must, however, be understood only as the average expected benefit per user attempt. It does not take into account external factors such as the psychological impact of the system, its maintenance costs, etc.
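As a purely illustrative sketch (the linear cost/benefit model, the default priors and the cost values are our assumptions, not the chapter's actual formula), an expected benefit per attempt could be computed as follows:

```python
def expected_benefit(fr, fa, p_genuine=0.5,
                     benefit_correct=1.0, cost_fr=1.0, cost_fa=1.0):
    """Average benefit per attempt under an assumed linear cost/benefit model:
    correct decisions earn `benefit_correct`, errors cost `cost_fr` / `cost_fa`."""
    p_impostor = 1.0 - p_genuine
    return (p_genuine * ((1 - fr) * benefit_correct - fr * cost_fr)
            + p_impostor * ((1 - fa) * benefit_correct - fa * cost_fa))

# Equal-risk (p_genuine = 0.5), equal-cost case, with fr = fa = 5 %:
print(expected_benefit(fr=0.05, fa=0.05))
```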
Speaker verification systems usually proceed in two steps. First, a matching score is computed between the test utterance z and the reference model corresponding to the claimed identity. Then, the value of the matching score is compared to a threshold, and the accept/reject decision is taken accordingly.
In other words, verification is positive only if the match between the test utterance and the reference model (for the claimed identity) is close enough.
A distinction can be made depending on whether each registered speaker has his own individual threshold or whether a single threshold, common to all speakers, is used. In other words, if the threshold depends on the claimed identity i, the system uses speaker-dependent thresholds, whereas if it does not depend on i, the system uses a speaker-independent threshold. We will call threshold vector the set of individual thresholds, and consider the false rejection and false acceptance rates corresponding to a given threshold vector.
The value of each threshold has an inverse impact on the false rejection rate and on the false acceptance rate. Thus, with a low threshold for speaker i, fewer genuine attempts from speaker i will be rejected, but more impostors will be erroneously accepted as i. Conversely, if the threshold is increased, the false acceptance rate will generally decrease, at the expense of an increasing false rejection rate. The goal of dynamic evaluation is to provide a description of the system performance which is as independent as possible of the threshold values.
The setting of thresholds is conditioned on the specification of an operating constraint which expresses the compromise that has to be reached between the two types of errors; among many possibilities, the most popular ones include the equal error rate constraint, which requires the false rejection and false acceptance rates to be equal. In most practical applications, however, the equal error rate does not correspond to an interesting operating constraint.
Two procedures are classically used to set the thresholds : the a priori threshold setting procedure and the a posteriori threshold setting procedure.
When the a priori threshold setting procedure is implemented, the threshold vector is estimated from a set of tuning data, which can be either the training data themselves or a new set of unseen data. Then, the false rejection and acceptance rates are estimated on a disjoint test set. Naturally, there must be no intersection between the tuning data set and the test data set: not only must the speech material of genuine attempts and impostor attempts differ between the two sets, but the bundle of pseudo-impostors used to tune the threshold of a registered speaker should also not contain any of the impostors which will be tested against this very speaker within the test set. Of course, the volume of additional speech data used for threshold setting must be counted as training material when reporting on the training speech quantity.
When the a posteriori threshold setting procedure is adopted, the threshold vector is set on the test data themselves. In this case, the false rejection and acceptance rates must be understood as the performance of the system with ideal thresholds. Though this procedure does not lead to a fair measure of the system performance, it can be interesting, for diagnostic evaluation, to compare the a posteriori scores with the a priori ones.
Whichever operating constraint is chosen to tune the thresholds, it is only one among an infinite number of possible trade-offs, and it is generally not possible to predict, from the false rejection and false acceptance rates obtained at a particular operating point, what the error rates would be at another operating point. In order to be able to estimate the performance of the system under any conditions, its behaviour has to be modelled so that its performance can be characterised independently of any threshold setting.
Speaker-independent threshold
In the case of a speaker-independent threshold, the false rejection and false acceptance rates can both be written as functions of a single parameter, the threshold. A more compact way of summarising the system's behaviour then consists in expressing the false acceptance rate b directly as a function of the false rejection rate a (or the opposite), i.e. b = f(a).
Using a terminology derived from Communication Theory, function f is sometimes called the Receiver Operating Characteristic, and the corresponding curve the ROC curve. Generally, function f is monotonically decreasing and satisfies the limit conditions f(0) = 1 and f(1) = 0, which give the ROC curve its typical shape.
The point-by-point knowledge of function f provides a threshold-independent description of all possible operating conditions of the system. In particular, each operating constraint singles out one point of the curve. Graphically, the corresponding operating point is obtained by sliding, from the origin, a straight line with the appropriate slope until it becomes tangent to the ROC curve; the point of contact then indicates the corresponding false rejection and false acceptance rates.
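The sketch below shows how such an operating point can be located numerically on a sampled ROC curve; the weighted-error criterion is our reading of the tangent construction, and the toy curve is invented:

```python
import numpy as np

def operating_point(fr, fa, cost_fr=1.0, cost_fa=1.0):
    """Pick, among sampled ROC points (fr[k], fa[k]), the one minimising the
    weighted error cost_fr * fr + cost_fa * fa; for a convex ROC curve this is
    the point where a line of slope -cost_fr/cost_fa is tangent to the curve."""
    fr, fa = np.asarray(fr), np.asarray(fa)
    k = np.argmin(cost_fr * fr + cost_fa * fa)
    return fr[k], fa[k]

# Sampled (false rejection, false acceptance) pairs along a toy ROC curve.
fr = np.linspace(0.001, 0.999, 200)
fa = np.clip(0.05 ** 2 / fr, 0.0, 1.0)      # hyperbolic-shaped toy curve, EER = 0.05
print(operating_point(fr, fa, cost_fr=1.0, cost_fa=10.0))
```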
In practice, there are several ROC curves, depending on what type of false rejection and false acceptance scores are used (average, gender-balanced or test set).
Speaker-dependent thresholds
In the case of speaker-dependent thresholds, the false rejection and false acceptance rates of each speaker depend on a different threshold parameter. Therefore, each speaker has his own ROC curve.
In this case, there is no trivial way of deriving an "average" ROC curve that would represent the general behaviour of the system. Current practice consists in characterising each individual ROC curve by its equal error rate, and in summarising the performance of the system by the average equal error rate, computed as the mean of the individual equal error rates over all registered speakers.
Note here that a gender-balanced equal error rate can be defined as the mean of the average equal error rates computed separately over the male and the female registered speakers, and a test set equal error rate as the combination of the individual equal error rates weighted by each speaker's proportion of test utterances.
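A possible sketch, assuming per-speaker genuine and impostor score lists are available; the threshold sweep and the interpolation-free EER approximation are our own choices:

```python
import numpy as np

def speaker_eer(genuine_scores, impostor_scores):
    """Approximate the equal error rate for one speaker by sweeping a threshold
    over all observed scores (higher score = better match, accept if >= threshold)."""
    g = np.asarray(genuine_scores, float)
    i = np.asarray(impostor_scores, float)
    thresholds = np.sort(np.concatenate([g, i]))
    fr = np.array([(g < t).mean() for t in thresholds])   # false rejection rate
    fa = np.array([(i >= t).mean() for t in thresholds])  # false acceptance rate
    k = np.argmin(np.abs(fr - fa))
    return 0.5 * (fr[k] + fa[k])

# Toy scores for 2 male and 1 female registered speakers.
speakers = {
    "m1": ([2.1, 2.4, 1.9, 2.2], [1.2, 0.8, 2.0]),
    "m2": ([1.5, 1.8, 2.0],      [1.6, 0.9, 1.1]),
    "f1": ([2.6, 2.9, 2.7, 2.4], [1.0, 2.5, 1.3]),
}
eer = {s: speaker_eer(g, i) for s, (g, i) in speakers.items()}
average_eer = np.mean(list(eer.values()))
gender_balanced_eer = 0.5 * (np.mean([eer["m1"], eer["m2"]]) + eer["f1"])
print(eer, average_eer, gender_balanced_eer)
```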
Though we use the same terminology to denote equal error rates with speaker-dependent and speaker-independent thresholds, it must be stressed that the scores are not comparable. Therefore, it should always be specified in which framework they are computed.
Equal error rates can be interpreted as a very local property of the ROC curve. In fact, as the ROC curve usually has its concavity turned towards the origin of the axes, the EER gives an idea of how close the ROC curve is to the axes. However, this is a very incomplete picture of the general system performance level, as it is virtually impossible to predict from it the performance of the system under a significantly different operating condition.
Recent work by Oglesby [Oglesby 95] has addressed the question of how to encapsulate the entire system characteristic into a single number. Oglesby's suggestions, which we develop now, consist in finding a simple 1-parameter model which describes the ROC curve as accurately as possible over most of its definition domain. If the approximation is good enough, reasonable error rate estimates can be derived for any operating point. As in the previous section, we first discuss the case of a system with a speaker-independent threshold, and then extend the approach to speaker-dependent thresholds.
For modelling the relation between the false rejection rate a and the false acceptance rate b, the simplest approach is to assume a linear operating characteristic, i.e. a linear relation between a and b governed by a single constant which can be understood as the linear-model EER. However, typical ROC curves do not have a linear shape at all, and this model is too poor to be effective over a large domain.
A second possibility is to assume that the ROC curve has the approximate shape of the positive branch of a hyperbola, which amounts to the relation a · b = E², where E is another constant which can be interpreted as the hyperbolic-model EER. The hyperbolic model is equivalent to a linear model in the log-error domain, and it usually fits the ROC curve much better. However, it has the drawback of not fulfilling the limit conditions, as the modelled false acceptance rate tends to infinity when the false rejection rate tends to 0, and does not reach 0 when the false rejection rate reaches 1.
A third possibility, proposed by Oglesby, is to use a specific one-parameter model whose parameter will be referred to as Oglesby's model EER. Oglesby reports a good fit of the model with experimental data, and underlines the fact that, contrary to the hyperbolic model, the limit conditions are satisfied.
The parametric approach is certainly a very relevant way to give a broader characterisation of a system. Nevertheless, several issues remain open.
First, it is clear that none of the models proposed above accounts for a possible skewness of the ROC curve. As Oglesby notes, addressing skewed characteristics would require introducing an additional variable, which would give rise to a second, non-intuitive figure.
A second question is what criterion should be minimised to fit the model curve to the true ROC curve. If we call optimisation domain the interval over which the best fit is to be found, the most natural criterion would be to minimise the mean square error between the model and the actual curve over this interval. However, an absolute error difference does not have the same meaning when the error rate changes order of magnitude, and an alternative could be to minimise the mean square error between the curves in a log-log representation.
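For instance, for the hyperbolic model (written here as a · b = E², our reading of the "linear model in the log-error domain"), the log-log least-squares fit has a simple closed form; the sketch below is ours:

```python
import numpy as np

def fit_hyperbolic_eer(fr, fa):
    """Fit a * b = E**2 by least squares in the log-log domain:
    log a + log b = 2 log E, so log E is the mean of (log a + log b) / 2."""
    fr, fa = np.asarray(fr, float), np.asarray(fa, float)
    keep = (fr > 0) & (fa > 0)                 # logarithm undefined on the axes
    log_e = 0.5 * np.mean(np.log(fr[keep]) + np.log(fa[keep]))
    return float(np.exp(log_e))

# Toy sampled ROC points (false rejection, false acceptance).
fr = np.array([0.01, 0.02, 0.05, 0.10, 0.20])
fa = np.array([0.30, 0.14, 0.05, 0.026, 0.012])
print(fit_hyperbolic_eer(fr, fa))             # hyperbolic-model EER estimate
```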
A third and most crucial question is how should the unavoidable deviations between the model and the actual ROC curve be quantified and reported.
Here is a possible answer to these questions. Though the approach we are going to present has not been extensively tested so far, we believe it is worth exploring in the near future, as it may prove useful to summarise the performance of a speaker verification system concisely, in a relatively meaningful and exploitable manner.
The solution proposed starts by fixing an accuracy ε for the ROC curve modelling. Two constraints are then defined: one bounding the relative difference between the modelled and the exact false rejection rates, and one bounding the relative difference between the modelled and the exact false acceptance rates. Hence, when both constraints are satisfied, both relative differences between the modelled and exact false rejection and acceptance rates are below ε.
Then, a model of the ROC curve must be chosen, for instance Oglesby's model. If another model fits the curve better, it can be used instead, but it should preferably depend on a single parameter, and the link between the value of this parameter and the model equal error rate should be specified.
For a given value of the model parameter, the lower and upper bounds of the ε-accuracy false rejection rate validity domain are obtained by decreasing (respectively increasing) the false rejection rate, starting from the model EER value, until one of the two constraints is no longer satisfied. This process can be repeated for several values of the parameter, varying for instance in small steps around the initial estimate. Finally, the value of the parameter yielding the widest validity domain can be chosen as the system performance measure, within the validity domain of the approximation. Note that the model EER itself does not need to lie inside the validity domain for its value to be meaningful.
If the validity domain turns out to be too small, the process can be repeated after setting the accuracy to a higher value. Another possibility is to report several model equal error rates, corresponding to several adjacent validity domains (with the same accuracy ε), i.e. a piecewise representation of the ROC curve.
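A sketch of the validity-domain search, using the hyperbolic model as a stand-in for the chosen one-parameter model (the sampling, the outward search strategy and the accuracy value are our own choices):

```python
import numpy as np

def validity_domain(fr, fa, model_eer, eps=0.1):
    """Largest contiguous range of sampled false rejection rates around the EER
    for which both relative modelling errors stay below eps, assuming the
    hyperbolic model fa = model_eer**2 / fr."""
    fr, fa = np.asarray(fr, float), np.asarray(fa, float)
    model_fa = model_eer ** 2 / fr
    model_fr = model_eer ** 2 / fa
    ok = (np.abs(model_fa - fa) <= eps * fa) & (np.abs(model_fr - fr) <= eps * fr)
    start = int(np.argmin(np.abs(fr - model_eer)))   # grow outwards from the EER
    lo = hi = start
    while lo > 0 and ok[lo - 1]:
        lo -= 1
    while hi < len(fr) - 1 and ok[hi + 1]:
        hi += 1
    return (fr[lo], fr[hi]) if ok[start] else None

fr = np.linspace(0.005, 0.6, 400)
fa = 0.05 ** 2 / fr * (1 + 0.05 * np.sin(20 * fr))   # toy "true" ROC with wiggles
print(validity_domain(fr, fa, model_eer=0.05, eps=0.1))
```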
A first advantage of the parametric description is that it makes it possible to predict the behaviour of a speaker verification system over a more or less extended range of operating conditions. It then becomes possible to give clear answers to a potential client of the technology, provided this client is able to specify his constraints.
The second advantage is that the model EER is a number which relates well to the conventional EER. Therefore, the new description would not require the scientific community to change radically its way of apprehending the performance of a speaker verification system. The main drawback of the proposed approach is that it lacks experimental validation for the time being. Therefore, we suggest adopting it as an experimental evaluation methodology, until it has proven effective.
When dealing with a system using speaker-dependent thresholds, we are brought back to the difficulty of averaging ROC curve models across speakers. The ROC curve of each speaker can be summarised by a model equal error rate and an ε-accuracy false acceptance rate validity domain. In the absence of a more relevant solution, we suggest characterising the average system performance by averaging across speakers both the model EERs and the bounds of the validity intervals. The global system performance can thus be given as an average model EER together with an average ε-accuracy false acceptance rate validity domain.
The same approach can be implemented, with different weights, to compute a gender-balanced model EER and a test set model EER, together with the corresponding validity domains.
Another possibility would be to fix a speaker-independent validity domain for each ROC curve, and then to compute the corresponding individual accuracy for each speaker. To obtain a global score, all individual accuracies could then be averaged (using weights depending on the type of estimate), and the performance would be reported as a global model equal error rate together with a false acceptance rate domain common to all speakers, but at an average accuracy.
For example, consider a verification system with a speaker-independent threshold that has a gender-balanced Oglesby's equal error rate of 0.047, together with an ε-accuracy false rejection rate validity domain. Here, the ROC curve under consideration is the gender-balanced one, and for simplicity we will simply write a and b for the gender-balanced false rejection and false acceptance rates.
For any false rejection rate a inside the validity domain, the relative difference between the actual false acceptance rate b and the false acceptance rate predicted by Oglesby's model with parameter 0.047 is at most ε. A corresponding ε-accuracy false acceptance rate validity domain can then be computed, and it is guaranteed that, for any value of b in this interval, the relative difference between the actual false rejection rate a and the false rejection rate predicted by Oglesby's model with EER 0.047 is also at most ε. In particular, the exact (gender-balanced) EER of the system is equal to 0.047, within a relative accuracy of ε.
An open-set identification system can be viewed as a function which assigns, to any test utterance z, either an estimated speaker index corresponding to the identified speaker in the set of registered speakers, or the index 0 if the applicant speaker is considered an impostor.
In open-set identification, three types of error can be distinguished: a genuine speaker can be rejected (false rejection), a genuine speaker can be accepted but identified as another registered speaker (misclassification), and an impostor can be accepted (false acceptance).
Here, two points of view can be adopted.
Either a misclassification error is considered as a false acceptance (while a correct identification is treated as a true acceptance). In this case, open-set identification can be scored in the same way as verification, namely by evaluating a false rejection rate and a false acceptance rate. The concept of ROC curve can be extended to this family of systems, and in particular an equal error rate can be computed. However, the false acceptance rate is now bounded by a value when the threshold tends to 0, this bound being the closed-set misclassification rate of the system, i.e. the error rate that the open-set identification system would provide if it was functioning in a closed-set mode. Therefore, a parametric approach for dynamic evaluation would require a specific class of ROC curve models (with at least 2 parameters). Moreover, merging classification errors with false acceptances may not be appropriate if the two types of errors are not equally harmful.
An alternative solution is to keep the three types of errors distinct, and to measure them by three rates: a false rejection rate, a misclassification rate and a false acceptance rate. The ROC curve is now a curve in a 3-dimensional space, whose two extremities correspond to the most permissive and the most restrictive threshold settings. This curve can be projected onto two planar curves: one expressing the misclassification rate as a function of the false rejection rate, and one expressing the false acceptance rate as a function of the false rejection rate. Both projections are monotonically decreasing and satisfy their own limit conditions. A minimal description of the 3-dimensional curve could then be the equal error rate of the false acceptance projection, together with the closed-set identification score derived from the misclassification projection. Parametric models with 2 degrees of freedom could be thought of, but to our knowledge this remains an unexplored research topic.
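As a sketch of how the three open-set error rates could be tallied from trial-level results (the data layout and field names are ours):

```python
from dataclasses import dataclass

@dataclass
class Trial:
    is_genuine: bool      # True for a registered speaker, False for an impostor
    true_id: int          # registered speaker index (ignored for impostors)
    decision: int         # identified speaker index, or 0 for "rejected as impostor"

def open_set_rates(trials):
    """False rejection, misclassification and false acceptance rates."""
    genuine = [t for t in trials if t.is_genuine]
    impostor = [t for t in trials if not t.is_genuine]
    fr = sum(t.decision == 0 for t in genuine) / len(genuine)
    mc = sum(t.decision not in (0, t.true_id) for t in genuine) / len(genuine)
    fa = sum(t.decision != 0 for t in impostor) / len(impostor)
    return fr, mc, fa

trials = [Trial(True, 1, 1), Trial(True, 2, 0), Trial(True, 3, 1),
          Trial(False, 0, 0), Trial(False, 0, 2)]
print(open_set_rates(trials))
```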
Of the two possibilities, we believe that the second one is to be preferred, though it is slightly more complex.
These recommendations indicate how the performance of a speaker recognition system should be scored.
In practice, gender-balanced, average and test set scores are obtained very easily as various linear combinations of individual speaker scores.