
Complementary assessment tools

 

The previous paragraphs focused on the performance evaluation of algorithms, with emphasis on scoring procedures. In many cases, the scores are meaningless unless they are compared with the scores of other systems on the same database: a reference is necessary to give a measure its full value. This is particularly true when the measure is based on training and test data that may not be publicly available, since in that case the test cannot be reproduced independently. Databases may be designed to be representative of a particular application, or may be more generic. The sampling of speakers may be biased towards similar voices, similar accents, and so on; in any case, the similarities in the voice characteristics of the speakers should be quantified. A reference system is a tool for evaluating the complexity of a database.

This section proposes techniques for evaluating databases, providing reference measures for comparison, and testing the limits of speaker verification systems.

Standard reference systems

 

A complementary approach to system evaluation consists in defining a reference verification system against which other systems are compared. For any new system embedded in a given application, the reference system can be substituted for the new one, and the difference in the overall performance of the application gives an indirect figure of merit of the new system relative to the reference system, for that particular application.


Desirable properties of such reference systems are relative efficiency and robustness, together with easy implementation and reproducibility (from an algorithmic point of view). They should not require sophisticated training procedures; in particular, they should be able to operate with very limited training data (such as one utterance) per speaker. Such a system allows the complexity of a task to be measured indirectly, by providing a reference figure for that task.

Among the features of a reference system, the decision strategy has a considerable impact on the results. Beyond raw performance, the way doubtful decisions are handled can considerably improve efficiency in operational use, as illustrated by the sketch below. This fact should be taken into account when defining a standard system.
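
By way of illustration, the following minimal sketch shows a two-threshold decision rule in which scores falling between the two thresholds are flagged as doubtful, for instance to trigger the request of an additional utterance. The score scale and the threshold names are purely illustrative and are not part of any specific reference system.

  from enum import Enum

  class Decision(Enum):
      ACCEPT = "accept"
      REJECT = "reject"
      DOUBTFUL = "doubtful"   # e.g. ask the claimant for another utterance

  def decide(score, theta_accept, theta_reject):
      """Two-threshold decision rule: a higher score means a better match.
      Scores between the two thresholds are not forced into accept/reject."""
      assert theta_accept >= theta_reject
      if score >= theta_accept:
          return Decision.ACCEPT
      if score <= theta_reject:
          return Decision.REJECT
      return Decision.DOUBTFUL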

Let us examine a few possible candidates.

For text-dependent and text-prompted applications, a baseline DTW (Dynamic Time Warping) system offers a number of advantages, although DTW is very sensitive to end-point detection. Such a reference DTW system was proposed in the context of the SAM-A project [Homayoun 93].
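
As an illustration only (this is not the SAM-A implementation), a minimal DTW sketch, assuming each utterance is given as a matrix of feature vectors (frames by coefficients), could look as follows:

  import numpy as np

  def dtw_distance(ref, test):
      """Dynamic Time Warping distance between two feature sequences
      (arrays of shape [frames, coefficients]), using Euclidean local
      distances and the basic step pattern (insertion, deletion, match).
      The accumulated distance is normalised by the path length so that
      utterances of different durations remain comparable."""
      n, m = len(ref), len(test)
      local = np.linalg.norm(ref[:, None, :] - test[None, :, :], axis=2)
      acc = np.full((n + 1, m + 1), np.inf)
      acc[0, 0] = 0.0
      for i in range(1, n + 1):
          for j in range(1, m + 1):
              acc[i, j] = local[i - 1, j - 1] + min(acc[i - 1, j],
                                                    acc[i, j - 1],
                                                    acc[i - 1, j - 1])
      return acc[n, m] / (n + m)

In a verification context the decision is obtained by comparing this distance with a threshold, which is why the end-point sensitivity mentioned above matters: extra frames of silence or noise at either end of an utterance directly inflate the accumulated distance.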

For text-independent applications, a second-order statistical (sphericity) measure was proposed in SAM-A [Bimbot 93]. It captures the correlations of a time-frequency representation of the signal. The correlation matrix of a speaker is better estimated over a long stretch of speech; nevertheless, encouraging results have been obtained on segments as short as 3 seconds.
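
A minimal sketch of a second-order measure of this kind is given below; it follows the arithmetic-harmonic sphericity formulation reported in the literature, but the exact features and normalisation used in [Bimbot 93] may differ.

  import numpy as np

  def sphericity_measure(x_frames, y_frames):
      """Second-order (sphericity-type) measure between two speech segments,
      each given as a [frames, coefficients] matrix of spectral features.
      X and Y are the covariance matrices of the two segments and
          mu = log( tr(X Y^-1) * tr(Y X^-1) / p^2 )
      is non-negative, reaching 0 when X and Y are proportional."""
      X = np.cov(x_frames, rowvar=False)
      Y = np.cov(y_frames, rowvar=False)
      p = X.shape[0]
      return float(np.log(np.trace(X @ np.linalg.inv(Y)) *
                          np.trace(Y @ np.linalg.inv(X)) / p ** 2))

As noted above, the covariance matrices are better estimated on long segments; with only a few seconds of speech, the number of frames must at least remain clearly larger than the number of coefficients for the matrices to be invertible.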

Both systems were investigated because they are based on fully reproducible algorithmic methods; in particular, they do not require any iterative learning of models, whereas the performance of Vector Quantization, Hidden Markov Model or Neural Network approaches can be affected by the learning strategy. The issue of defining one or two reference systems is certainly crucial for evaluation methodology.

Commercial products could also be evaluated in parallel on the same database. This is not an easy task: off-the-shelf systems are usually targeted at a specific application, the type of input they accept may be restricted, they usually have specific training protocols, and they may not provide a digital speech input. A specific `harness' must therefore be developed to feed them samples from the database.

Reference human test

 

In speech assessment, human performance on a well-defined task is one of the most popular means of evaluating the limits of automatic methods, of investigating the human approach to solving the problem (so that the automatic system may take advantage of this knowledge), and of scoring human performance against "system" performance. For instance, the classical HENR (Human Equivalent Noise Ratio) method for assessing speech recognisers, and the well-established set of listening tests (DRT, MOS, etc.) used in the assessment of speech output systems, are well-known techniques in speech assessment.

There is a wide literature on speaker recognition by human listeners, but only a small proportion of these experiments report specific data on speaker verification. Unfortunately, no formalism has been established for speaker recognition, and the experiments reported in the literature usually do not share the same experimental conditions, so that results cannot easily be compared.

In the field of speaker identification, the large number of factors that have to be controlled in a listening session (number of speakers, duration of voice material per talker, voice familiarity, phonetic content of the speech material, delay between sessions, etc.) makes the definition of a "standard" listening test a very difficult goal. For speaker verification, however, the problem becomes simpler; reference experiments may be found in [Rosenberg 73] and in [Federico 89].

Automatic / human tests

 

In the past, effort in this field has mainly been spent on answering the question of how accurate automatic methods for talker recognition are compared with human listeners, i.e. on establishing whether the performance of an automatic system is acceptable, on the assumption that system performance similar to listener performance is acceptable. It has further been claimed that automatic speaker identification is one area of speech processing where machines can exceed human performance. This is true when short utterances are used, when the number of speakers is large, or when the voices are "unfamiliar". On the other hand, it has been reported that humans appear to handle mimicry better than automatic speaker verification systems: according to [Reich 81], listeners seem able to associate an imitated voice with the intended target voice without confusing the two.

Comparisons between automatic methods and human listeners usually end up with a task that is reasonable for automatic speaker identification but unfair to listener capabilities. For these reasons, we believe that results reported in the literature should be interpreted with considerable caution. At the same time, more knowledge and experience about the relationship between listener performance and automatic system performance would surely be very helpful, both in the design and development of speaker recognition systems and in their assessment. An essential next step is to define and test procedures for listening tests in automatic speaker recognition, so that effort in this field is not wasted owing to a lack of reproducibility or to a proliferation of test conditions.

Given that listening tests are very time-consuming and costly research activities, it is not realistic to envisage such a procedure for every existing database. A good compromise would be to dedicate some effort to the human calibration of standard databases. Further research and experiments on human tests are nevertheless necessary in order to establish standards and recommendations supported by both theoretical models and experimental results, as listening methods would surely be helpful both in the development and in the assessment of speaker recognition systems.

Preliminary experiments [Homayounpour 93] have been conducted within the SAM-A project on the comparative evaluation of speaker verification by human listeners and by an automatic system (based on DTW). Tests were conducted on emotional and imitated speech; both databases present difficulties to humans and to machines alike. On these tasks, under similar conditions, human listeners still perform better than the reference system, but the differences are not large.

Transformation of speech databases

 

Speaker recognition systems are sensitive to intra-speaker variability and to environmental factors such as noise, transducer and channel characteristics, etc. In order to quantify this sensitivity, these factors should be modelled and varied independently. For each dimension, the limits of acceptable variation can be measured, providing a sensitivity profile for the system under test.
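
A sketch of how such a profile might be built along the noise dimension is given below; the verify function, the trial list and the additive white-noise model are illustrative assumptions, not part of any specific assessment protocol.

  import numpy as np

  def add_noise(signal, snr_db, seed=0):
      """Add white Gaussian noise to a signal at a prescribed SNR (in dB)."""
      rng = np.random.default_rng(seed)
      noise_power = np.mean(signal ** 2) / (10.0 ** (snr_db / 10.0))
      return signal + rng.normal(0.0, np.sqrt(noise_power), size=signal.shape)

  def noise_sensitivity_profile(verify, trials, snr_values=(30, 20, 10, 5, 0)):
      """Re-run the same verification trials at decreasing SNR and record the
      error rate at each level.  `verify(signal, claimed_id) -> bool` and
      `trials`, a list of (signal, claimed_id, is_genuine) tuples, stand in
      for the system and the database under test."""
      profile = {}
      for snr in snr_values:
          errors = sum(verify(add_noise(sig, snr), spk) != genuine
                       for sig, spk, genuine in trials)
          profile[snr] = errors / len(trials)
      return profile

The same scheme applies to any other dimension (channel filtering, transducer mismatch, etc.): only the transformation applied to the test material changes.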

Goldman et al. [Goldman 93] have investigated the feasibility of using speech transformation techniques to find the limits of speaker verification systems. Deliberate modifications intended to mimic several emotional states or to imitate another speaker are tested first, and then artificial transformations are investigated. These preliminary tests are limited to two prosodic features (duration and fundamental frequency) and one articulatory feature (simulated vocal tract length); each is varied independently, and the consequences of such artificial modifications are evaluated in the framework of text-dependent speaker verification. Techniques for simulating the Lombard effect and environmental factors have also been investigated in the context of ASR assessment.
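
The sketch below illustrates the kind of artificial transformations involved, using standard signal-processing operations (here with the librosa package); the factors, the file name and the crude resampling-based spectral warp are illustrative choices, not the procedures of [Goldman 93].

  import librosa

  # Load an utterance from the test database (file name is illustrative).
  y, sr = librosa.load("utterance.wav", sr=16000)

  # Duration: lengthen the utterance by 20% without changing its pitch.
  y_slow = librosa.effects.time_stretch(y, rate=1.0 / 1.2)

  # Fundamental frequency: raise the pitch by two semitones, duration unchanged.
  y_high = librosa.effects.pitch_shift(y, sr=sr, n_steps=2.0)

  # A crude stand-in for a vocal tract length change: resample by a factor
  # alpha and time-stretch back to the original duration, which compresses
  # the whole spectrum (formants and pitch alike) by alpha.
  alpha = 1.1
  y_warp = librosa.effects.time_stretch(
      librosa.resample(y, orig_sr=sr, target_sr=int(sr * alpha)), rate=alpha)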

Accurate models of speech variability need to be investigated further. They must be based on data from large databases specifically designed to capture intra- and inter-speaker variability.


