
Typology and assessment of applications

The speech signal conveys a message (the linguistic information) originating from a source (a human speaker or a synthesizer) and transmitted through a channel to be understood by the receiver. These four aspects (message, source, channel and receiver) should be kept in mind during the evaluation of a speech application. The paralinguistic information conveyed by the speech signal is already exploited in many practical applications, and this use will grow. In some cases, it is used to improve another aspect of speech communication (as in speech recognition and speech synthesis). In other cases (speaker verification/identification, speaker change detection, pathology detection, age/sex/language identification, ...), it is the essence of the application.

Let us list a number of application areas:

Speech recognition: accent (regional, foreign), speaking habit, speaker classification, adaptation, intent of the speaker, speaking style, change in speaking rate
Speech synthesis: increase naturalness, offer the possibility of changing the speaker characteristics, convey emotion
identification/verification of identity
change of speakers
speech pathology detection and evaluation, physical and psychological state
pronunciation training and re-education
...

It is unrealistic to review all possible applications, many of which are unknown at this time. What is attempted here is a classification of potential applications, with the main focus on speaker verification.

The scientific interest here is to assess how reliable the paralinguistic information is for the intended application.

Let us illustrate our argument with the simple example of the `lie detector', a system supposedly capable of detecting when a speaker is deliberately not telling the truth. Such systems have been implemented and sold commercially. It is doubtful that any objective evaluation of such a system has been conducted. It could even be argued that such a system should obtain approval from an independent (standardisation) institute prior to commercial exploitation. Such administrative and social issues can only be mentioned in this handbook. The responsibility of the scientific community should be limited to providing the tools and methods to assess the performance of systems. However, guidelines and recommendations should also be made about the feasibility of applications given our current knowledge.

These issues will be revisited in section XXX, dedicated to forensic applications (legal and police investigations).

We are convinced that several approaches to the problem of speaker recognition are now efficient enough to bring advantages to some real-world applications. It is a matter of combining a reasonably efficient algorithm, a robust decision strategy and carefully designed ergonomics into a well-engineered system. As an additional layer of security, voice verification can be effective in several applications, even if it is not 100 % reliable (personal identification numbers are not 100 % secure either). Besides its deterrent effect, it can clearly prevent a significant share of unauthorised accesses or transactions, without being perceived as a hindrance by the ordinary customer. Multi-modality will progressively extend its field of action in the everyday world, and voice recognition will undoubtedly have a role to play in this development.

It is often stressed that the advantages of text-independence come from two sources. For machine-driven text-independent systems, if the speaker cannot predict what he has to say, an impostor cannot use a pre-recording of the true speaker's voice to cheat the system. For user-driven text-independence, speaker recognition remains possible even if only recordings of the speaker are available, or if the speaker is non-cooperative. Note incidentally that, for security applications, the first argument is only valid if some kind of speech verification is coupled with the speaker recognition technique, to check that the speaker is indeed saying what he is supposed to say.
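The coupling of speech verification with speaker verification can be sketched as follows. This is a minimal illustration and not a description of any deployed system: the function names `make_prompt` and `accept`, the five-digit prompt length and the score threshold are all assumptions, and the recognised digit string and the speaker score are taken as given outputs of a recogniser and a verifier.

```python
import random


def make_prompt(n_digits=5, rng=random):
    """Draw a random digit string that the claimant must speak (hypothetical helper)."""
    return "".join(rng.choice("0123456789") for _ in range(n_digits))


def accept(prompt, recognised_digits, speaker_score, threshold):
    """Accept only if the spoken text matches the prompt AND the voice matches.

    recognised_digits: output of a digit recogniser (assumed given).
    speaker_score: similarity to the claimed speaker, higher is better (assumed given).
    """
    text_ok = recognised_digits == prompt
    voice_ok = speaker_score >= threshold
    return text_ok and voice_ok
```

A played-back recording of the true speaker would obtain a good speaker score but would fail the text check, since the prompt could not have been predicted.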

Recommendations

Applications and products in speaker recognition

In this chapter, we list several potential or existing applications of speaker identification and verification technology, and we give an overview of the existing products that we are aware of.

We distinguish remote and local applications. Remote applications are typically performed over the telephone, where telephone sets and channel paths present quite diverse characteristics. The microphone and the environment are much better controlled with local applications.

needs of the user, alternatives

advantages to the user/to the service provider

cost

the length of the speech material necessary for the training and testing phases, the computational complexity, the number of speakers used for evaluation, etc. However, these factors are not structural characteristics of the system.

Telecommunication (remote) applications and products

One of the most obvious uses of speaker recognition techniques is caller authentication over the telephone network. In such applications, the main task is therefore speaker verification. The user claims his identity, most of the time by dialing (or saying) a personal code number. Then, either a code word is required to authenticate the speaker, or his utterance of the code number is used for verification purposes.

Typical application areas are:

banking (checking balance, transfer of funds),
credit cards. Remote payment by credit card is usually accepted by providing the name of the owner, the card number and the expiration date (all this information is printed on the card).
teleshopping (grant authorisation to transfer money)
stock exchange (purchasing, selling stocks)
home incarceration
military applications
calling cards (SPRINT), billing your own account
access to/modification of information on remote servers (restrict access to authorized users).
access to computing facilities from a remote terminal (coupled with encryption). Passwords are currently used. Alternatives necessitate non-standard equipment (card reader, scanner, electronic pen, encrypted key, ...)
...

The main applications for speaker verification over the telephone network are of two kinds: the first is banking and remote transactions, the second is access to licensed databases. Obviously, the two fields do not require the same level of security: it is usually less costly to let an unauthorised person have access to a database than to allow somebody to carry out a transaction that can involve large amounts. However, in many countries, it is possible to pay by credit card over the phone, with no verification of the customer other than the consistency between the customer's name, his credit card number and its expiry date (all of which are on the credit card itself!). Some kind of basic speaker verification (for instance user-specific text-dependent verification), even with tolerant thresholds, could certainly bring more security to this type of transaction. Note that, in this case, the speaker characteristics would have to be stored centrally, at the place that delivers the transaction authorisations, and each supplier would need special equipment. However, the verification could take place off-line, because the confirmation of the transaction does not usually need to be immediate.

Some telephone operators provide a toll-free service allowing customers to place a phone call from any terminal. The charge for the call appears on the user's phone bill. For such an application, the user must be identified. A personal identification number (PIN) could be requested and entered via the dial pad. Such a protocol lacks security, as some people hunt for valid PINs in order to use or sell them. The SPRINT operator in the USA offers a successful alternative based on speaker verification. Upon a call to the service, the user is requested to say the sequence of digits of his account. Digit recognition and speaker verification are performed on this sequence. In case of doubt, he is then prompted with a random sequence of digits for further validation (which gives some confidence that he is not playing back pre-recorded speech). He is then allowed to dial the phone number of the person he wants to reach.

(Editorial note: I do not recall the precise details of this protocol. Could you help?)

The SPRINT operator has registered 1.5 million customers for this service. In such an application, a rather high false acceptance rate is tolerable. Impostors are aware that their voice is being recorded, and the introduction of speaker verification proved to be a deterrent against fraud. With a rather small increase in the complexity of the access protocol, a satisfactory level of security was achieved. Of course, a higher degree of security could be reached by letting the user dial by voice. However, the risk of recognition error (although rather small, as recognition could be performed in a speaker-dependent mode) would necessitate confirmation and a longer dialogue (paid for by the service provider!).
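As an illustration of the "in case of doubt" step of the calling-card protocol, the sketch below maps a verification score to one of three outcomes. The two thresholds, the score scale and the function names are hypothetical; the actual values and decision rule used by the operator are not known to us.

```python
def first_pass(score, accept_at=0.7, reject_at=0.3):
    """Classify the score obtained on the spoken account number."""
    if score >= accept_at:
        return "accept"
    if score < reject_at:
        return "reject"
    return "challenge"  # doubt: ask for a random digit sequence


def handle_call(account_score, challenge_score=None, accept_at=0.7, reject_at=0.3):
    """Return the final decision for one call.

    challenge_score is the score obtained on the random digit sequence;
    it is only consulted when the first pass is inconclusive.
    """
    outcome = first_pass(account_score, accept_at, reject_at)
    if outcome != "challenge":
        return outcome
    if challenge_score is not None and challenge_score >= accept_at:
        return "accept"
    return "reject"
```

Widening the gap between the two thresholds sends more callers to the random-digit challenge, which lengthens the dialogue but leaves fewer decisions to a single utterance.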

Speaker verification over the telephone line in Germany is mainly done at SIEMENS Munich. At the end of 1990, the necessary database was evaluated. The speech was collected from calls over analog and ISDN lines [Joachim 90].

In the UK, the BT home banking trial with the Royal Bank of Scotland used speaker verification technology to provide an additional layer of security on top of PINs (Personal Identification Numbers). The concern to offend as few genuine customers as possible was very important to the bank.

VOCALIS Ltd. (at that time part of LOGICA) developed a real-time telephone-based text-dependent system in 1988. The system's decision-making logic has 4 layers of authentication prior to making the final decision on the claimed identity of the speaker. An individual password and a PIN are used as the first 2 layers of authentication. A number of words from pre-defined vocabularies are also chosen randomly at the later stages of verification. The enrolment process follows a robust strategy, in order to guarantee that the reference template is relevant. Unlike many similar products, VOCALIS' speaker verification system is developed for the UNIX platform and makes use of embedded hardware. The application is fully integrated into the company's communications infrastructure. An extensive field trial of the system was organised in conjunction with a telephone service company in the US, where over 400 people from 4 different geographical groups registered with the system. The system allowed users to access US Government databases out of office hours. All registered participants were encouraged to make at least 10 calls over a 3-month period, using as many handsets as possible. The registered speakers were also asked to make 2 impostor calls, and were purposely provided with the information needed to break the first 2 layers of authentication. At the end of the trial, more than 4000 calls had been logged. A detailed analysis of the results showed a 5 % equal error rate for the verification algorithm alone. None of the participants managed to break into the system as impostors when all layers of authentication were active. A similar speaker verification system is also available on the Callserver platform, again from VOCALIS.
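For readers unfamiliar with the equal error rate quoted for this trial, the sketch below shows one simple way to estimate it from two lists of verification scores (genuine calls and impostor calls). The function names and the convention that a claim is accepted when its score is at or above the threshold are assumptions made for the example; this is not the analysis procedure used in the trial.

```python
def error_rates(genuine, impostor, threshold):
    """Return (false_rejection_rate, false_acceptance_rate) at a threshold.

    A claim is accepted when its score is >= threshold.
    """
    frr = sum(s < threshold for s in genuine) / len(genuine)
    far = sum(s >= threshold for s in impostor) / len(impostor)
    return frr, far


def equal_error_rate(genuine, impostor):
    """Scan the observed scores as candidate thresholds and return
    (threshold, rate) where the two error rates are closest to each other."""
    best = None
    for t in sorted(set(genuine) | set(impostor)):
        frr, far = error_rates(genuine, impostor, t)
        gap = abs(frr - far)
        if best is None or gap < best[0]:
            best = (gap, t, (frr + far) / 2)
    return best[1], best[2]
```

With few trials the two rates rarely cross exactly, so the sketch reports the average of the two at the closest point; larger test sets give a smoother estimate.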

Also in the UK, Barclays Bank uses a speaker verification system.

Voice Control System have integrated speaker verification into their telephone speech recognition system. It works on several platforms including PC and VME. By August 1990 [Hunt 91], the performance was 1% false acceptance and 2% false rejection.

A speaker verification facility is also part of the AT&T HuMaNet teleconferencing system [Berkley 90], which works via ISDN and uses an AT&T DSP32C digital signal processor board in a personal computer.

The CNAPS of Adaptive Solution Inc. works over the telephone line and gives both speaker verification and speaker identification capabilities [Skinner 92]; averaging the true acceptance and true rejection rates the system reaches 9 5

ENSIGMA Ltd. produces a speaker verification system named VERIFIER [Moody 91], which runs on the Loughborough AT&T DSP32C telephone board. It works both over the telephone line and over an audio line. Ensigma claims full portability of the speaker verification software to other boards. The system uses a new approach to the speaker verification problem: instead of simply matching a speaker's voice against the user's template, ENSIGMA's system also compares the spoken word against a "general world-model". The decision is made taking into account the results of these two comparisons. This method has been evaluated over the telephone line and gives 1% false rejection and 1% false acceptance. The method also seems quite robust to background noise. Enrolment consists of the new user repeating a series of digits; this phase lasts about one minute. Access consists of the user pronouncing a list of (random) digits requested by the system. The system response takes about 5 seconds. The VERIFIER has many properties that make it a good candidate as a possible reference hardware system for (text-dependent) speaker verification. The multi-lingual aspect, or more generally the vocabulary problem, can be solved only if the procedures for building the "reference world model" used to score utterance similarity are released by the producer.
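The idea of scoring an utterance against both the claimed speaker's model and a world model can be illustrated with a toy example. The way the VERIFIER actually combines the two comparisons is not documented here, so the sketch assumes one common choice, a per-frame log-likelihood ratio, and uses single one-dimensional Gaussians as stand-ins for both models. All names and parameter values are hypothetical.

```python
import math


def gaussian_loglik(x, mean, var):
    """Log-likelihood of a sequence of scalar features under one Gaussian."""
    return sum(
        -0.5 * (math.log(2 * math.pi * var) + (v - mean) ** 2 / var) for v in x
    )


def normalised_score(x, speaker, world):
    """Per-frame log-likelihood ratio: claimed-speaker model versus world model.

    speaker and world are (mean, variance) pairs (toy models, assumed).
    """
    return (gaussian_loglik(x, *speaker) - gaussian_loglik(x, *world)) / len(x)


def verify(x, speaker, world, threshold=0.0):
    """Accept the claim when the utterance fits the speaker better than the world."""
    return normalised_score(x, speaker, world) > threshold
```

Because the world model is scored on the same utterance, factors that degrade both scores equally (a noisy line, for instance) tend to cancel in the ratio, which is one plausible reason for the robustness reported above.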

On-site applications

Typical applications are:

control of / access to an equipment
access to a secured area (nuclear plant, military)
voice key (home, car)
mobile telephone, personal assistant (only respond to the voice of his owner)
Automatic Teller Machine (ATM)
...

With on-site applications, the person to be verified must be physically present in one particular place. This typically covers access control applications and fixed-place banking services (Automatic Teller Machines). These applications are the equivalent of database access and remote transactions over the telephone. However, they differ from telecommunication applications in four respects: the environmental factors and the signal bandwidth can be more easily controlled, the automatic verification can raise an alarm in case of doubt, the customer can carry his voice characteristics with him (on a smart card, for instance), and the voice verification technique can be combined more easily with additional (multimodal) identity verification techniques.

access control: open-set speaker id

alternatives:

key, badge, code (could be stolen)

finger print, face recognition, hand shape, signature, retinal scan

make sure that only one user gets access (2 doors, scale)

defeat prerecorded messages (text prompted)

A possible implementation of a voice verification system for cash dispensers can simply consist of a voice verifier which refuses the transaction (or limits its amount) if the voice characteristics do not sufficiently match the identity corresponding to the PIN. Note that, even if a thief steals a credit card with the PIN attached to it, he may still not know the voice of the user, and may have difficulty obtaining information about it. If he knows the voice of the owner, he may still not be able to imitate it. Even if some false acceptances take place, the number of fraudulent transactions is necessarily reduced. However, if too many false rejections occur, the bank may lose some of its clients. Since the probability that an impostor has a voice similar to the user's is small, the system can be quite tolerant and still reduce the number of fraudulent withdrawals and act as a deterrent to impostors, without offending a significant proportion of the regular users. If additional procedures are put into action, such as taking a picture of the user in case of doubt about his identity, or performing some kind of face recognition, deterrence is reinforced, since a possible impostor must then resort to a more elaborate and less casual strategy. For this kind of application, the voice characteristics of the speaker can be stored on the magnetic stripe or the chip of his card, which does not require centralised storage. The verification must take place in nearly real time to avoid undesirable delays in the transaction.
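The trade-off between blocked fraud and offended customers can be made concrete with a little arithmetic. In the sketch below, the function name and all figures in the usage lines are invented for illustration only; they are not measurements from any bank or system.

```python
def expected_outcomes(fraud_attempts, genuine_attempts, far, frr):
    """Expected counts for a verifier with the given error rates.

    far: false acceptance rate (share of impostor attempts that pass).
    frr: false rejection rate (share of genuine attempts that are refused).
    """
    return {
        "fraud_passed": fraud_attempts * far,
        "fraud_blocked": fraud_attempts * (1 - far),
        "genuine_rejected": genuine_attempts * frr,
    }


# Hypothetical operating point: a tolerant threshold with a high false
# acceptance rate and a low false rejection rate.
outcome = expected_outcomes(
    fraud_attempts=1000, genuine_attempts=100000, far=0.10, frr=0.01
)
print(outcome)
```

Even with a false acceptance rate that would be unacceptable as a sole protection, most fraudulent attempts are blocked in this example, while the number of refused genuine customers depends only on the false rejection rate and the volume of legitimate traffic.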

As regards access control, a reasonable system would be based on a rather strict verification, with a "call for assistance" alternative in case of rejection.

The system of Doddington [Doddington 74], located at Texas Instruments, is the first example of access control via speaker verification. It is a vocabulary-dependent system, and speech is parameterised by 14 filter-bank coefficients. Test utterances are composed of a sequence of 4 monosyllabic words. During the training phase, a key-point corresponding to the energy maximum in each word is detected, and the spectral vector at this key-point is stored as the reference for the whole word. This key-point is essentially the center of the vowel in the monosyllabic word. During the test phase, a simplified version of dynamic programming is used to simultaneously locate the 4 key-points in the input utterance and to evaluate the minimum distance corresponding to the best match. The elementary distance measure for the DTW is the Euclidean distance. In real-world conditions, with an appropriate multiple verification strategy, Doddington [Doddington 85] reported in 1985, for the latest version of the system, less than 1 % for both false rejection and false acceptance rates. The operational system is located in a reasonably quiet environment, but the signal is band-pass filtered between 300 Hz and 3000 Hz. It is described in detail in the previous chapter.
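The key-point matching step can be sketched as a small dynamic programme. The exact recursion used in the Texas Instruments system is not given here, so the version below is an assumption: it places one input frame per reference vector, in strictly increasing time order, so as to minimise the summed Euclidean distance. The function names are hypothetical, and short toy vectors stand in for the 14 filter-bank coefficients.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))


def match_keypoints(frames, references):
    """Return (total_distance, indices): one frame index per reference vector,
    strictly increasing, minimising the summed Euclidean distance."""
    n, k = len(frames), len(references)
    if n < k:
        raise ValueError("need at least one frame per reference vector")
    inf = math.inf
    # cost[j][t]: best total for references[0..j] with references[j] at frame t
    cost = [[inf] * n for _ in range(k)]
    back = [[-1] * n for _ in range(k)]
    for t in range(n):
        cost[0][t] = euclidean(frames[t], references[0])
    for j in range(1, k):
        best_prev, best_idx = inf, -1
        for t in range(1, n):
            # running minimum of the previous row over all frames before t
            if cost[j - 1][t - 1] < best_prev:
                best_prev, best_idx = cost[j - 1][t - 1], t - 1
            if best_idx >= 0:
                cost[j][t] = best_prev + euclidean(frames[t], references[j])
                back[j][t] = best_idx
    end = min(range(n), key=lambda t: cost[k - 1][t])
    total = cost[k - 1][end]
    indices = [end]
    for j in range(k - 1, 0, -1):
        end = back[j][end]
        indices.append(end)
    indices.reverse()
    return total, indices
```

The returned total distance would then be compared with a threshold; the running minimum keeps the search linear in the number of frames for each reference vector.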

MSS

access to nuclear plants

Until 1990, in France, the available technology was provided by the LIMSI research laboratory [Sorin 90]. An industrial prototype of the SESAME system was expected by the VECSYS company. The SESAME experimental system has been used for about three years at the entrance of the LIMSI building in ORSAY.

Other applications

Among other possible applications of speaker recognition techniques in general, speaker change detection can be used for automatic speaker labeling in recordings or focussing on the actual speaker in videoconferences. For some multi-media applications, it may be very useful to access, speaker by speaker, the recording of a conversation or of a radio or television program. Here, the speech to be processed is reasonably contemporaneous, which makes the algorithms usually more efficient, but it may happen that several speakers talk at the same time.

Finally, we repeat here that research in speaker recognition can be used for speaker selection or adaptation of multi-speaker or speaker-independent speech recognition systems, and can bring improvements to speech processing techniques in general, including synthesis and coding.

Alternative techniques

Security may be achieved in several ways, and voice is only one of the possibilities. We give here a brief overview of the alternative techniques.

The means used to protect information or access against unauthorised persons are usually classified into three main families: physical-object elements (key, badge, etc.), information elements (password, combination, etc.) and physical-personal elements (fingerprint, voice, DNA, etc.). Systems based on this last family are called "biometric verifiers", as they are directly tied to some biological characteristics of the (authorised) person. It is generally accepted that high-security systems must combine all three families. The biometric family is divided into two sub-families: one based on physical characteristics of the person that do not change (barring major events) over a period of a few years; the other based on behavioural characteristics that may change with mood, environment, physical state, and so forth. Because of the high variability of the latter sub-family, the corresponding systems are quite difficult to build, especially in automatic mode, without expert supervision. To clarify these two sub-families of biometric verifiers, let us consider the following list of biometric verifiers:

There is no single strategy for selecting the most appropriate biometric verifier for a given task. Nevertheless, there is a list of properties that must always be considered in designing a security system. Among the most important, we find:

We now review some commercial systems that use biometric verifiers, with a short description of how they work and their performance.

IDENTIX of Sunnyvale (California) has developed an automatic system based on fingerprint acquisition and recognition. After inserting a badge, the user has to press his finger into a concave hollow, and the fingerprint is scanned using infra-red rays. The response is given in about 5 seconds. The false acceptance error is negligible (zero in 200 trials), while false rejection is quite high, about 10%.

AUTOSIG SYSTEM of Irving (Texas) produces a system named SIGN/ON. It is a special tablet that records 13 different parameters (pressure, speed, etc.) while the user writes his signature on a simple sheet of paper positioned on the tablet (the user must, of course, use a special pen). The response is given in about 15 seconds, depending on the length of the signature. False acceptance on a single trial is 9%; with two trials (or more), the error is about 2%.

The ID3D made by RECOGNITION SYSTEM of San Jose (California) is a success among recognition systems, as it has many of the properties (that is, advantages) described above: it is a low-cost system, with a high degree of integration, fast to use, with a high rate of correct recognition and high user acceptability. After entering the PIN (using a badge, or typing it on a keyboard), the user is requested to place his hand on a tablet containing four rungs. When the hand is in the right position, a light is turned on, signalling to the user that the infra-red scan has started. To our knowledge, no information about the algorithm used in this system is available: we only know that the information is coded as a 9-byte value, and that the equal error rate has been evaluated at about 0.2%, a very impressive value.

EYEIDENTIFY Co. of Portland (Oregon) has produced several systems based on retinal scanning using infra-red rays. The user is requested to look into a special device until he focuses on a specific figure. Trained persons take about 7 seconds to perform the measurement. Glasses do not affect the measurement, while contact lenses cannot be used. The error rate is negligible, but user acceptability is very low.

Some comparisons between voice-based systems and alternative systems have been carried out. Together with the previous systems, James [James 90], at the SANDIA laboratories of Albuquerque (New Mexico), evaluated two speaker verification systems: the VOXTRON (or Ver-A-Tel) by ALPHA MICROSYSTEM of Santa Clara (California), and the VOICE by ECCO INDUSTRIES of Danvers (Massachusetts). Both systems have almost the same rate of correct recognition (worse than that of the other systems). According to the SANDIA report, the voice systems have a high rate of (social) acceptability, as does the ID3D, but the latter is preferred because it is easier to use.

Generally speaking, if we are interested only in performance, a broad table summarising the values for different biometric verifiers may be the one reported by Peckham [Peckham 90]:

But as noted in the same report, field tests carried out for the US government on speech, signature and fingerprint resulted in a recommendation to use speech. Note also that the alternative techniques listed above are more appropriate for on-site applications, and that speech remains a unique biometric identification feature over telecommunication channels, at least until videophone and ISDN become widespread. Even then, some of the alternative techniques could only be implemented through individual sensors, which limits their applicability.

Conclusions

Several commercial systems using speaker recognition techniques are now available on the market, mostly for speaker verification purposes. As with existing algorithms, it is quite difficult to compare them objectively. This fact stresses the need for standard evaluation procedures for these systems: one possible approach is the definition of a standard application. Such a procedure should specify very accurately the nature of the application, and integrate the speaker verification layer into a predefined structure of authentication layers. It is indeed clear from several examples that the error rate of the pure voice verification algorithm can have very different consequences for the overall score of an application, depending on the way the voice verification is integrated with other verification layers. A drawback of such an approach is that the application at which the system is targeted may be very different from the reference application, and the evaluation figures may become somewhat meaningless.
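To see why the same voice verification error rate can lead to very different overall figures, consider the simplest integration scheme, in which a claimant must pass every layer. The sketch below assumes that the layers err independently of one another, which is a strong simplification, and the function name and the rates in the usage lines are hypothetical.

```python
from math import prod


def combined_rates(layers):
    """Overall (false_acceptance, false_rejection) when ALL layers must pass.

    layers: list of (far, frr) pairs, one per authentication layer.
    Assumes the layers make their errors independently of each other.
    """
    far = prod(f for f, _ in layers)          # impostor must fool every layer
    frr = 1 - prod(1 - r for _, r in layers)  # genuine user must pass every layer
    return far, frr


# Hypothetical figures: a voice layer at 5 % / 5 % combined with a second
# layer (a code, for instance) that an impostor defeats 10 % of the time.
far, frr = combined_rates([(0.05, 0.05), (0.10, 0.01)])
print(far, frr)
```

Under these assumptions, adding layers drives the overall false acceptance down multiplicatively while the false rejections accumulate, so the decision strategy around the voice layer matters as much as its own error rate.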

From the examples above, it is quite evident that voice verification can represent an advantageous additional layer for user authentication. It is also clear that the success of such an implementation relies on adequate design and engineering of the application, including multiple-layer decisions, error recovery strategies and ergonomic considerations. At this stage, additional alternative techniques can also be integrated for increased security. The commercial future of voice verification techniques is certainly strongly connected to careful thought about application design.

To conclude, voice verification technology is certainly at a stage where its future use is partly conditional on the ability of the suppliers to convince their clients, and the customers of their clients, of the advantages that can be gained from such techniques. This conviction will only be reinforced by well-targeted application engineering.

Recommendations


WWW Administrator
Fri May 19 11:53:36 MET DST 1995