voice disguise
bomb threat, kidnapping
denunciation
tapping telephone lines
One important field in which speaker recognition is involved is that of forensic applications. This concerns cases in which victims or witnesses have an auditory memory of the criminal's voice, and cases in which recordings are available. In the case of recordings, the content of the tapes may be important (who is speaking? what is being said? which sounds can be identified? etc.). The tapes should usually be analysed to check whether they have been manipulated, erased, recorded over, etc. The signal may require enhancement to be comprehensible. In some cases, the voice of a suspect must be compared with a recorded voice. This is a very special case of speaker verification with many specific difficulties.
Various methodologies have been proposed for approaching this problem; they can broadly be classified into three categories, according to the criterion adopted for analysing and characterising the voice signal: the listener method, the spectrographic method and the semi-automatic method.
In the listener method, the decision on similarity is taken by human experts after listening to speech samples. One approach is repeated listening to the available samples by a group of experts looking for similarities in linguistic, phonetic and acoustic features. The listening test can be conducted sequentially, alternately or, if the instrumentation allows it, by simultaneous presentation of the voices [Ibba 86]. The acoustic material is carefully prepared by selecting from the database suitable utterances of the unknown speaker and of the reference speaker, possibly augmented with voice samples from speakers with the same acoustic-phonetic characteristics. These utterances are then compared two by two, inverting pairs of voices from the same speaker. Studies have been carried out to investigate how well listeners can recognise speakers [Stevens 68]. Listener performance is a function of acoustic variables such as the signal-to-noise ratio, the speech bandwidth, the amount of speech material, and the complex distortions of the speech signal introduced by speech coding and transmission systems. Several experiments, using different texts to compare and evaluate performance under different degraded conditions, showed that human listeners are robust speaker recognisers when presented with degraded speech. This is due to the fact that many sources of knowledge contribute in various ways to speaker recognition, providing weak, moderate or high discrimination power. However, as with any human decision process, it must be stressed that the listener method leads to a subjective decision. Nevertheless, the listener method is still used in some countries as a speaker recognition technique for forensic applications.
The spectrographic method for speaker recognition makes use of an instrument that converts the speech signal into a visual display. For many years the reference instrument was the "Voice Identification Inc., Sound Spectrograph, model 700". This instrument gives a permanent record of the changing energy-frequency distribution of a speech wave through time. Usually, the frequency range is 0-4000 Hz and the analysis filter bandwidth is 300 Hz. Because spectrograms are visual representations of the speech signal, they convey information about the message spoken by the speaker as well as about the speaker himself. For this reason, these patterns were thought to be usable as a means of identifying speakers. For example, when recordings of the voices of two individuals are obtained, an examiner may be able to give an opinion about the similarity between the two recordings, provided that they have phonetic elements in common. This method for speaker identification was originally proposed in [Gray 44], but its use for forensic applications was not considered until 1962, when Kersta [Kersta 62] published the results of experiments on one-word spectral comparison in closed-set tests. Further studies were carried out by Stevens [Stevens 68] and by Tosi [Tosi 72], who presented the results of research at Michigan State University on the basis of a "forensic model" with open-set tests. These results have been analysed by Bolt [Bolt 70], by observing the error rates of false identification and of false elimination. They observed that the error rate depends on many factors, such as different conditions of environmental noise, changes in the psychological state of the speaker, his attempts to alter his voice, the recording conditions, the orthophonic or telephonic voice of the talker, etc.; in particular, the error rate depends strongly on the examiner, and increases when moving from trained to untrained examiners.
Owing to these factors and to other restrictive conditions that affect the error rate of examiners, this method is today of no great interest to scientists in speaker recognition tasks.
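To make the display parameters quoted above concrete (a 0-4000 Hz frequency range and a 300 Hz analysis bandwidth), here is a minimal digital sketch of a wide-band spectrogram. It is only an illustration and not a description of the Sound Spectrograph itself. The function name, the 8 kHz sampling rate (chosen so that the highest frequency is 4000 Hz), the Hamming window, the 1 ms hop, the zero-padded transform size and the rule "frame duration = 1 / bandwidth" are all assumptions; the exact relation between window length and bandwidth depends on the window shape.

```python
import cmath
import math


def wideband_spectrogram(samples, sample_rate=8000, bandwidth_hz=300.0,
                         hop_s=0.001, n_fft=128):
    """Return (times, freqs, rows); rows[t][k] is the level in dB at freqs[k]."""
    # Assumption: a short frame of about 1/bandwidth seconds gives roughly
    # the requested analysis bandwidth (27 samples for 300 Hz at 8 kHz).
    frame_len = max(2, round(sample_rate / bandwidth_hz))
    if n_fft < frame_len:
        raise ValueError("n_fft must be at least the frame length")
    hop = max(1, round(sample_rate * hop_s))

    # Hamming window to limit spectral leakage.
    window = [0.54 - 0.46 * math.cos(2 * math.pi * n / (frame_len - 1))
              for n in range(frame_len)]

    # Keep the bins from 0 Hz up to sample_rate / 2 (4000 Hz at 8 kHz).
    n_bins = n_fft // 2 + 1
    freqs = [k * sample_rate / n_fft for k in range(n_bins)]

    times, rows = [], []
    for start in range(0, len(samples) - frame_len + 1, hop):
        frame = [samples[start + n] * window[n] for n in range(frame_len)]
        row = []
        for k in range(n_bins):
            # Direct DFT of the zero-padded frame (slow but dependency-free).
            acc = sum(frame[n] * cmath.exp(-2j * math.pi * k * n / n_fft)
                      for n in range(frame_len))
            row.append(20 * math.log10(abs(acc) + 1e-12))
        times.append(start / sample_rate)
        rows.append(row)
    return times, freqs, rows


# Usage: a synthetic 1000 Hz tone, 50 ms long, sampled at 8 kHz.
tone = [math.sin(2 * math.pi * 1000 * n / 8000) for n in range(400)]
times, freqs, rows = wideband_spectrogram(tone)
```

With such a short analysis frame, the display has good time resolution but coarse frequency resolution, which is the usual trade-off of wide-band analysis.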
Both the listening and the spectrographic methods are subjective techniques: the first is based on aural comparison of recordings and the second on visual examination of spectrograms, in order to attribute two voice samples to the same talker. These decisions are taken by one or several experts, according to a process that is clearly impossible to formalise. Moreover, the process by which an individual comes to be regarded as an "expert" is itself far from calibrated. Very often, experts are also asked to give a confidence probability for their own judgments. Are forensic speech experts subjected to benchmark tests before they are recruited? Does it make sense to give a figure measuring one's own confidence in one's own decision? There are certainly several issues concerning forensic applications of human speaker recognition that have to be called into question.
It is clear that some of the research carried out in automatic speaker recognition is targeted towards forensic applications. Most of the methods described in the previous chapter could be adapted to this goal, especially the text-independent ones. However, although they offer a certain rate of success (under restrictive conditions), it must be stated clearly that they are far from faultless, and that the consequences of an erroneous decision can be grave. In fact, although some expert analyses use very basic statistical methods (such as mean and variance computation), there does not seem to be any elaborate speaker recognition system publicly used for forensic applications; the possibility cannot be excluded, however, given the degree of secrecy surrounding such applications. Nor can it be excluded that experts resort to such systems in making their own decisions, with or without saying so explicitly. It is clear that such systems, at whatever level they are used, should be subjected to objective test protocols designed in agreement with members of the scientific community. Frequent contacts between speech researchers and members of the justice and police institutions would certainly also help to clarify the possibilities and limits of the use of speech and of speech technology in forensic applications.
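As an illustration of what a "very basic statistical method" of the mean-and-variance kind might look like, the sketch below summarises one acoustic measurement per recording and computes a standardised difference between the two means. This is a hypothetical example, not a documented forensic procedure: the function names, the choice of feature (per-frame fundamental frequency values in Hz) and the numbers are all assumptions made for the sake of illustration.

```python
import math
from statistics import mean, variance


def summarize(values):
    """Return the sample mean and the (unbiased) sample variance."""
    return mean(values), variance(values)


def standardized_mean_difference(sample_a, sample_b):
    """Difference of means divided by its estimated standard error."""
    mean_a, var_a = summarize(sample_a)
    mean_b, var_b = summarize(sample_b)
    std_err = math.sqrt(var_a / len(sample_a) + var_b / len(sample_b))
    if std_err == 0:
        raise ValueError("both samples are constant; statistic undefined")
    return (mean_a - mean_b) / std_err


# Hypothetical per-frame fundamental frequency values (Hz).
questioned = [100, 102, 98, 101, 99]
reference = [120, 122, 118, 121, 119]
score = standardized_mean_difference(questioned, reference)
```

A large absolute value only says that the two sets of measurements differ on this one feature. It says nothing on its own about speaker identity, since the same speaker can produce very different values across sessions, channels or emotional states, which is one reason why such simple statistics are far from faultless.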
For many applications of speaker recognition techniques, a wrong decision has only material consequences. For forensic applications, much more serious aspects are involved.
A database specific to this area should be carefully designed, recorded and distributed. It would consist, for instance, of a large collection of pairs of voice samples, sometimes from the same speaker and sometimes from different speakers. One of the recordings could take place in a studio, while the other would be subjected to several kinds of environmental, channel, artificial and intentional distortions. It would be calibrated in such a way that the average listener cannot do better than a random choice.
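The structure of such a collection of pairs can be sketched as follows. This is only one possible layout under stated assumptions: the record type `TrialPair`, the function `build_trials`, the identifier scheme (`speaker/condition`) and the distortion labels are hypothetical, and the calibration step against average listeners is not modelled here.

```python
import random
from typing import NamedTuple


class TrialPair(NamedTuple):
    studio_sample: str     # identifier of the clean studio recording
    distorted_sample: str  # identifier of the degraded recording
    distortion: str        # kind of distortion applied (labels are assumed)
    same_speaker: bool     # ground truth, hidden from the listener


def build_trials(speakers, distortions, n_pairs, seed=0):
    """Build a shuffled list of pairs, half same-speaker, half different."""
    if len(speakers) < 2:
        raise ValueError("at least two speakers are needed")
    rng = random.Random(seed)
    trials = []
    for i in range(n_pairs):
        target = rng.choice(speakers)
        same = (i % 2 == 0)  # alternate so that both classes stay balanced
        if same:
            other = target
        else:
            other = rng.choice([s for s in speakers if s != target])
        distortion = rng.choice(distortions)
        trials.append(TrialPair(f"{target}/studio",
                                f"{other}/{distortion}",
                                distortion, same))
    rng.shuffle(trials)  # hide the alternation from the listener
    return trials


trials = build_trials(["spk1", "spk2", "spk3"],
                      ["telephone", "noise", "disguise"], n_pairs=10)
```

Balancing same-speaker and different-speaker pairs means that a listener who guesses at random is right about half of the time, which gives a clear reference point for later comparisons.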
Such a database would be extremely valuable for several purposes. One would be to evaluate forensic speech experts and to clarify whether they can really do much better than an average listener. It would also demonstrate clearly to forensic professionals the limitations of human and automatic techniques of speaker recognition for such applications.
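An evaluation of this kind could be scored along the following lines. The sketch computes the two error rates mentioned earlier (false identification, where a different-speaker pair is judged "same", and false elimination, where a same-speaker pair is judged "different") and a one-sided binomial tail probability against pure guessing. The function names are hypothetical, and the chance level of 0.5 assumes a balanced set of same-speaker and different-speaker pairs.

```python
from math import comb


def error_rates(trials):
    """trials: iterable of (same_speaker, judged_same) boolean pairs.

    Returns (false_identification_rate, false_elimination_rate).
    """
    trials = list(trials)
    same = [judged for truth, judged in trials if truth]
    diff = [judged for truth, judged in trials if not truth]
    if not same or not diff:
        raise ValueError("need both same-speaker and different-speaker pairs")
    false_elimination = sum(1 for judged in same if not judged) / len(same)
    false_identification = sum(1 for judged in diff if judged) / len(diff)
    return false_identification, false_elimination


def p_value_vs_chance(n_correct, n_trials, chance=0.5):
    """Probability of at least n_correct right answers by guessing alone."""
    return sum(comb(n_trials, k) * chance ** k * (1 - chance) ** (n_trials - k)
               for k in range(n_correct, n_trials + 1))


# Hypothetical answers of one examiner on six pairs.
answers = [(True, True), (True, False), (False, False),
           (False, True), (False, False), (True, True)]
false_id, false_elim = error_rates(answers)
n_correct = sum(1 for truth, judged in answers if truth == judged)
p = p_value_vs_chance(n_correct, len(answers))
```

Reporting the two error rates separately matters in this setting, because a false identification and a false elimination do not carry the same consequences.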
An initiative in that direction is currently taking place within the GFCP (Groupe Francophone de la Communication Parlée) of the SFA, but it would certainly be appropriate to start a wider-scale action.