Many of the issues raised in connection with experimentation on segmentation (section ) are also relevant here, as should be apparent. One point, which illustrates once again the notions of confound and bias, concerns the relationship between the training and test data used to set up a baseline. Say a group of speakers is available and the set includes one speaker who is markedly different from the rest. If recognisers are trained on a subset of the speakers and tested on the remainder, then when the atypical speaker is included in the training set he is necessarily excluded from the test set; consequently, test performance will be reasonably good even though the trained model is not particularly good, because it incorporates the atypical speaker. Conversely, when the atypical speaker is not included in the training set, the model will be good; however, when this model is tested it will produce poorer performance than previously, because the test set now includes the atypical subject. Thus, it is possible to obtain a better baseline performance from a poor model than from a good one.
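The point can be illustrated with a small numerical sketch (not taken from any original experiment; the speakers, numbers and crude error model below are invented purely for illustration): a model trained on the atypical speaker faces an easy test set, while a better model trained without that speaker faces a harder one.

```python
# Toy illustration of the confound: placing the one atypical speaker in the
# training set leaves an easy test set, so a poorer model yields a better
# baseline than a good model tested on the atypical speaker.

def error_rate(train_speakers, test_speakers):
    """Crude stand-in for training and testing: the 'model' is the mean of the
    training speakers, and error grows with each test speaker's distance from it."""
    model = sum(train_speakers) / len(train_speakers)
    return sum(abs(s - model) for s in test_speakers) / len(test_speakers)

typical = [1.0] * 9          # nine similar speakers
atypical = 5.0               # one markedly different speaker

# Case 1: atypical speaker trained on, hence excluded from the test set.
poor_model_easy_test = error_rate(typical[:5] + [atypical], typical[5:])
# Case 2: atypical speaker held out of training, hence included in the test set.
good_model_hard_test = error_rate(typical[:5], typical[5:8] + [atypical])

print(poor_model_easy_test)  # lower error despite the weaker model
print(good_model_hard_test)  # higher error despite the better model
```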
The prerequisite for assessing progress is an adequate measure of the errors a recogniser produces and of how these reduce over time. Unfortunately, deriving such a measure of error performance is not a simple matter.
Some measures of recognition performance confound errors of segmentation and classification. Thus, one list of the types of event that might occur when comparing a human judge's labels with a machine's is:
The thing to note here is that it is not possible to decide whether deleted and inserted phonemes are instances of segmentation errors or of classification errors: a human judge might label a portion of speech as an affricate whereas the machine might report a plosive plus a fricative. Had the machine used the same segment boundaries as the human, performance might have been equivalent.
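The list of event types itself is not reproduced here, but counts of substituted, deleted and inserted phonemes are commonly obtained by aligning the two label sequences with a minimum edit-distance procedure. The following is only a sketch of that idea (the label sequences are invented, and in practice alignment is often time-mediated); it shows how the affricate example above surfaces as one substitution plus one insertion, leaving open whether segmentation or classification was at fault.

```python
# Illustrative sketch: align a human judge's phone labels against a machine's
# by minimum edit distance and count substitutions, deletions and insertions.
def alignment_counts(reference, hypothesis):
    R, H = len(reference), len(hypothesis)
    # best[i][j] = (edits, subs, dels, ins) for reference[:i] vs hypothesis[:j]
    best = [[None] * (H + 1) for _ in range(R + 1)]
    best[0][0] = (0, 0, 0, 0)
    for i in range(1, R + 1):
        best[i][0] = (i, 0, i, 0)        # delete everything so far
    for j in range(1, H + 1):
        best[0][j] = (j, 0, 0, j)        # insert everything so far
    for i in range(1, R + 1):
        for j in range(1, H + 1):
            diff = 0 if reference[i - 1] == hypothesis[j - 1] else 1
            e, s, d, n = best[i - 1][j - 1]
            sub = (e + diff, s + diff, d, n)      # match or substitution
            e, s, d, n = best[i - 1][j]
            dele = (e + 1, s, d + 1, n)           # deletion by the machine
            e, s, d, n = best[i][j - 1]
            ins = (e + 1, s, d, n + 1)            # insertion by the machine
            best[i][j] = min(sub, dele, ins)
    edits, subs, dels, ins = best[R][H]
    return {"substitutions": subs, "deletions": dels, "insertions": ins}

# The affricate example: the human hears /tS/ as one segment, the machine
# reports a plosive /t/ followed by a fricative /S/.
print(alignment_counts(["tS", "i", "p"], ["t", "S", "i", "p"]))
# -> {'substitutions': 1, 'deletions': 0, 'insertions': 1}
```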
The simplest type of error measure is the number of phonemes that the recogniser got correct compared with the number the human judges got correct. A basic (unresolved) problem for this measure is that if humans cannot provide ``perfect'' classifications, the machine is being scored against noisy data. Specifically, in connection with assessing, say, accuracy of classification, the problem is what counts as the ``correct'' answer for phones on which the judges do not agree. This raises a further issue in connection with a particular technique that has been applied for assessing recognisers: signal detection theory. The technique will first be outlined before the problems in applying it to assess recognisers (human and machine) are discussed.
The basic idea behind signal detection theory is that errors convey information about how the system is operating (in this respect it is an advance on simple error measures). In the signal detection theory model, it is assumed that there is a distribution of activity associated with the event to be detected (i.e., recognition of phoneme A). The recogniser operates according to some criterion such that if activity is above the criterion it reports that the phoneme is present, and if activity is below the criterion it reports that the phoneme did not occur. Usually the criterion is set so that most, but not all, activity associated with a signal leads to that phoneme being recognised. Activity from the signal distribution that falls above the criterion results in the signal being detected (a ``hit''), while activity that falls below it results in a ``miss''. Diagrammatically:
The abscissa is activity level, and the distribution represents the probability of events associated with the signal (phoneme A) at the various activity levels. The signals associated with other phones are ``noise'' in relation to this phoneme, and they give rise to a distribution of noise activity which also influences recognition. The noise distribution likewise represents the probability of activity at each level, and the criterion applied to it is the same as that applied to the signal distribution. For a good recogniser, most of the noise distribution will lie below the criterion, but some will lie above it. When activity from the noise distribution falls below the criterion, the recogniser correctly rejects it as not being associated with phoneme A; when it falls above the criterion, the recogniser incorrectly reports that a signal has occurred --- referred to in signal detection theory as a ``false alarm''. Diagrammatically again:
Since the criterion is at the same activity level in each case, the two figures can be combined to give the complete model of the recognition process:
The error classes described earlier are associated with the categories needed for a signal detection analysis as follows:
With the data available in the form of frequency counts of these categories, standard methods can be employed to ascertain (a) the separation between the means of the noise and signal distributions and (b) the decision criterion that has been applied. These are referred to as d' and β respectively. d' is particularly important in the present context as it is a measure of the discriminability of the signal distribution from the noise distribution which takes into account all the error information available. A worksheet for the calculation of d' and β, together with the tables needed for the calculation, is included as Appendix 1.
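The worksheet and tables themselves appear in Appendix 1; as an indication of the shape of the calculation, the sketch below computes d' and β directly from the four frequency counts under the standard equal-variance Gaussian assumptions (the counts shown are invented, and hit or false-alarm rates of exactly 0 or 1 would need a correction before use).

```python
# Sketch of the d' and beta calculation from hit, miss, false-alarm and
# correct-rejection counts (equal-variance Gaussian model).
from math import exp
from statistics import NormalDist

def d_prime_and_beta(hits, misses, false_alarms, correct_rejections):
    hit_rate = hits / (hits + misses)
    fa_rate = false_alarms / (false_alarms + correct_rejections)
    z = NormalDist().inv_cdf                  # inverse cumulative normal (z-score)
    z_hit, z_fa = z(hit_rate), z(fa_rate)
    d_prime = z_hit - z_fa                    # separation of signal and noise means
    beta = exp((z_fa ** 2 - z_hit ** 2) / 2)  # likelihood ratio at the criterion
    return d_prime, beta

# Example: 80 hits, 20 misses, 10 false alarms, 90 correct rejections.
print(d_prime_and_beta(80, 20, 10, 90))       # roughly d' = 2.12, beta = 1.60
```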
The relationship between the two classes of error --- false alarms and misses --- has been considered an important indicator of performance. Judges, be they machines or humans, can trade between these two classes of error. The trading relationship is referred to as a Receiver Operating Characteristic (ROC).
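A rough sketch of how such a trading relationship arises under the same Gaussian model: holding d' fixed and sweeping the criterion moves the operating point along the ROC, trading misses against false alarms (the d' value and criterion settings below are arbitrary choices for illustration).

```python
# Points along an ROC for a fixed discriminability, obtained by sweeping the
# decision criterion (activity measured in noise standard deviations).
from statistics import NormalDist

norm = NormalDist()
d_prime = 1.5                                     # assumed, fixed discriminability
for criterion in (-0.5, 0.0, 0.5, 1.0, 1.5, 2.0):
    hit_rate = 1 - norm.cdf(criterion - d_prime)  # signal distribution above criterion
    fa_rate = 1 - norm.cdf(criterion)             # noise distribution above criterion
    print(f"criterion {criterion:+.1f}: hit rate {hit_rate:.2f}, false-alarm rate {fa_rate:.2f}")
```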
The problem alluded to earlier in connection with SDT is deciding what is signal and what is noise. In earlier work it has been assumed that human judges are capable of providing the ``correct'' answers. However, agreement between judges is notoriously low even for gross classifications (in the stuttering literature, inter-judge agreement about stutterings is as low as 60% even for expert judges). The finer level of classification called for here would lead one to expect that agreement about phone classes will also be low. Possible ways out of this dilemma are (1) improvements in psychophysical procedures and (2), relatedly, normalisation procedures across judges to obtain some composite level of agreement.
Functional adequacy refers to the fact that recognisers only have to perform a limited range of functions (they might not be required to deal with unrestricted speech, for example). User acceptance refers to the fact that users might tolerate something that is not perfect. Each of these topics calls for metrics other than percent correct and can involve subjective judgments on the part of subjects. For these topics, then, it is necessary to consider the best way of obtaining information from users about the acceptability of a system.
The recommended way of obtaining this information is a summated rating scale (Likert, 1932). These scales are constructed by preparing sets of statements designed to measure an individual's attitude towards a particular concept (here, for instance, recogniser acceptance). Typically a scale comprises several different sub-scales (in assessing user acceptance of a recognition system, these sub-scales might include response time, format of feedback to the user, etc.). Respondents indicate the extent to which they agree with each statement by giving a rating, usually between 1 and 5. In order to counterbalance response biases, it is usual to phrase the statements so that, in this example, affirmative user acceptance would lead to low ratings for some statements and high ratings for others. Examples of statements and a response format that might be appropriate for assessing user acceptance are:
I found the system easy to use.                             1  2  3  4  5
Sometimes I experienced difficulties in using the system.   1  2  3  4  5
These two statements would tend to lead users to use opposite poles of the rating scale; during analysis, the scale values of one of them need to be reversed.
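A minimal sketch of that reversal and of summing the ratings (the item names, the 1-to-5 range and the data are invented for illustration):

```python
# Reverse-score negatively phrased Likert items before summation (assumed
# 1-to-5 scale; item names invented for the example).
def summated_score(ratings, reversed_items, scale_max=5):
    total = 0
    for item, rating in ratings.items():
        if item in reversed_items:
            rating = (scale_max + 1) - rating   # maps 1<->5, 2<->4, 3 stays 3
        total += rating
    return total

response = {"easy_to_use": 5, "experienced_difficulties": 2}
print(summated_score(response, reversed_items={"experienced_difficulties"}))  # 5 + 4 = 9
```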
The advantages of Likert scales are:
A limitation is that some level of expertise and statistical sophistication is needed to develop and validate a good scale.
The construction of questionnaires based on Likert's scale format for the items of any identified concept will involve the following steps: (i) define the concept or set of sub-concepts to be measured; (ii) design the scale; (iii) carry out a preliminary assessment of the questionnaire; (iv) administer the questionnaire and perform item analysis; (v) validate the scale and produce norms.
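As an indication of what the item analysis in step (iv) might involve, the sketch below computes Cronbach's alpha and corrected item-total correlations for a small invented set of responses (one row per respondent, one column per item); this is only one of several possible item-analysis procedures.

```python
# Item analysis sketch: Cronbach's alpha and corrected item-total correlations.
from statistics import pvariance, correlation   # statistics.correlation needs Python 3.10+

responses = [            # invented ratings: rows are respondents, columns are items
    [5, 4, 4, 5],
    [3, 3, 2, 3],
    [4, 4, 5, 4],
    [2, 1, 2, 2],
    [4, 5, 4, 4],
]

items = list(zip(*responses))                    # one tuple of ratings per item
totals = [sum(row) for row in responses]         # each respondent's summated score
k = len(items)

# Cronbach's alpha: internal consistency of the k items.
alpha = (k / (k - 1)) * (1 - sum(pvariance(item) for item in items) / pvariance(totals))
print(f"Cronbach's alpha = {alpha:.2f}")

# Corrected item-total correlation: each item against the total of the others.
for i, item in enumerate(items, start=1):
    rest = [total - item[j] for j, total in enumerate(totals)]
    print(f"item {i}: item-total correlation = {correlation(item, rest):.2f}")
```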
Factor analysis may be used for two purposes in validating the scales. Programs suitable for carrying out such analyses include LISREL (Joreskog and Sorbom, 1984) and EQS (Bentler, 1985).
Assessment as part of the application will lead to speech being encountered that was not involved in setting up the material employed during training and testing.
It has been proposed that the performance of a newly developed recogniser should be compared against a reference algorithm (e.g. Chollet and Gagnoulet). The procedures for comparing performance between the reference and the newly developed algorithm would be similar to, and encounter the same problems as, those described in connection with comparing human with human and human with algorithm performance.
The procedures for calibrating databases rely partly on checking that the sampling of the corpus is satisfactory (see section ), and partly on being able to compare performance against known answers (the problems involved in providing these have been described in section ).
Special data may need to be constructed in order to test specific ideas about why the performance of a recogniser is poor. The difficulty may lie in dealing with breathing noises, hesitations, etc., or in recognising particular phonemes or phoneme types. The construction of special data for these purposes needs to bear in mind the concerns discussed above in connection with providing adequate samples of speech (section ).
There has been relatively little work on which psychophysical procedures are appropriate for assessing speech recognisers. Fortunately, there is a second area of investigation which involves similar questions --- measuring intelligibility in the hearing impaired using different procedures. It is straightforward to see how these techniques could be applied to speech synthesis, where two or more synthetic versions can be compared using the procedures. In applying the procedures to speech recognition, two or more recognisers would have to produce output which is then assessed with the procedures. The output of a recogniser could be in the form of transcriptions which subjects would judge, or it could be converted to speech using a text-to-speech system, and the tests could then proceed in exactly the same way as with synthetic speech.
In magnitude estimation, subjects choose a positive number to represent the subjective magnitude of the intelligibility of the output of different recognisers.
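A common way of summarising such judgements, sketched below with invented numbers and placeholder recogniser names, is to normalise each subject's estimates and then take geometric means per recogniser.

```python
# Sketch: summarise magnitude estimates by normalising within each subject and
# taking geometric means per recogniser (data and names are invented).
from statistics import geometric_mean

estimates = {                      # subject -> {recogniser: magnitude estimate}
    "s1": {"rec_A": 20, "rec_B": 40, "rec_C": 80},
    "s2": {"rec_A": 5,  "rec_B": 15, "rec_C": 20},
}

normalised = {}
for subject, judgements in estimates.items():
    subject_gm = geometric_mean(judgements.values())      # remove each subject's own scale
    for rec, value in judgements.items():
        normalised.setdefault(rec, []).append(value / subject_gm)

for rec, values in sorted(normalised.items()):
    print(f"{rec}: {geometric_mean(values):.2f}")          # relative intelligibility
```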
The rank order procedure, as its name suggests, would require subjects to place the different recogniser outputs in order of magnitude of increasing intelligibility.
In paired comparison, as applied to recogniser assessment, subjects judge which of two recogniser outputs is the more intelligible.
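One simple way to aggregate such judgements, sketched below with invented data and placeholder recogniser names, is to count how often each recogniser's output is preferred over the comparisons in which it appears (more elaborate models, e.g. of the Bradley-Terry type, could also be fitted).

```python
# Illustrative aggregation of paired-comparison judgements: each tuple records
# which of two recogniser outputs a subject judged more intelligible.
from collections import Counter

judgements = [
    ("rec_A", "rec_B", "rec_A"),   # (first output, second output, preferred)
    ("rec_A", "rec_B", "rec_A"),
    ("rec_B", "rec_C", "rec_C"),
    ("rec_A", "rec_C", "rec_C"),
    ("rec_B", "rec_C", "rec_B"),
]

wins = Counter(preferred for _, _, preferred in judgements)
appearances = Counter()
for first, second, _ in judgements:
    appearances[first] += 1
    appearances[second] += 1

for rec in sorted(appearances):
    print(f"{rec}: preferred in {wins[rec]}/{appearances[rec]} comparisons")
```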