Many of the issues raised in connection with experimentation on segmentation (section ) are also relevant here, as should be apparent. One point, which illustrates once again the notions of confound and bias, concerns the relationship between the training and test data used to set up a baseline. Say a group of speakers is available and the set includes one speaker who is markedly different from the rest. If recognisers are trained on a subset of the speakers and tested on the remainder, then when the atypical speaker is included in the training set he is necessarily excluded from the test set; consequently, test performance will be reasonably good even though the trained model is not particularly good, because it incorporates the atypical speaker. Conversely, when the atypical speaker is not included in the training set, the model will be good; however, when this model is tested it will produce poorer performance than previously, because the test set now includes the atypical subject. Thus, it is possible to obtain a better baseline performance from a poor model than from a good one.
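The point can be illustrated with a small numerical sketch (not taken from any original experiment; the speakers, numbers and crude error model below are invented purely for illustration): a model trained on the atypical speaker faces an easy test set, while a better model trained without that speaker faces a harder one.

```python
# Toy illustration of the confound: placing the one atypical speaker in the
# training set leaves an easy test set, so a poorer model yields a better
# baseline than a good model tested on the atypical speaker.

def error_rate(train_speakers, test_speakers):
    """Crude stand-in for training and testing: the 'model' is the mean of the
    training speakers, and error grows with each test speaker's distance from it."""
    model = sum(train_speakers) / len(train_speakers)
    return sum(abs(s - model) for s in test_speakers) / len(test_speakers)

typical = [1.0] * 9          # nine similar speakers
atypical = 5.0               # one markedly different speaker

# Case 1: atypical speaker trained on, hence excluded from the test set.
poor_model_easy_test = error_rate(typical[:5] + [atypical], typical[5:])
# Case 2: atypical speaker held out of training, hence included in the test set.
good_model_hard_test = error_rate(typical[:5], typical[5:8] + [atypical])

print(poor_model_easy_test)  # lower error despite the weaker model
print(good_model_hard_test)  # higher error despite the better model
```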
The prerequisite for assessing progress is an adequate measure of the errors a recogniser produces and of how these reduce over time. Unfortunately, deriving such a measure of error performance is not a simple matter.
Some measures of recognition performance confound errors of segmentation and classification. Thus, one list of the types of event that might occur when comparing a human judge's labels with a machine's is:
The thing to note here is that it is not possible to decide whether deleted and inserted phonemes are instances of segmentation errors or of classification errors: a human judge might label a portion of speech as an affricate whereas the machine might report a plosive plus a fricative. Had the machine used the same segment boundaries as the human, performance might have been equivalent.
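The list of event types itself is not reproduced here, but counts of substituted, deleted and inserted phonemes are commonly obtained by aligning the two label sequences with a minimum edit-distance procedure. The following is only a sketch of that idea (the label sequences are invented, and in practice alignment is often time-mediated); it shows how the affricate example above surfaces as one substitution plus one insertion, leaving open whether segmentation or classification was at fault.

```python
# Illustrative sketch: align a human judge's phone labels against a machine's
# by minimum edit distance and count substitutions, deletions and insertions.
def alignment_counts(reference, hypothesis):
    R, H = len(reference), len(hypothesis)
    # best[i][j] = (edits, subs, dels, ins) for reference[:i] vs hypothesis[:j]
    best = [[None] * (H + 1) for _ in range(R + 1)]
    best[0][0] = (0, 0, 0, 0)
    for i in range(1, R + 1):
        best[i][0] = (i, 0, i, 0)        # delete everything so far
    for j in range(1, H + 1):
        best[0][j] = (j, 0, 0, j)        # insert everything so far
    for i in range(1, R + 1):
        for j in range(1, H + 1):
            diff = 0 if reference[i - 1] == hypothesis[j - 1] else 1
            e, s, d, n = best[i - 1][j - 1]
            sub = (e + diff, s + diff, d, n)      # match or substitution
            e, s, d, n = best[i - 1][j]
            dele = (e + 1, s, d + 1, n)           # deletion by the machine
            e, s, d, n = best[i][j - 1]
            ins = (e + 1, s, d, n + 1)            # insertion by the machine
            best[i][j] = min(sub, dele, ins)
    edits, subs, dels, ins = best[R][H]
    return {"substitutions": subs, "deletions": dels, "insertions": ins}

# The affricate example: the human hears /tS/ as one segment, the machine
# reports a plosive /t/ followed by a fricative /S/.
print(alignment_counts(["tS", "i", "p"], ["t", "S", "i", "p"]))
# -> {'substitutions': 1, 'deletions': 0, 'insertions': 1}
```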
The simplest type of error measure is the number of phonemes that the recogniser got correct compared with the number the human judges got correct. A basic (unresolved) problem for this measure is that if humans cannot provide ``perfect'' classifications, the machine is being scored against noisy data. Specifically, in connection with assessing, say, accuracy of classification, the problem is what counts as the ``correct'' answer for phones on which the judges do not agree. This raises a further issue in connection with a particular technique that has been applied for assessing recognisers: signal detection theory. The technique will first be outlined before the problems in applying it to assess recognisers (human and machine) are discussed.
The basic idea behind signal detection theory is that errors convey information about how the system is operating (in this respect it is an advance on simple error measures). In the signal detection theory model, it is assumed that there is a distribution of activity associated with the event to be detected (i.e., recognition of phoneme A). The recogniser operates according to some criterion such that if activity is above the criterion it reports that the phoneme is present, and if activity is below the criterion it reports that the phoneme did not occur. Usually the criterion is set so that most, but not all, activity associated with a signal leads to that phoneme being recognised. Activity from the signal distribution that falls above the criterion results in the signal being detected (a ``hit''), while activity that falls below it results in a ``miss''. Diagrammatically:
The abscissa is activity level, and the distribution represents the probability of events associated with the signal (phoneme A) at the various activity levels. The signals associated with other phones are ``noise'' in relation to this phoneme, and they give rise to a distribution of noise activity which also influences recognition. The noise distribution likewise represents the probability of activity at each level, and the criterion applied to it is the same as that applied to the signal distribution. For a good recogniser, most of the noise distribution will lie below the criterion, but some will lie above it. When activity from the noise distribution falls below the criterion, the recogniser correctly rejects it as not being associated with phoneme A; when it falls above the criterion, the recogniser incorrectly reports that a signal has occurred --- referred to in signal detection theory as a ``false alarm''. Diagrammatically again:
Since the criterion is at the same activity level in each case, the two figures can be combined to give the complete model of the recognition process:
The error classes described earlier are associated with the categories needed for a signal detection analysis as follows:
With the data available in the form of frequency counts of these categories, standard methods can be employed to ascertain (a) the separation between the means of the noise and signal distributions and (b) the decision criterion that has been applied. These are referred to as d' and β respectively. d' is particularly important in the present context as it is a measure of the discriminability of the signal distribution from the noise distribution which takes into account all the error information available. A worksheet for the calculation of d' and β, together with the tables needed for the calculation, is included as Appendix 1.
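The worksheet and tables themselves appear in Appendix 1; as an indication of the shape of the calculation, the sketch below computes d' and β directly from the four frequency counts under the standard equal-variance Gaussian assumptions (the counts shown are invented, and hit or false-alarm rates of exactly 0 or 1 would need a correction before use).

```python
# Sketch of the d' and beta calculation from hit, miss, false-alarm and
# correct-rejection counts (equal-variance Gaussian model).
from math import exp
from statistics import NormalDist

def d_prime_and_beta(hits, misses, false_alarms, correct_rejections):
    hit_rate = hits / (hits + misses)
    fa_rate = false_alarms / (false_alarms + correct_rejections)
    z = NormalDist().inv_cdf                  # inverse cumulative normal (z-score)
    z_hit, z_fa = z(hit_rate), z(fa_rate)
    d_prime = z_hit - z_fa                    # separation of signal and noise means
    beta = exp((z_fa ** 2 - z_hit ** 2) / 2)  # likelihood ratio at the criterion
    return d_prime, beta

# Example: 80 hits, 20 misses, 10 false alarms, 90 correct rejections.
print(d_prime_and_beta(80, 20, 10, 90))       # roughly d' = 2.12, beta = 1.60
```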
The relationship between the two classes of error --- false alarms and misses --- has been considered an important indicator of performance. Judges, be they machines or humans, can trade between these two classes of error. The trading relationship is referred to as a Receiver Operating Characteristic (ROC).
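A rough sketch of how such a trading relationship arises under the same Gaussian model: holding d' fixed and sweeping the criterion moves the operating point along the ROC, trading misses against false alarms (the d' value and criterion settings below are arbitrary choices for illustration).

```python
# Points along an ROC for a fixed discriminability, obtained by sweeping the
# decision criterion (activity measured in noise standard deviations).
from statistics import NormalDist

norm = NormalDist()
d_prime = 1.5                                     # assumed, fixed discriminability
for criterion in (-0.5, 0.0, 0.5, 1.0, 1.5, 2.0):
    hit_rate = 1 - norm.cdf(criterion - d_prime)  # signal distribution above criterion
    fa_rate = 1 - norm.cdf(criterion)             # noise distribution above criterion
    print(f"criterion {criterion:+.1f}: hit rate {hit_rate:.2f}, false-alarm rate {fa_rate:.2f}")
```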
The problem alluded to earlier in connection with SDT is deciding what is signal and what is noise. In earlier work it has been assumed that human judges are capable of providing the ``correct'' answers. However, agreement between judges is notoriously low even for gross classifications (in the stuttering literature, inter-judge agreement about stutterings is as low as 60% even for expert judges). The finer level of classification called for here would lead one to expect that agreement about phone classes will also be low. Possible ways out of this dilemma are (1) improvements in psychophysical procedures and (2), relatedly, normalisation procedures across judges to obtain some composite level of agreement.
Functional adequacy refers to the fact that recognisers only have to perform a limited range of functions (they might not be required to deal with unrestricted speech, for example). User acceptance refers to the fact that users might tolerate something that is not perfect. Each of these topics calls for metrics other than percent correct and can involve subjective judgments on the part of subjects. For these topics, then, it is necessary to consider the best way of obtaining information from users about the acceptability of a system.
The recommended way of obtaining this information is a summated rating scale (Likert, 1932). These scales are constructed by preparing sets of statements designed to measure an individual's attitude towards a particular concept (here, for instance, recogniser acceptance). Typically a scale comprises several different sub-scales (in assessing user acceptance of a recognition system, these sub-scales might include response time, format of feedback to the user, etc.). Respondents indicate the extent to which they agree with each statement by giving a rating, usually between 1 and 5. In order to counterbalance response biases, it is usual to phrase the statements so that, in this example, affirmative user acceptance would lead to low ratings for some statements and high ratings for others. Examples of statements and a response format that might be appropriate for assessing user acceptance are:
I found the system easy to use.                             1  2  3  4  5
Sometimes I experienced difficulties in using the system.   1  2  3  4  5
These two statements would tend to lead users to use opposite poles of the rating scale; during analysis, the scale values of one of them need to be reversed.
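A minimal sketch of that reversal and of summing the ratings (the item names, the 1-to-5 range and the data are invented for illustration):

```python
# Reverse-score negatively phrased Likert items before summation (assumed
# 1-to-5 scale; item names invented for the example).
def summated_score(ratings, reversed_items, scale_max=5):
    total = 0
    for item, rating in ratings.items():
        if item in reversed_items:
            rating = (scale_max + 1) - rating   # maps 1<->5, 2<->4, 3 stays 3
        total += rating
    return total

response = {"easy_to_use": 5, "experienced_difficulties": 2}
print(summated_score(response, reversed_items={"experienced_difficulties"}))  # 5 + 4 = 9
```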
The advantages of Likert scales are:
A limitation is that some level of expertise and statistical sophistication is needed to develop and validate a good scale.
The construction of questionnaires based on Likert's scale format for the items of any identified concept will involve the following steps: (i) define the concept or set of sub-concepts to be measured; (ii) design the scale; (iii) carry out a preliminary assessment of the questionnaire; (iv) administer the questionnaire and perform item analysis; (v) validate the scale and produce norms.
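As an indication of what the item analysis in step (iv) might involve, the sketch below computes Cronbach's alpha and corrected item-total correlations for a small invented set of responses (one row per respondent, one column per item); this is only one of several possible item-analysis procedures.

```python
# Item analysis sketch: Cronbach's alpha and corrected item-total correlations.
from statistics import pvariance, correlation   # statistics.correlation needs Python 3.10+

responses = [            # invented ratings: rows are respondents, columns are items
    [5, 4, 4, 5],
    [3, 3, 2, 3],
    [4, 4, 5, 4],
    [2, 1, 2, 2],
    [4, 5, 4, 4],
]

items = list(zip(*responses))                    # one tuple of ratings per item
totals = [sum(row) for row in responses]         # each respondent's summated score
k = len(items)

# Cronbach's alpha: internal consistency of the k items.
alpha = (k / (k - 1)) * (1 - sum(pvariance(item) for item in items) / pvariance(totals))
print(f"Cronbach's alpha = {alpha:.2f}")

# Corrected item-total correlation: each item against the total of the others.
for i, item in enumerate(items, start=1):
    rest = [total - item[j] for j, total in enumerate(totals)]
    print(f"item {i}: item-total correlation = {correlation(item, rest):.2f}")
```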
Factor analysis may be used for two purposes in validating the scales. Programs suitable for carrying out such analyses include LISREL (Joreskog and Sorbom, 1984) and EQS (Bentler, 1985).
Assessment as part of the application will lead to speech being encountered that was not involved in setting up the material employed during training and testing.
It has been proposed that the performance of a newly developed recogniser should be compared against a reference algorithm (e.g. Chollet and Gagnoulet). The procedures for comparing performance between the reference and the newly developed algorithm would be similar to, and encounter the same problems as, those described in connection with comparing human with human and human with algorithm performance.
The procedures for calibrating databases rely partly on checking that the sampling of the corpus is satisfactory (see section ), and partly on being able to compare performance against known answers (the problems involved in providing these have been described in section ).
Special data may need to be constructed in order to test specific ideas about why the performance of a recogniser is poor. The difficulty may lie in dealing with breathing noises, hesitations, etc., or in recognising particular phonemes or phoneme types. The construction of special data for these purposes needs to bear in mind the concerns discussed above in connection with providing adequate samples of speech (section ).
There has been relatively little work on which psychophysical procedures are appropriate for assessing speech recognisers. Fortunately, there is a second area of investigation which involves similar questions --- measuring intelligibility in the hearing impaired using different procedures. It is straightforward to see how these techniques could be applied to speech synthesis, where two or more synthetic versions can be compared using the procedures. In applying the procedures to speech recognition, two or more recognisers would have to produce output which is then assessed with the procedures. The output of a recogniser could be in the form of transcriptions which subjects would judge, or it could be converted to speech using a text-to-speech system, and the tests could then proceed in exactly the same way as with synthetic speech.
In magnitude estimation, subjects choose a positive number to represent the subjective magnitude of the intelligibility of the output of different recognisers.
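A common way of summarising such judgements, sketched below with invented numbers and placeholder recogniser names, is to normalise each subject's estimates and then take geometric means per recogniser.

```python
# Sketch: summarise magnitude estimates by normalising within each subject and
# taking geometric means per recogniser (data and names are invented).
from statistics import geometric_mean

estimates = {                      # subject -> {recogniser: magnitude estimate}
    "s1": {"rec_A": 20, "rec_B": 40, "rec_C": 80},
    "s2": {"rec_A": 5,  "rec_B": 15, "rec_C": 20},
}

normalised = {}
for subject, judgements in estimates.items():
    subject_gm = geometric_mean(judgements.values())      # remove each subject's own scale
    for rec, value in judgements.items():
        normalised.setdefault(rec, []).append(value / subject_gm)

for rec, values in sorted(normalised.items()):
    print(f"{rec}: {geometric_mean(values):.2f}")          # relative intelligibility
```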
The rank order procedure, as its name suggests, would require subjects to place the different recogniser outputs in order of magnitude of increasing intelligibility.
In paired comparison, as applied to recogniser assessment, subjects judge which of two recogniser outputs is the more intelligible.
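One simple way to aggregate such judgements, sketched below with invented data and placeholder recogniser names, is to count how often each recogniser's output is preferred over the comparisons in which it appears (more elaborate models, e.g. of the Bradley-Terry type, could also be fitted).

```python
# Illustrative aggregation of paired-comparison judgements: each tuple records
# which of two recogniser outputs a subject judged more intelligible.
from collections import Counter

judgements = [
    ("rec_A", "rec_B", "rec_A"),   # (first output, second output, preferred)
    ("rec_A", "rec_B", "rec_A"),
    ("rec_B", "rec_C", "rec_C"),
    ("rec_A", "rec_C", "rec_C"),
    ("rec_B", "rec_C", "rec_B"),
]

wins = Counter(preferred for _, _, preferred in judgements)
appearances = Counter()
for first, second, _ in judgements:
    appearances[first] += 1
    appearances[second] += 1

for rec in sorted(appearances):
    print(f"{rec}: preferred in {wins[rec]}/{appearances[rec]} comparisons")
```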