
Statistical and experimental procedures for analysing data corpora

In the first part of this section, some fundamental ideas in statistics will be illustrated through selected LES examples. In the subsequent parts, the steps in the scenario that has just been run through will be recapped to consider those issues that cannot be addressed until all components introduced are understood. The issues that involve experimentation will then be dealt with.

Statistical analysis

Statistics is the acquisition, handling, presentation and interpretation of numerical data. Language engineers have considerable experience acquiring and presenting data, but less in interpretation. The following will, then, be concerned with the latter.

Population/s, samples and other terminology

A population is the collection of all objects that are of interest for the task in hand. In the earlier example, all members of the country are the population. Here everyday use of the term population corresponds with its use in a statistical sense. Though population in a statistical sense can have the same meaning as in the geographical sense, it need not: for instance, the population of users of a bank's speaker verification system would comprise only the clients of that bank. Nor does population refer only to humans --- for example, the population of /p/ phonemes of a speaker would be all of the instances of that phoneme the speaker ever produces.

A variable is a numerical value associated with each unit of the population. Variables are divided into independent and dependent variables. An independent variable is one that is controlled or manipulated by the experimenter. So, for example, when setting up a corpus, the experimenter might deem it necessary to ensure that as many females are recorded in the test data as males. Sex would then be an independent variable (independent variables are also referred to as factors, particularly in connection with the statistical technique Analysis of Variance discussed later). A dependent variable is a variable that is measured on each experimental unit (each speaker in the preceding example), and the investigator is interested in whether its value is affected by independent variables. Thus, if fundamental frequency is the dependent variable, it is highly likely that it will be affected by the independent variable of speaker's sex.

When a variable is measured on all units of a population, a census has been taken. If it were always possible to obtain census data, there would be no need for statistics. However, since most language engineering applications (and, indeed, many other fields that require measurement) involve virtually infinite populations (such as those of speakers or phonemes illustrated earlier), it is not possible to measure variables on all units. In these circumstances, a finite sample is taken. This sample is used to study the variable of concern in the population. So, if you wanted an idea of the average voice fundamental frequency of men, you might make measurements on a sample of 100 men. This sample is then studied as if it were representative of the population. The statistician is able to offer an idea about the relationship between variables measured on the sample (here its mean) and what the investigator is really interested in: the mean voice fundamental frequency of the population.

Sampling

The main problem in treating data statistically is how to ensure that you obtain reliable information about the population from a sample. For this, the sample must be representative. The main requirement for achieving this is to take a simple random sample: a sample is simple random if every member of the population has the same chance of being selected as every other member. Thus, if in setting up the ANN recogniser speakers from the lab are used to obtain training data, the sample would not be simple random: it is unlikely that sampling from lab members gives females, all social strata, or people outside working age a fair chance of being selected.

Biases

Selection of a sample that is not a simple random sample is one of the main sources of bias in experiments. Bias can be defined as a systematic tendency to misrepresent the population. So, if the ANN recogniser is intended to be used by all members of the population, you cannot obtain an unbiased sample of speakers from recordings made just between 9 a.m. and 5 p.m.: this would exclude speakers who are at work, so you would have a biased sample which is not necessarily representative of the target population.

If you take a sample, how sure can you be that a variable measured on it, such as the sample mean, lies close to the corresponding value for the population? This sort of problem is termed estimation and is considered in the following sub-sections.

Estimating sample means, proportions and variances

Estimation is used for making decisions about populations based on simple random samples. A truly random sample is likely to be representative of the population; this does not mean, however, that a variable measured on a second sample will have the same value as on the first. The skill involved in estimating the value of a variable is to impose conditions that allow an acceptable degree of error in the estimate without being so conservative as to be useless in practice (an extreme case of the latter would be recommending a sample of the same order of magnitude as the population). The necessary background skill is to understand how quantities like sample means, variances and proportions are related to their counterparts in the population.

Estimating means

A fundamental step towards this goal is to relate the sample statistic to a probability distribution. What this means is: if we repeatedly take samples from a population, how do the variables measured on the samples relate to those of the population? To translate this into an empirical example: how sure can you be about how close your sample mean lies to the population mean? Even more concretely, if we obtain the means of a set of samples, how does the mean of a particular sample relate to the mean of the population? As has already been said, the mean of the first of two samples is unlikely to be exactly the same as that of the second. However, if repeated samples are taken, the mean values of all the samples will cluster around the population mean. This fact is usually described by saying that the sample mean is an unbiased estimator of the population mean.

The usual way this is shown is to take a known distribution (i.e., where the population mean is known) and then consider what the distribution would be like when samples of a given size are taken. So, if a population of events has equally likely outcomes and the variable values are 1, 2, 3 and 4, the mean would be 2.5. If all possible pairs are taken (1 and 2, 1 and 3, 1 and 4, 2 and 3, 2 and 4, 3 and 4), the mean of the mean values for all pairs is also 2.5 (taking all pairs is a way of ensuring that the sample is simple random). An additional important finding is that if the distribution of sample means (the sampling distribution) is plotted as a histogram, the distribution is no longer rectangular but has a peak at 2.5 (1 and 4, and 2 and 3, both have a mean of 2.5, and no other pair shares a mean). Moreover, the distribution is symmetrical about the mean and approximates a normal (Gaussian) distribution even though the original distribution was not normal. As sample size gets larger, the approximation to the normal distribution gets better. This tendency applies to all distributions, not just the rectangular distribution considered (the tendency of the means of large samples to approximate the normal distribution is a case of the central limit theorem). This particular result has far-reaching implications when testing between alternative hypotheses (see below). As a rule of thumb (Recommendation 1), sample sizes of 30 or greater are adequate to approximate the normal distribution.
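The pair enumeration just described can be checked mechanically. This sketch (plain Python, standard library only) lists all samples of size two from the values 1 to 4 and confirms that the mean of the sampling distribution equals the population mean, and that the distribution peaks at 2.5:

```python
from itertools import combinations
from statistics import mean
from collections import Counter

population = [1, 2, 3, 4]          # rectangular distribution: all values equally likely

# All simple random samples of size two (sampling without replacement)
pair_means = [mean(pair) for pair in combinations(population, 2)]

# The mean of the sampling distribution equals the population mean: unbiasedness
print(mean(pair_means))            # 2.5, same as the population mean
print(mean(population))            # 2.5

# The sampling distribution peaks at 2.5: the pairs (1,4) and (2,3) share that mean
print(Counter(pair_means))
```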

Another important aspect of the situation described is that the sample means have a standard deviation (sd). The sd of the sample means (here the sd of all samples of size two for the rectangular distribution) is related to the sd of the original distribution by the formula:

   sd of the sample means = sd / √n

This quantity is given a particular name to distinguish it from the sd --- it is called the standard error (se). It is used in the computation of another quantity, the z score of the sample:

   z = (sample mean - population mean) / se

The importance of this quantity is that the measure can be translated into a probabilistic statement relating the sample and population means. Put another way, from the z score, the probability of a sample mean lying that far from the population mean can be computed.

To show how this is used in practice: if a sample of size n = 200 is taken, what is the probability that the sample mean is within 1.5 se of the population mean? Normal distribution tables give the desired area directly as 0.8664 (see Appendix 1). Thus, approximately 86.6% of all samples of size 200 will have means within 1.5 se of the population mean. If, as in any real experiment, one sample is taken, we can make a probabilistic statement about how likely that sample mean is to lie within the specified distance of the population mean.
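The table look-up can be reproduced with the standard normal cumulative distribution function; a minimal sketch using the Python standard library's NormalDist:

```python
from statistics import NormalDist

z = 1.5
# P(-1.5 < Z < 1.5) for the standard normal distribution
p = NormalDist().cdf(z) - NormalDist().cdf(-z)
print(round(p, 4))   # 0.8664, matching the tabulated value
```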

Before leaving this section, it is necessary to consider what to do when making corresponding statements about small samples, which cannot be approximated with the normal distribution. Here computation of the mean and standard error proceeds as before. However, since the quantity z relies on the normal distribution tables, it cannot be used. Instead the analogous quantity t is calculated:

   t = (sample mean - population mean) / (s / √n)

where s is the sd estimated from the sample itself.

The distribution of t depends on sample size, and so (in essence) the t value has to be referred to a different table for each size of sample. The tables corresponding to the t distribution are usually collapsed into one table, and the section of the table to use is accessed by a parameter related to the sample size (the quantity used for accessing the table is n - 1 and is called the degrees of freedom). Clearly, since several different distributions are being tabulated, some condensation of the information relative to the standard normal deviates is necessary. For this reason, only t values corresponding to particular probabilities are given. Consideration of t tables emphasises one of the advantages of the central limit theorem: one normal table can be used for a wide variety of problems, whereas t requires a different distribution for each sample size.

Estimating proportions

Here the problem faced is similar to that with means: a sample has been taken, and the proportions of people meeting some criterion and not meeting that criterion are observed. The question is with what degree of confidence you can assert that the proportions observed reflect those in the population. Once again the solution is directly related to that discussed when estimating how close a sample mean lies to the population mean using z scores. In essence, the z score for means measures:

   z = (sample mean - population mean) / se

The only difference here is that binomial events are being considered (meet / do not meet the criterion). Since the mean of a binomial distribution is np (number tested × population proportion p) and the se is √(npq) (where q = 1 - p), the z score associated with a particular sample is:

   z = (x - np) / √(npq)

where x is the observed number meeting the criterion.

Normal distribution tables can again be used to assign a probability associated with this particular outcome.
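As a sketch of the computation, the following uses hypothetical figures (200 subjects of whom 90 meet the criterion, against an expected proportion of 0.5; these numbers are illustrative and not from the text):

```python
from math import sqrt
from statistics import NormalDist

n, p = 200, 0.5          # hypothetical sample size and expected proportion
q = 1 - p
x = 90                   # hypothetical observed count meeting the criterion

z = (x - n * p) / sqrt(n * p * q)               # z = (x - np) / sqrt(npq)
two_sided = 2 * (1 - NormalDist().cdf(abs(z)))  # probability of a deviation this large
print(round(z, 2), round(two_sided, 3))
```

The two-sided probability here exceeds 5%, so such an outcome would not be surprising under the expected proportion.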

To illustrate with an example: suppose that it is expected that as many men will use the ANN system as will women (p = 0.5). What size of sample is needed to be 95% certain that the proportion of men and women in the sample differs from that in the population by at most 4%? For 95% certainty the difference must lie within 1.96 se of the population proportion, so the requirement is:

   1.96 × √(pq/n) = 0.04, i.e. n = (1.96 / 0.04)² × 0.5 × 0.5

Solving for n gives 600.25. Therefore, a sample of size at least 601 should be used.

Now what happens if the tolerance is tightened, say from a 4% difference to 2%? The required sample size jumps to 2401.
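The calculation can be sketched as follows; `sample_size` is a hypothetical helper name, and the value 1.96 is the standard two-tailed 95% point of the normal distribution:

```python
from math import ceil

def sample_size(p, tolerance, z=1.96):
    """Smallest n such that z * sqrt(p*q/n) equals the tolerance."""
    q = 1 - p
    return (z / tolerance) ** 2 * p * q

print(sample_size(0.5, 0.04))        # 600.25 -> a sample of at least 601
print(ceil(sample_size(0.5, 0.04)))  # 601
print(sample_size(0.5, 0.02))        # 2401.0: halving the tolerance quadruples n
```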

Estimating variance

The variance s² of a sample drawn from a normal population is related to the population variance σ² through the quantity (n - 1)s²/σ², which is distributed as chi-squared with n - 1 degrees of freedom.

Thus, if we have a sample of size 10 drawn from a normal population with variance 12, the probability of the sample variance exceeding 18 is the probability that chi-squared with 10 - 1 = 9 degrees of freedom exceeds:

   (n - 1)s² / σ² = 9 × 18 / 12 = 13.5

Because chi-squared values are only tabulated for particular probabilities (as with t), the probability can only be located within limits: in this case it lies between 0.2 and 0.1.
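Since the tables only list selected probabilities, the tail probability can also be approximated numerically; a sketch in pure Python, integrating the chi-squared density by the trapezoidal rule (the function name `chi2_sf` and the integration limits are choices made here, not standard API):

```python
from math import gamma, exp

def chi2_sf(x, k, upper=200.0, steps=200000):
    """P(chi-squared with k degrees of freedom > x), by trapezoidal integration."""
    c = 1.0 / (2 ** (k / 2) * gamma(k / 2))
    def pdf(t):
        return c * t ** (k / 2 - 1) * exp(-t / 2)
    h = (upper - x) / steps
    total = 0.5 * (pdf(x) + pdf(upper))
    for i in range(1, steps):
        total += pdf(x + i * h)
    return total * h

stat = (10 - 1) * 18 / 12          # (n-1) s^2 / sigma^2 = 13.5
p = chi2_sf(stat, k=9)
print(round(p, 3))                  # lies between 0.1 and 0.2, as the tables indicate
```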

Ratio of sample variances

If two independent samples are taken from two normal populations with variances σ₁² and σ₂², the ratio of the two sample variances (s₁² and s₂²), each scaled by its population variance, has the F distribution:

   F = (s₁² / σ₁²) / (s₂² / σ₂²)

with n₁ - 1 and n₂ - 1 degrees of freedom.

If two samples (which can differ in size) are taken from the same normal population, then the ratio of their variances will be approximately 1. Conversely, if the samples are not from the same normal population, the ratio of their variances will not be 1 (the ratio of the variances is termed the F ratio). The F tables can be used to assign probabilities that the sample variances were or were not from the same normal distribution. The importance of this for the Analysis of Variance will be seen later.

Hypothesis testing

Simple hypothesis testing

Many applications in Language Engineering require the testing of hypotheses. An example from the scenario given earlier was testing whether there were differences between read and spontaneous speech with respect to selected statistics. If the statistic is mean tone unit duration in the two conditions under which speech was recorded, we have a situation calling for simple hypothesis testing, so called since it involves a parameter of a single population.

Following the approach adopted so far, the concepts involved in such testing will be illustrated for this selected example. The first step is to make alternative assertions about what the likely outcome of an analysis might be. One assertion is that the analysis might provide no evidence of a difference between the two conditions. This case is referred to as the null hypothesis (conventionally abbreviated as H0) and in verbal form for the current case asserts that the mean tone unit duration in the read speech is the same as that in the spontaneous speech.

Other assertions might be made about this situation. These are referred to as alternative hypotheses. One alternative hypothesis would be that the tone unit duration of the read speech will be less than that of the spontaneous speech. A second would be the converse, i.e. the tone unit duration of the spontaneous speech will be less than that of the read speech. The decision about which of these alternative hypotheses to propose will depend on factors that lead the Language Engineer to expect differences in one direction or the other. These instances are referred to as one-tailed (one-directional) hypotheses, as each predicts a specific way in which there will be a difference between read and spontaneous speech. If the Language Engineer wants to test for a difference but has no theoretical or empirical reasons for predicting the direction of the difference, then the hypothesis is said to be two-tailed. Here, large differences between the means of the read and spontaneous speech, no matter which direction they go in, might constitute evidence in favour of the alternative hypothesis.

The distinction between one- and two-tailed tests is an important one as it affects what difference between means is needed to assert a significant difference (i.e., reject the null hypothesis in favour of the alternative). In the case of a one-tailed test, smaller differences between means are needed than in the case of two-tailed tests. (Basically, this comes down to how the tables are used in the final step of assessing significance (see below). There are no fixed conventions for the format of tables for the different tests, so there is no point in illustrating how to use them. The tables usually contain guidance as to how they should be used to assess significance.)

Hypothesis testing involves asserting what level of support can be given, on the one hand, to the null hypothesis and, on the other, to the alternative hypothesis. Clearly, no difference between the means of the read and spontaneous speech would indicate that the null hypothesis is supported for this sample. A big difference between the means would seem to indicate a statistical difference between the samples, provided the direction in which the means differ is the direction hypothesised (for a one-tailed hypothesis) or a two-tailed test has been formulated. How a decision is made about whether a particular level of support (a probability) has been reached is described next.

In the read-spontaneous example that we have been working through, we are interested in testing for a difference between means for two samples where, it is assumed, the samples are from the same speakers. The latter point means that a related groups test, as opposed to an independent groups test, is required (see Figure 1). In this case, the t statistic is computed from:

   t = (difference between the sample means) / (standard error of the difference)

Thus, if the read speech for 15 speakers had a mean tone unit duration of 40.2 centiseconds and the spontaneous speech 36.4 centiseconds, and the standard error of the difference between the means is 2.67, the t value is 3.8 / 2.67 = 1.42. The t value is then used to establish whether two sample means lying this far apart might have come from the same (null hypothesis) or different (alternative hypothesis) distributions. This is done by consulting tables of the t statistic using n - 1 degrees of freedom (here n refers to the number of pairs of observations).

In assessing the level of support for the alternative hypothesis, decision rules are formulated. Basically, a threshold is stipulated: if the probability of the means lying this far apart, assuming the samples are from the same distribution, falls below it, the more likely alternative is that the samples are drawn from different populations. The stipulation is done in terms of conventional probability levels: if there is a less than 5% chance that the samples were from the same distribution, then the hypothesis that the samples were drawn from different distributions is supported (i.e. the alternative hypothesis is accepted at that level of significance). Conversely, if there is a greater than 5% chance that the samples are from the same distribution, the null hypothesis is supported. In the worked example, with 14 degrees of freedom, a t value of 1.42 does not support the hypothesis that the samples are drawn from different populations, thus the null hypothesis is accepted.
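The worked example can be reproduced from the reported summary statistics; the critical value 2.145 used below is the standard two-tailed 5% entry for 14 degrees of freedom from t tables:

```python
# Summary statistics reported in the text (per-speaker durations are not given)
mean_read, mean_spont = 40.2, 36.4    # mean tone unit durations, centiseconds
se_diff = 2.67                        # standard error of the mean difference
n_pairs = 15                          # speakers providing both conditions

t = (mean_read - mean_spont) / se_diff
df = n_pairs - 1
t_critical = 2.145                    # two-tailed 5% value for 14 df, from t tables

print(round(t, 2))                    # 1.42
print(t < t_critical)                 # True: the null hypothesis is retained
```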

It should be noted that support or rejection of these hypotheses is statistical rather than absolute. If a 5% significance level is adopted, then on 1 occasion in 20 where the samples really are from the same distribution, a difference will nevertheless be asserted (referred to as a Type I error: rejecting the null hypothesis when it is in fact true). Conversely, accepting the null hypothesis when it is false is referred to as a Type II error; the probability of this error is not fixed by the significance level but depends on the true difference and the sample size.

Analysis of variance

This chapter of the Handbook is not going to cover all statistical tests that might be encountered, only offer a background and point to relevant material. However, some comments on Analysis of Variance (ANOVA) are called for as it is a technique that has a widespread use in Language Engineering.

ANOVA is a statistical method for detecting factors that produce variability in responses or observations. The approach is to control a factor by specifying different values, or treatment levels, for it, to see if there is an effect. Each treatment level can be thought of as sampling a potentially different population (different in the sense of having a different mean). Factors that have an effect change the variation in sample means, where ``factor'' means a controlled independent variable and ``treatment level'' means a controlled value of a factor.

To give an idea of the ANOVA approach, two estimates of variance are obtained. The first is based on the variation of the means at each treatment level about the overall mean (referred to as the between groups variance). The second is based on the variation of the scores at each treatment level about the mean for that level (referred to as the within groups variance). If the treatment factor has had no effect, then the between and within groups variability should both be estimates of the population variance. So, as discussed earlier when the ratio of two sample variances from the same population was considered, if the F ratio of between groups to within groups variance is taken, the value should be about 1 (in this case, the null hypothesis is supported). Statistical tables of the F distribution can be consulted to ascertain whether the F ratio is large enough to support the hypothesis that the treatment factor has had an effect, producing a between groups variance larger than the within groups variance (the alternative hypothesis is supported). Another way of looking at this is that the between groups variance is affected by individual variation of the units tested plus the treatment effect, whereas the within groups estimate is affected only by individual variation of the units tested.
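A one-way ANOVA F ratio can be computed from first principles as a sketch; the three groups below are hypothetical scores at three treatment levels, not data from the text:

```python
from statistics import mean

def one_way_anova(groups):
    """Return the F ratio: between groups mean square / within groups mean square."""
    grand = mean(x for g in groups for x in g)
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    # Variation of group means about the grand mean (between groups)
    ss_between = sum(len(g) * (mean(g) - grand) ** 2 for g in groups)
    # Variation of scores about their own group mean (within groups)
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)
    ms_between = ss_between / (k - 1)
    ms_within = ss_within / (n_total - k)
    return ms_between / ms_within

groups = [[1, 2, 3], [2, 3, 4], [3, 4, 5]]   # hypothetical treatment-level scores
print(one_way_anova(groups))                  # 3.0
```

An F ratio near 1 would support the null hypothesis; whether 3.0 is large enough to reject it is read from F tables with 2 and 6 degrees of freedom.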

ANOVA is a powerful tool as it has been developed to examine treatment effects involving several factors. Some examples of its scope are that it can be used with two or more factors, and factors that are associated with independent and related groups can be tested in the same analysis. When more than one factor is involved in an analysis, dependences between factors (interactions) come into play and have major implications for the interpretation of experiments. A good reference covering many of the situations where ANOVA is used is Winer (1971). Though statistical texts cover how the calculations are performed manually, these days the analyses are almost always done with computer packages. These are easy to use provided the ANOVA terminology and the interpretation of the output are understood. Indeed, the statistical manuals for these programmes (Minitab, SPSS and BMDP) are important sources about how to conduct analyses and should be consulted.

Non-parametric tests

Parametric tests cannot be used when sample sizes are small or when ranks rather than continuous measures are used, since the central limit theorem does not guarantee an approximation to the normal distribution in these instances. In these cases non-parametric (also known as distribution-free) tests have to be used. The computations involved in these tests are straightforward and covered in any elementary text book (Siegel 1956). A reader who has followed the material presented thus far should find it easy to apply the previous ideas to these tests. To help the reader access the particular test needed in selected circumstances (parametric and non-parametric), a tree diagram for the different decisions it is necessary to make is given in Figure 1.

A number of representative questions a language engineer might want to answer were considered at the start of this section. Let us go back over these and consider which ones we are now equipped to answer. First there was how to check whether there are differences between spontaneous and read speech. If the measures are parametric (as would be the case for many acoustic variables), then either an independent or related t test would be appropriate to test for differences: an independent t test when the samples of spontaneous speech are drawn from different speakers to the read speech, and a related t test when the samples are drawn from the same speakers. If the measures call for a non-parametric test (e.g. ratings of clarity for the spontaneous and read speech), a Wilcoxon test would be used when the read and spontaneous versions of the speech are drawn from the same speakers and a Mann-Whitney U test otherwise.
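The four-way choice just described can be sketched as a small helper function (the function name and its boolean arguments are illustrative, mirroring the decision tree of Figure 1):

```python
def choose_test(parametric, related):
    """Choose a two-sample test: parametric vs non-parametric, related vs independent."""
    if parametric:
        return "related t test" if related else "independent t test"
    return "Wilcoxon test" if related else "Mann-Whitney U test"

# Ratings of clarity (non-parametric) from the same speakers:
print(choose_test(parametric=False, related=True))    # Wilcoxon test
# Acoustic measures (parametric) from different speakers:
print(choose_test(parametric=True, related=False))    # independent t test
```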

If the investigators find differences between read and spontaneous speech that require them to use the latter for training data (see the application described earlier), how can they check whether language statistics on their sample of recordings are representative of the language as a whole --- or, what might or might not be the same thing, how can they be sure that they have sampled sufficient speech? The background information provided on estimating how close sample estimates are to population estimates is appropriate for this.

The next questions in our list given in the introduction lead on to the second major theme which we want to cover: How to set up a well-controlled experiment. The particular experiments that will be considered concern assessing the procedures for segmenting and labelling the speech for training and testing the ANNs though the lessons concerning good experimental design, etc. will apply to many more situations. Once we have got an idea what the experimental data would look like, we can consider how to treat the data statistically (which involves hypothesis testing again).

Experimental procedures

The principles of good experimental design will be illustrated by considering what procedures are appropriate for providing phonemically-labelled data which is to be used for training and, subsequently, testing a recognition system. Aspects of the procedures and considerations that feature here apply to a wide variety of experimental situations encountered in connection with Language Engineering, so the information provided has some generality. Finally, it ought to be noted that the performance of a recogniser (dealt with in the following section) can only be as good as the data it is trained on. Also, assessment results for recognisers will be misleading if the data used for training and testing are in error or if sampling biases are inadvertently introduced by the adoption of inappropriate sample selection. Thus, there is a close relationship between the topics discussed here and those in the following section.

Experimental selection of material to employ

The issue of what material to employ applies to speech synthesis as well as recognition (the latter being the focus here). Before concentrating on some issues relevant to recognition, a brief comment is necessary about selecting samples on which to base synthesis, as these have different requirements from those of recognition: the issues concern how to select the speaker (should it be male or female, should clear speech be required, etc.).

If an ANN-based recogniser is to be trained and tested, different requirements are imposed: A statistically representative sample of exemplars is needed in this case. The checks for the representativeness of the sample will depend on what speech units the implementation is to be based on (e.g. are phones, demisyllables, etc. to be recognised?).

Once this decision has been made, the next question is how to elicit a representative sample of the selected speech units. Some of the materials currently employed, though motivated for positive reasons, do not fulfil the need for representativeness. Phoneticians use the term phonetically balanced to highlight the controls they consider necessary. A dictionary definition of ``balanced'' would convey the idea of `well arranged for representativeness', and this has the same sorts of nuances that are required of a phonetically balanced passage. However, the representativeness will depend on what units the system is based on: for instance, the phonetically balanced text Arthur the rat is too short to include a representative sample of all the 900 or so demi-syllables that occur in samples of spontaneous speech. Thus, this text does not offer representative examples of all demi-syllables.

The lack of representativeness is more acute for other materials that have been used for training Hidden Markov Model recognisers: the SPAR sentences were developed to obtain an instance of each phoneme of English in a small amount of material. However, inspection of some of the sentences shows that they look like tongue-twisters:

A speaker is likely to experience difficulty on phonemes in tongue-twister material that he would not encounter when these same phonemes occur in other sentences. Moreover, the difficulty encountered is more acute for certain classes of phonemes (consonants, and most of all plosives) than others (the vowel sounds). A person who still wants to use this material might reply that it is conceivable that these sentences could have been uttered, which is true. However, the discussion of sampling (above) shows that they cannot be considered a simple random sample. If it is necessary to use the phonemes in them as instances for training particular phone models, the variables measured on them should be checked statistically against other groups of sentences that also contain these phonemes. This analysis would establish whether there are differences between, say, their acoustic properties in the wider samples and in these compressed versions. To my knowledge, these tests have not been conducted (the point made about tongue-twisters suggests that they would reveal differences).

The issue of inappropriate control material can lead to misleading conclusions about the performance of a recogniser. This constitutes a major topic of investigation: If the recogniser is learning about phones that are produced atypically, when it fails to recognise a ``typical'' phone, has it made an error or not? For now, it is assumed that selection of appropriate speech samples has been made and checked using the statistical procedures outlined above. The next issue is how to label the data set for use for recognisers. There are two parts to this: to indicate where the segments start and finish, and what the identity of each segment is. We will deal with segmentation and classification in turn as illustrative examples of experimental design and statistical analysis. Note that the issue of comparing two sets of segmentations or classifications is identical, whether the two classifications are from two humans or from a human and a speech recogniser. Therefore, the procedures outlined are relevant to marking corpora as well as assessing recognisers.

Segmentation

Making segmentation judgments separate from classification judgements

One basic precept when conducting experiments is that the subject should have a clear idea about what decision is being made. When segmenting and labelling are conducted together, a situation is encountered where the decisions are mixed up (or confounded). This makes it difficult during analysis to disambiguate whether, when an error occurs, the error is associated with one decision (say segmentation) or the other (classification). This problem is inherent to Signal Detection measures (discussed in the following section). To avoid this (Recommendation), judges should be required to do segmentation and categorisation as separate tasks. Segmentation should be conducted first, and the accuracy of these decisions reported under circumstances where the judge is not able to classify the sound. The segments can then be excised and played for categorisation, and the accuracy of categorisation reported.

In the absence of any data reporting thorough analyses of segmentation accuracy that are not confounded with classification accuracy, some suggestions as to how this can be done are outlined below. The guidance concerns:

How to make segmentations without confounding effects of classification

In experiments, like has to be compared with like. Thus, if an automatic segmentation algorithm is available that works on acoustic patterns, to check the accuracy of the algorithm human judges should be required to do the same thing as the algorithm. To ensure this, when performing segmentations, it is advisable that the humans should not hear the speech, so that classification does not influence segmentation positioning. (As an aside, it remains an open question whether having both auditory and visual modalities available actually makes judgments of segment boundaries more reliable.) Oscillographic or spectrographic representations presented visually are closer to the information supplied to a machine for a segmentation decision. Therefore, this way of presenting material for segmentation by humans avoids confounding segmentation judgments with classification decisions and allows comparison between human and machine performance. It has been demonstrated that Victor Zue (a highly experienced human judge) can make what is likely to be a more difficult decision (classification) based on this information. Consequently, inexperienced judges should ideally be used, and a check made that they do not have this skill, to ensure that a segmentation, not a classification, decision is being made.

Suggested guidelines concerning comparative analysis between human judges or between human judges and automatic algorithms

I am not aware of any work on how the accuracy of segmentations by two human judges compares with that of human versus automatic segmentation. A study conducted by my own research group suggests that differences in segmentation might have marked effects: different segment boundaries (labelled by different judges) resulted in substantially different accuracy in an ANN recogniser, indicating that disagreements about boundary placement have significant effects on recogniser classifications.

To perform a comparative test between human judges' performance and automatic algorithms ostensibly doing the same thing, it is recommended that segmentation performance be obtained from at least two judges as well as from the algorithm. Parameters that might be measured are the mean difference in boundary location between humans, and between humans and the algorithm, or the correspondence between boundary placements. The data could then be analysed with ANOVA, where the null hypothesis would be that there is no difference between inter-human differences and human-algorithm differences with respect to these parameters.
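The comparison just described can be sketched as follows. The boundary times are invented for illustration (in practice they would come from two judges and the algorithm segmenting the same utterances), and a pure-Python one-way ANOVA is used so the sketch is self-contained; any standard statistics package would serve equally well.

```python
def one_way_anova_F(groups):
    """F statistic for a one-way ANOVA: between-group MS / within-group MS."""
    k = len(groups)
    n = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n
    means = [sum(g) / len(g) for g in groups]
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    ss_within = sum(sum((x - m) ** 2 for x in g) for g, m in zip(groups, means))
    return (ss_between / (k - 1)) / (ss_within / (n - k))

# Invented boundary locations (ms) for the same five boundaries:
judge1 = [102.0, 250.0, 410.0, 598.0, 701.0]
judge2 = [98.0, 255.0, 405.0, 600.0, 710.0]
algo = [110.0, 240.0, 430.0, 590.0, 690.0]

# The two groups entering the ANOVA: per-boundary absolute differences
# between the humans, and between a human and the algorithm.
human_human = [abs(x - y) for x, y in zip(judge1, judge2)]
human_algo = [abs(x - y) for x, y in zip(judge1, algo)]

F = one_way_anova_F([human_human, human_algo])
```

The resulting F would be compared against the critical value of the F distribution with (1, 8) degrees of freedom; a significant result would indicate that human-algorithm discrepancies differ from ordinary inter-human disagreement.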

Note that it is not a foregone conclusion that segmentation will be comparable between humans nor between humans and the algorithms. If, for instance, laryngeal vibration is used as a basis for segmentation, the point where it starts is often not clear cut.

Recommendation about what judges to use

It has already been mentioned that the judges should not be experienced, lest they be able to classify segments based on visually presented information. A minimum recommendation would be to check this with phonetically naive judges. Also, the judges who do the segmentations should be different from those who do the classifications, so that there is no carry-over of segmentation decisions to classification decisions.

Sample size for assessment of performance

The factors to take into account when selecting a sub-sample on which to perform the segmentation assessment are whether to choose sections from all speakers (in case particular judges or speakers show specific difficulties) or whether to do complete assessments on selected judges and speakers. Another factor to consider (which can be addressed with the statistical techniques outlined earlier) is what length of sample to take: the sample ought to be at least long enough to contain examples of all phones of interest, if phones are going to be used in the recogniser. This will ensure that cases where speakers have specific difficulties producing certain phones, or judges have difficulties locating them, can be identified.
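The coverage requirement in the last point can be checked mechanically before the assessment begins. The phone inventory and sample labels below are invented for illustration:

```python
# Hypothetical inventory of phones the recogniser will use.
PHONES_OF_INTEREST = {"p", "t", "k", "i", "I", "S", "tS"}

def missing_phones(sample_labels, inventory=PHONES_OF_INTEREST):
    """Return the phones of interest that have no token in the sample."""
    return inventory - set(sample_labels)

# A candidate sub-sample's phone labels; /I/ never occurs, so the sample
# is too short (or unrepresentative) for the assessment.
sample = ["p", "i", "t", "S", "k", "i", "tS", "p"]
```

If `missing_phones` returns a non-empty set, the sample should be lengthened or re-drawn until every phone of interest is represented.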

Classification

The main issue to be covered is the extent to which ``judges'' agree about the category with which an event is to be labelled. The categories can be phonemes, lexical items, prosodic units such as tone units, etc.

Judges

The problem that experienced judges might be able to read spectrograms has been mentioned in connection with choosing judges to perform the segmentation task. Another factor which commends using inexperienced judges for classification is as follows: Working through the file left to right might induce contextual effects: If the judge is a phonetician, he might well be influenced in boundary placement by the sound just heard. For example this sort of expert judge knows the effects of plosives on duration of the following vowel or those of pre-pausal lengthening and, consequently, this might influence his categorisation of events in a way atypical of the population of listeners from (in our ANN example) the EU country at large. (It is presumed that the recogniser is to be a model of representative listeners, not a model of listeners trained to hear things in ways that might be coloured by alternative theories.)

Procedural

Some of the biases and measurement considerations that procedures should take into account will now be outlined. The following topics will be covered:

  1. Limitations of category responses and ways of circumventing,
  2. Range effects.

Limitations of category responses and ways of circumventing

To consider this topic, a hypothetical example will be considered: phone categories are embedded in a speech context and the phones have to be labelled. It is necessary to consider what sorts of factors are going to affect labelling performance in order to develop good procedures. Some of the phoneme categories that occur are differentiated along a continuous acoustic measure, e.g. vowel duration signalling the distinction between /i/ and /I/, or frication duration signalling affricate versus fricative. The exemplars will range on a continuum from very clear cases of (say) /tS/ (short duration) to very clear cases of /S/ (long duration). In this experiment, one group of subjects might indicate where each sound heard would be placed on a duration continuum using a 5-point scale. The duration of each sample is characterised by the modal value given by this group of judges. An additional group of judges then categorises each sample as /tSa/ or /Sa/. The proportion of /tSa/ category judgments (ordinate) is plotted against duration rating (abscissa) for each judge. The principal difference between groups is in the position of the /tSa/ -- /Sa/ category boundary on the duration continuum (i.e. the estimated duration value where subjects gave 50% /tSa/ and 50% /Sa/ responses).
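The 50% category boundary described above can be estimated from each judge's response proportions by linear interpolation (a full psychometric-function fit, e.g. logistic, would be the more rigorous alternative). The response proportions below are invented for illustration:

```python
def boundary_50(ratings, p_tsha):
    """Interpolate the rating value at which p(/tSa/) crosses 0.5."""
    points = list(zip(ratings, p_tsha))
    for (r0, p0), (r1, p1) in zip(points, points[1:]):
        if p0 >= 0.5 >= p1:
            # Linear interpolation between the two straddling points.
            return r0 + (p0 - 0.5) * (r1 - r0) / (p0 - p1)
    raise ValueError("proportions never cross 0.5")

ratings = [1, 2, 3, 4, 5]  # 5-point duration scale: short ... long
judge_a = [1.0, 0.9, 0.6, 0.2, 0.0]  # proportion of /tSa/ responses
judge_b = [1.0, 0.8, 0.4, 0.1, 0.0]

# Judge A's boundary lies higher on the duration scale than Judge B's:
# A requires a longer duration before switching to /Sa/.
```

Comparing the two estimated boundaries directly shows whether the judges differ in criterion, which is exactly the distinction between the two alternatives discussed next.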

Intervals that are assigned to different categories (/tSa/ or /Sa/) by different judges, leading to disagreement, may arise in two ways: (alternative 1) the judges use a common category boundary but place the interval at different points on the duration continuum, or (alternative 2) the judges place the interval at the same point on the duration continuum but employ different category boundaries. The first type of difference between judges is a real disagreement and has a different status from the second type. The investigator might want to decide whether or not to include such instances in training.

To illustrate diagrammatically the situation where judges (J1 and J2) employ a common category boundary with respect to the duration continuum, intervals that judges disagree about would be depicted as follows:

 
Figure:   Judgement on category boundaries

To differentiate these alternatives, it is necessary to obtain a measure of where items that judges ``disagree'' about lie on the continuum. The general recommendation that follows is that ratings of the physical dimension of each event being judged as well as the binary categorisation of that event should be collected in the procedures.

An additional advantage offered by collecting ratings is an improved basis of comparison between judges for establishing agreement. The measures that need developing concern how to take account of the information provided by the ratings, and the (related) information about where judges place their boundaries, in order to obtain an aggregate agreement measure about categories. Also, procedures for categorising synthetic phonemes have been developed in which the endpoints of physical continua (clear exemplars of each category) are used for training and generalisation to more ambiguous instances is then assessed (Damper); these show good comparability with human judgements. To apply these procedures to real speech, an indication of where the phones lie on the physical dimensions along which the sound is judged needs to be obtained. The ratings provide this information.

Range effects

The incidence of category counts can be biased by different acoustic properties in the surrounding recording (referred to as the context). To illustrate, the example of /tSa/s and /Sa/s embedded in speech, signalled by duration and needing labelling, can be taken, though the effects described extend to other acoustic parameters of speech. Speech variation will occur locally within an utterance and between utterances (affected by, for example, speech rate differences between speakers). Usually events are judged in context, and the rate of the context therefore depends on which speaker is selected. Judgments about the attribute value an event has are affected by the range of the attributes in the contextual material presented for judgment at the same time (the range effect; Parducci 1965). So, here, judgment about the temporal characteristics of an event to determine, for example, whether it is a /tSa/ or /Sa/ will be affected by the temporal properties of the rest of the material: a sound will have to be longer to be judged /Sa/ in a slow context than in a fast one. Thus, judgments will be biased by the contextual material. The changes in /tSa/ and /Sa/ counts are due to judges being biased by the context of the segment; they are not due to changes in the behaviour of the speaker. The net effect is a spurious decrease in /Sa/ counts when the rate is slow.
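The range effect can be illustrated with a toy model in which the /tSa/ -- /Sa/ boundary is taken to be proportional to the mean segment duration of the surrounding context. This boundary model and all durations are invented for illustration, not an empirical claim:

```python
def judge(duration_ms, context_durations, boundary_fraction=1.0):
    """Label a segment /Sa/ if it exceeds a context-relative duration boundary."""
    boundary = boundary_fraction * sum(context_durations) / len(context_durations)
    return "Sa" if duration_ms > boundary else "tSa"

fast_context = [60, 70, 80]     # fast speech: short surrounding durations
slow_context = [120, 140, 160]  # slow speech: long surrounding durations

# The same 100 ms frication is labelled /Sa/ against the fast context
# but /tSa/ against the slow one, although the speaker's behaviour
# is identical in both cases.
```

This is exactly the spurious count shift described above: the category tallies move with the context, not with the speaker.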


