Introduction

How to read this chapter

This chapter is about methodology for assessing the various components involved in language engineering. Among other things, it offers guidelines on how to ensure that you have sampled enough speakers to make claims about how well your results will generalise to a speaker population at large (where population refers to your target market and will vary from application to application), how to compare the performance of your recogniser or synthesiser with others on the market, how many speakers to include in benchmark tests of speaker verification systems in order to appraise performance, and so on. For these purposes, an understanding of how to analyse your data statistically is needed.
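
To give a concrete flavour of such a calculation, the sketch below uses the standard normal approximation for estimating a proportion (for instance, a recognition success rate) to within a given margin of error. The function name and default values are our own illustrative choices, not a prescribed procedure.

```python
import math

def sample_size(margin_of_error, z=1.96, proportion=0.5):
    """Number of speakers needed to estimate a population proportion
    (e.g. a recognition success rate) to within the given margin of
    error, using the normal approximation to the binomial.
    proportion=0.5 is the worst case and yields the largest n."""
    return math.ceil(z**2 * proportion * (1 - proportion)
                     / margin_of_error**2)

# 95% confidence (z = 1.96), margin of +/- 5 percentage points:
print(sample_size(0.05))   # 385 speakers
# Relaxing the margin to +/- 10 points cuts the requirement sharply:
print(sample_size(0.10))   # 97 speakers
```

Note how relaxing the precision criterion reduces the required sample size; this is exactly the trade-off between rigour and cost discussed below.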

At other times a user might need to test some very specific idea: for example, what is going on inside a recogniser; whether some gambit for mimicking other people's voices will allow impostors to break into a speaker verification device; which critical acoustic attributes govern the perceptibility of a message, with a view to improving the system; or how to set up experiments with dialogue systems to check whether they will work adequately for some purpose before committing design engineers to their implementation. Approaching this latter group of questions calls for an understanding of the steps involved in setting up and analysing experiments.

The information provided therefore covers general techniques from many diverse areas, both in terms of methods (statistics and experimentation) and applications (including the above examples and many more). This chapter cannot hope to be exhaustive in its coverage, nor to choose an example for assessment that is directly applicable to all needs. Though there will not be an example for every application encountered, the methodological tools provided should give some idea of how to approach many of the problems that will arise. The particular examples chosen for illustration have been raised in consultation with the authors of some of the other chapters.

Three further things need stressing:

  1. Although in some cases broad guidelines can be given (for example, in selecting how many speakers to record), a trade-off is often involved between what is ideally needed and cost (in terms of money and time). Though the procedures tell you, for instance, how many subjects are needed for a data corpus to afford a representative sample of a population according to some criterion, if the numbers needed are too large they can be cut down by relaxing the criterion. How much relaxation is permissible involves an element of experience in prioritising what is most important: if, for example, your speaker verification system is to be used for large business accounts, it would be important to ensure that the system has been tested against stringent criteria. Even in cases where cost is no object, statistical techniques cannot give any absolute guarantee that the system cannot be broken into (the sketch after this list makes this point concrete).

    Where large-scale sampling from a population is not possible because of prohibitive costs, but where it is necessary to report the performance of the system for infrequent events, experiments may provide an alternative approach. In the speaker verification example, for instance, impostors may try to break into the system. This would be costly to check in three senses:

    1. The cost of collecting examples of attempted break-ins, if they are infrequent.
    2. The cost of verifying that the system has been broken into.
    3. The possible financial costs of the break-ins themselves.

    Here, an alternative approach would be to run experiments with individuals likely to be able to mimic voices, such as actors or ventriloquists. In judging whether to approach system assessment from an experimental perspective or via statistical analysis of the system in use, again the investigator will need to draw on his or her experience. If an experimental approach is decided on, the skill and experience of the investigator will also be taxed in selecting what vocal disguises an impostor might adopt and in setting up appropriate tests for checking the vulnerability of the system.
  2. What will not be presented here are statistical analyses of, for example, the corpora described in other chapters. What is presented is some of the background that will allow users of the handbook to do such work for themselves.
  3. Statistical methods, experimental techniques and engineering products are all advancing at a rapid pace. However, statistical and experimental know-how has not featured to any great extent in engineering, and statisticians and experimentalists have usually not drawn on engineering examples or considered engineers' concerns. Thus many of the ``recommendations'' made here are a first stab at these issues. There are often many ways of achieving a particular goal, and the limited number of options that can be considered here gives only a narrow perspective. As these ideas are tried out, other, preferred alternatives will undoubtedly arise. Thus at least some of the recommendations are likely to be short-lived.
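
To make the last point in item 1 concrete: even if an impostor test produces no break-ins at all, the data only bound the break-in probability; they cannot show it to be zero. A minimal sketch, with illustrative function and variable names of our own:

```python
def upper_bound_zero_breakins(n_trials, confidence=0.95):
    """Upper confidence bound on the break-in probability when no
    break-ins occurred in n_trials independent impostor attempts.
    Solves (1 - p)**n_trials = 1 - confidence for p; at 95% this is
    the familiar 'rule of three', p_upper ~ 3 / n_trials."""
    return 1 - (1 - confidence) ** (1 / n_trials)

# 300 unsuccessful impostor attempts still leave ~1% plausible risk:
print(f"{upper_bound_zero_breakins(300):.4f}")   # ~0.0099
```

Even several hundred failure-free trials thus leave a residual risk of the order of one per cent, which is why no absolute guarantee can be given.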

Role of statistical analysis and experimentation in Language Engineering Standards

In talking about procedural considerations in language engineering, it will help to make things concrete. Let us assume that a client has commissioned the development of a speech recognition system (System A) from scratch, where expense is no object. It is to be employed in a European country where all inhabitants might want to use it. At the end of the day the client wants some idea of how its performance compares with another system on the market (System X). The company is given a free hand in developing the system and would prefer, for convenience, to develop it on the basis of read speech, though it will eventually have to operate with spontaneous speech. The team assigned to the project has decided to develop a system based on Artificial Neural Networks (ANNs). Some of the questions the team may decide to address (with no claim to exhaustiveness) are:

  1. How can they check whether there are differences between spontaneous and read speech, and then decide whether read speech can be used?
  2. If they find differences between read and spontaneous speech that require them to use the latter, how can they check whether the language statistics of a sample of recordings made to train and test the ANNs are representative of the language as a whole (a sketch after this list illustrates one such check)? Whether read or spontaneous speech is used, segments need labelling for training the networks, and judges need to be brought in for this purpose.
  3.  What procedures are appropriate for the tasks of labelling and classifying the segments?
  4. How can the accuracy of segment boundary placement and category classification by the judges be assessed (see the agreement sketch after this list)?
  5. How can improvement during the development stages be monitored? This usually involves correct recall of the training data by the ANNs. Here, segmentation and classification differences between judges (see the relevant sections below) might affect recogniser performance. The preceding tests are vital to ensure that the training data are good and that changes in recogniser performance reflect improvements in the architecture rather than artefacts of poor training data: an apparent improvement can be due to a genuine improvement in the system, or a judge might have made labelling errors that some change in the system now reproduces, making its output appear correct (i.e., the two errors cancel each other out). Without appropriate assessment of the judges' performance, the latter possibility can never be ruled out.
  6. How do you choose appropriate test data?
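
One way of approaching question 2, sketched below with entirely invented figures, is a chi-square goodness-of-fit test comparing the phoneme (or word) frequencies in the recorded sample against reference frequencies for the language as a whole. The categories, counts and reference proportions here are illustrative assumptions, not real data.

```python
from scipy.stats import chisquare

# Counts of five phoneme classes in a hypothetical 10,000-segment
# sample, and reference proportions for the language as a whole
# (all figures invented for illustration):
observed = [1210, 980, 850, 640, 6320]
reference = [0.12, 0.10, 0.09, 0.06, 0.63]
expected = [p * sum(observed) for p in reference]

stat, p_value = chisquare(observed, f_exp=expected)
# A small p-value suggests the sample's distribution deviates from
# the reference distribution by more than chance would allow.
print(f"chi2 = {stat:.2f}, p = {p_value:.3f}")
```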

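For question 4, one widely used chance-corrected measure of agreement between judges on category classification is Cohen's kappa; the sketch below computes it for two judges labelling the same segments, using made-up labels.

```python
from collections import Counter

def cohen_kappa(labels_a, labels_b):
    """Chance-corrected agreement between two judges who have each
    assigned a category label to the same sequence of segments."""
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / n**2
    return (observed - expected) / (1 - expected)

# Two judges labelling the same eight segments (illustrative data):
judge1 = ["a", "a", "e", "i", "a", "e", "i", "a"]
judge2 = ["a", "e", "e", "i", "a", "e", "a", "a"]
print(f"kappa = {cohen_kappa(judge1, judge2):.2f}")   # 0.60
```

Values near 1 indicate near-perfect agreement; values near 0 indicate agreement no better than chance. Boundary placement, being a continuous quantity, can instead be assessed via the distribution of time differences between judges' boundary marks.
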
The preceding highlights some of the statistical analyses and experimental procedures that need to feature in language engineering. Moreover, the specific questions raised, though pertaining to a particular issue of concern, are illustrative of many similar problems that language engineers encounter. We will now set about providing answers to these (and other) questions. The reader should be able to employ the materials to answer related questions.

The remainder of the chapter is organised into main sections covering: statistical and experimental techniques to ensure that the corpora employed for training and testing are representative; the assessment of speech recognition; speaker verification; and dialogue systems. The sections on statistical analysis and experimentation provide the grounding for everything that follows, so they should be read by everyone. The later sections are specifically focussed on the hypothetical scenario outlined above. A final warning: though the organisation of the material into these sections is convenient, the sub-division is to some extent artificial. The relationship between setting up corpora and testing recognisers is a case of the proverbial chicken and egg: apparently poor performance of a recogniser can be due to training and testing on a poor corpus. In turn, speaker verification and dialogue systems depend to an extent on speech recognition.


