This chapter is about methodology for assessing the various components involved in language engineering. Among the things it should give you are: guidelines on how to ensure that you have sampled enough speakers to be able to claim that your results will generalise to the speaker population at large (where the population refers to your target market and will vary from application to application); ways of comparing the performance of your recogniser or synthesiser with others on the market; advice on how many speakers to include in benchmark tests of speaker verification systems in order to appraise performance; and so on. For these purposes, an understanding of how to analyse your data statistically is needed.
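By way of illustration, the sampling question above can be sketched with a standard normal-approximation confidence interval for a proportion. This is a minimal sketch, not a procedure prescribed in this chapter; the function names and all figures are invented for the example.

```python
import math

def proportion_ci(successes, n, z=1.96):
    """Normal-approximation confidence interval for a proportion.

    successes: number of correctly recognised items
    n: number of items (or speakers) sampled
    z: critical value (1.96 gives an approximate 95% interval)
    """
    p = successes / n
    se = math.sqrt(p * (1 - p) / n)
    return p - z * se, p + z * se

def sample_size_for_margin(p_guess, margin, z=1.96):
    """Smallest n for which the interval half-width is at most `margin`,
    given a rough prior guess at the proportion."""
    return math.ceil((z ** 2) * p_guess * (1 - p_guess) / margin ** 2)

# Invented figures: 92% accuracy observed over 200 sampled speakers.
lo, hi = proportion_ci(184, 200)

# Sample size needed to pin accuracy down to within two percentage
# points, assuming accuracy is somewhere near 90%.
n_needed = sample_size_for_margin(0.9, 0.02)
```

Calculations of this kind are what licenses a claim that performance measured on a sample will generalise to the target population.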
At other times a user might need to test some very specific idea: for example, what is going on with his recogniser; whether some gambit for mimicking other people's voices will allow impostors to break into a speaker verification device; what the critical acoustic attributes are that govern the perceptibility of a message, so that the system can be improved; or how to set up experiments with dialogue systems to check that they will work adequately for some purpose before committing design engineers to their implementation. Approaching this latter group of questions calls for an understanding of the steps involved in setting up and analysing experiments.
The information provided will therefore cover general techniques from many diverse areas, both in terms of techniques (statistics and experimentation) and applications (including the above examples and many more). This chapter cannot hope to be exhaustive in its coverage, nor can it choose an example for assessment that is directly applicable to all needs. Though there will not be an example for every application encountered, the methodological tools provided should offer some idea of how to approach many of the problems that will arise. The particular examples chosen for illustration have been raised in consultation with the authors of some of the other chapters.
Three further things need stressing:
Where large-scale sampling from a population is not possible because of prohibitive cost, but where it is necessary to report the performance of the system for infrequent events, experiments may provide an alternative approach. In the speaker verification example, for instance, impostors may try to break into the speaker verification system. This would be costly to check in three senses:
In talking about procedural considerations in language engineering, it will help to make things concrete. Let us assume that a client has commissioned the development of a speech recognition system (System A) from scratch, where expense is no object (sic). It is to be employed in a European country where any inhabitant might want to use it. At the end of the day the client wants some idea of how its performance compares with another system on the market (System X). The company is given a free hand in developing the system and would prefer, for convenience, to develop it on the basis of read speech, though, as noted, it will eventually have to operate with spontaneous speech. The team assigned to the project decides to develop a system based on Artificial Neural Networks (ANNs). Some of the questions the team commissioned to do the work may decide to address (no claims for exhaustiveness) are:
The preceding highlights some of the statistical analyses and experimental procedures that need to feature in language engineering. Moreover, the specific questions raised, though appertaining to a particular issue of concern, are illustrative of many similar problems that language engineers encounter. We now set about attempting to provide answers to these (and other) questions. The reader should be able to employ the material to answer related questions.
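One recurring question of this kind is whether System A really outperforms System X, or whether an observed difference could be due to chance. A common way to test this when both systems are scored on the same test items is McNemar's test on the paired outcomes. The sketch below is illustrative only; the counts are invented and the system names simply follow the hypothetical scenario above.

```python
def mcnemar_chi2(b, c):
    """Continuity-corrected McNemar statistic for paired binary outcomes.

    b: items System A got right but System X got wrong
    c: items System X got right but System A got wrong
    Returns a chi-squared statistic with 1 degree of freedom;
    values above 3.84 indicate a difference significant at the 5% level.
    """
    return (abs(b - c) - 1) ** 2 / (b + c)

# Invented counts from a shared test set: the items both systems got
# right or both got wrong do not enter the statistic.
chi2 = mcnemar_chi2(b=40, c=22)
significant = chi2 > 3.84
```

The point of the pairing is that it controls for item difficulty: each utterance serves as its own baseline, so only the discordant items carry evidence about which system is better.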
The remainder of the chapter is organised in four main sections: statistical and experimental techniques to ensure that the corpora employed for training and testing are representative; assessing speech recognition; speaker verification; and dialogue systems. The material introducing statistical analysis and experimentation should be read by everyone. The sections on recognition, verification and dialogue are specifically focussed on the hypothetical scenario outlined above. A final warning: though the organisation of material into these sections is convenient, the sub-division is to some extent artificial. The relationship between setting up corpora and testing recognisers is a case of the proverbial chicken and egg: apparently poor performance of a recogniser can be due to training and testing on a poor corpus. In turn, speaker verification and dialogue systems depend to an extent on speech recognition.