Next: Appendix 1 Up: Assessment methodologies and Previous: Assessing speaker verification

Interactive Dialogue systems

Dialogue systems include aspects of speech recognition and synthesis. Therefore, once again, much of what has been said before is relevant to this topic. Similarly, decisions about trial sizes can be made on the basis of the statistical information provided previously.

WOZ

A desription of WOZ has been included in another chapter. Here it is intended to ask what this technique is useful for. The simple answer is setting up simulation of a dialogue system which will allow testing implications of a required system without committing the investigator to its actual implementation. Itself it is an experimental procedure. The advantage is its flexibility in comparison with a computational implementation. WOZ design brings the engineer and ergonomist together to realise a complex task. It should be borne in mind that WOZ is a means to an end not an end. Ultimately, if WOZ has done the job expected of it, it will be discarded and a system implemented. The expectations of the two parties to the bargain needs to be clearly specified at the outset.

The engineer expects the WOZ simulation to provide answers as to what the likely structure of the dialogue system should eventually look like. The engineers should be expected to provide certain information necessary to get the ergonomist started, in particular, a language model for the dialogue system and on-going advice about the advisability from an implementational point of view about changes the ergonomist is proposing to effect. A potential problem at the outset is if this initial specification of language is seen by the engineer as the core of the work. After initial experimentation the ergonomist will need to go back and request ways in which this and other aspects of the simulation be altered by the Language Engineer.

It is necessary to realise that for WOZ to be useful it has to be cheaper. It should also be more efficient due to the flexibility it offers. The engineer should make the group aware what time is allowed for development and testing, what call they can have on the engineer's time/expertise, etc. There seems to be a feeling that the engineer often presents problems that they know when they see a proposed answer whether it is what they want or not. This ``searching in the dark'' of the invesigators is an unsatisfactory situation. Obviously, it is not the role here to arbitrate which is more in the realms of politics.

The actual aim will be to attempt to come up with some experimental procedures that will allow the engineer and ergonomist to achieve their ultimate goals. The major practical requirement of WOZ is that it provides answers or proposals about what the ultimate system should be like. The major methodological question is somewhat different --- how good is the simulation. Statistical and experimental procedures will be focussed on that. The comments will be organised in the same order as in the dialogue chapter (principally sub-sections 2 and 3, not the first stage nor iteration which involves the same procedures as those considered for sections 2 and 3.

Audio-alone simulations are described, followed by some brief comments on current developments of multi-modal systems.

Audio-only simulations

Requirements

The Language engineer should provide a description of the user dialogue which supports activities for the proposed task. A representation such as a state transition diagram, would be appropriate. A second requirement is a performance specification, i.e. a description of exactly what the device is supposed to do.

Once provided with these basics, the ergonomist needs to establish what sort of factors might limit subjects in their ability to keep up the pretext that the dialogue is with a machine and, correspondingly, what sorts of factors in the task (the job of the wizard) are likely to facilitate or prevent this pretext.

Subject variables

Some precautions which need to be taken are listed. First, take care to ensure that the observations are made under conditions which will elicit representative performance: the vocabulary, device users and operating environment should be as similar as possible to the device being simulated. This includes making errors similar to those which might occur in the actual device.

A typical set of instructions to the people trying to create the simulation might be as follows: You are required to transcribe all utterances of the user subject onto the computer using the keyboard. Speak aloud what is displayed by the computer, and your utterances will be transmitted back to a subject. Be careful to read out what appears on the screen, and not just repeat what the subject said. Although your speech will be distorted to make it sound like synthesised speech, try to minimise the inflections in your voice and to speak as consistently as possible, in order to enhance the ``mechanical'' effect.

Wizard variables

The requirements associated with the wizard are really those of a good experimental procedure (section ). The output should be consistent in content, style and pace. Two examples are

A given command should trigger the same response from the wizard,
The response time must comply to the subject's expectations.

Since the job of being a wizard is not easy, wizards may need to be trained to produce predefined replies or menus, etc.

Factors likely to determine variability in wizards are: the level of skill exhibited by the system subject which will include fatigue and individual differences in aptitude. The first factor may be controlled by recruiting wizards who are likely candidates for developing these skills, and by training. Individual differences will be eliminated if only one wizard is employed; however, this advantage may be reduced if the study is large in scope (see section on multimodal simulations).

Since cognitive load is high, two wizard configurations are used in recent studies --- one performs the I/O (receives questions and generates the answers), the second performs ``task level processing'' (generates the answers to be formulated by I/O wizard). It is considered that the two-wizard setup is more likely to achieve consistency and not increase response time though these need verifying experimentally.

A final important recommendation is that there should be a permanent record of performance. To this end, questions and answers should be tape-recorded.

Multimodal

Future interactive systems may require input from more than one modality: Examples would be speech input to generate visual text or voice operated drawing programmes. When WOZ techniques are employed for these applications, the extra factors that need to be considered are:

: Task complexity: The more modalities, the more functions need to be simulated.
: Information bandwidth: There are many ways of providing input. In addition, the input may be too high for single wizard so that his behaviour becomes inconsistent and multiple wizard configurations would be essential.
: Multiwizard configs: The need for multiwizard configurations results in issues about how to organise collaboration between them. Workload must ideally be spread equally. This is difficult since it relies on the subjects' behaviour and, thus, the roles of the wizards may need to change dynamically. For this reason another (supervisory) wizard may be needed.

Coutaz and associates have spent time developing recommendations for wizard collaboration. They structure a wizard's task in three steps:

Acquisition (analysis of message).
Interpretation what the subject's response means in connection with the task faced.
Formulation (the emission of an answer).

Dialogue metrics

The two major aspects of dialogue between humans and machines concern the content of the message and the sequence of conversational episodes between the participants. Techniques for analysis of the content of dialogue are being developed and a lot has been written about them. Space and time prevents a comprehensive treatment being given. The situation for turn-taking is more appropriate since less that is germane to Language Engineering has been written about them.

In dialogue communication between humans, conversation among speakers is characterised by turn taking: one participant, A, talks, stops; another, B, starts, talks, stops, and so we obtain an A-B-A-B-A-B distribution of talk across two participants. This transition from one speaker to another has been shown to be orderly, with precise timing and with less than 5% overlap ( =1 (

; ervin-tripp 1979) ). The mechanism responsible for regulating turn taking is also capable of operating in quite different circumstances: The number of parties may vary from two to 20 or more; persons may enter and exit the pool of participants; turns at speaking can vary from minimal utterances to many minutes of continuous talk; and if there are more than two parties then provision is made for all parties to speak without there being any specified order or `queue' of speakers. The same system seems to operate equally well both in face-to-face interaction and in the absence of visual monitoring, as on the telephone. Concern here will be with two communicators. Though these comments have been directed at inter-human dialogue communication, the same considerations apply to human-computer interaction. The question is what sort of metrics are appropriate to characterise the interaction between two communicating parties?

Psycholinguistic metrics

=1 (

; sacks schegloff 1974) and =1 (

; sacks schegloff 1978) suggest that the mechanism that governs turn-taking, and accounts for the properties noted, is a set of rules with ordered options which operates on a turn-by-turn basis, and can thus be termed a `local management system'. One way of looking at the rules is a sharing device operating over a scarce resource, namely control of the `floor'. Such an allocational system will require minimum units over which it will operate. These units are, in this model, determined by various features of linguistic surface structure: they are syntactic units (sentences, clauses, noun phrases, and so on) identified as turn-units in part by prosodic means.

It is important to see that, although the phenomenon of turn-taking is obvious, the suggested mechanisms operating it are not. For a start, things could be quite otherwise: For example, it is reported of the African people, the Burundi, that turn-taking (presumably in rather special settings) is pre-allocated by the rank of the participants, so that if A is of higher social status than B, and B than C, then the order in which the parties will talk is A-B-C ( =1 (

; albert 1972) ). Of course in English-speaking cultures too there are special non-conversational turn-taking systems operative in, for example, classrooms, courtrooms and other `institutional' settings, where turns are (at least in part) pre-allocated rather than determined on a turn-by-turn basis. Nevertheless, there is good reason to think that like may aspects of conversational organisation, the rules are valid for the most informal, ordinary kinds of talk across all the cultures of the world. There is even evidence of ethological roots for turn-taking and other related mechanisms, both from work on human neonates ( =1 (

; trevarthen 1974) and =1 (

; trevarthen 1979) , and research on primates ( =1 (

; hinoff) ).

Other psychologists working on conversation have suggested a quite different solution to how turn-taking works. According to this other view, turn-taking is regulated primarily by signals, and not by opportunity assignment rules at all. =1 (

; duncan 1972) , for example, describes three basic signals for the turn-taking mechanism:

Turn-yielding signals by the speaker,
Attempt-suppressing signals by the speaker, and
Back-channel signals by the auditor.

These signals are used and responded to in a relatively structured manner. On such a view a current speaker will signal when he intends to hand over the floor, and other participants may bid by recognised signals for rights to speak. One of the most plausible candidates for such signals is gaze: it seems roughly true, for example, that a speaker will break mutual gaze while speaking, returning gaze to the addressee upon turn completion ( =1 (

; kendon 1967) ; =1 (

; argyle 1973) ).

The problem here is that if such signals formed the basis of our turn-taking ability, there would be a clear prediction that in the absence of visual cues there would either be much more gap and overlap or that the absence would require compensation by special audible cues. However, work on telephone conversation shows that neither seems to be true --- for example, there is actually less gap and shorter overlap on the telephone ( =1 (

; butterworth hine 1977) ; =1 (

; ervin-tripp 1979) ) and there is no evidence of special prosodic or intonational patterns at turn-boundaries on the telephone although there is evidence that such cues are utilised both in the absence and presence of visual contact to indicate the boundaries for turn-constructional units ( =1 (

; Duncan Jr. Fiske 1977) ).

In any case it is not clear how Duncan's signal-based system could provide for the observed properties of turn-taking; for example, a system of intonational cues would not easily accomplish the observable lapses in conversation, or correctly predict the principled bases of overlaps where they occur, or account for how the particular next speaker is selected ( =1 (

; goodwin 1979) ). Therefore, the signalling view, plausible as it is, viewed as a complete account of turn-taking seems to be wrong: signals indicating the completion of turn-constructional units do indeed occur, but they are not the essential organisational basis for turn-taking in conversation.

When one speaker interrupts another, the two can be said to be disputing who has the turn. Interruptions can occur because one participant tries to dominate or disrupt the conversation, but it can also be the case that mistakes occur in the way these subtle turn-yielding signals are transmitted and received. One study has demonstrated (cf. =1 (

; beattie 1982) ) that many interruptions in an interview with Mrs Margaret Thatcher, the ex-British Prime Minister, occur at points where (individual judges) independent judges agree that her turn appears to have finished. Beattie goes on to suggest that she is unconsciously displaying turn-yielding cues at certain inappropriate points. The analysis of the results from this study suggests that while the speaker is actually giving a number of cures to the end of her turn, the one which she considers paramount may be different from the cue which her interlocutor considers paramount. If she considered that the most decisive cue to the end of her turn was letting her voice drop to around 140Hz instead of keeping it no lower than 160Hz, whereas her interlocutor considered that the most decisive cue was a rapidly executed final fall rather than a slow fall, their respective decisions as to whether or not she had finished her turn would differ in precisely those cases which were disputed in the present sample of utterances. The transcription of the turn-final and turn-medial utterances were separately inspected, and the prosodic and vocal quality features occurring in three or more of these utterances listed. Five `final' features were found: pitch downstep (rapid drop) before the main fall; double falling contour; allegro portion before the tonic; whispery voice; and creaky voice.

Acoustic-based measures

A disadvantage of dialogue-based metrics is that (like content analysis), they require time-consuming manual analyses. It would be better, therefore, if automatic, acoustic-based procedures could be developed. A potential problem for acoustic-based metrics of dialogue interaction is that often speakers are not acoustically isolated. This need not (and in some available recordings does not) apply over telephone connections and potentially, therefore, for many dialogue interaction systems. These allow, then, acoustic metrics of disruption to be developed which have the advantage that they are automatic.

Little work has been done on this topic (cf. =1 (

; howell 1990) ) Clear speech and turn-taking cues in telephone dialogue, Report to BT). Since prosodic factors are the main source of turn taking cues, acoustic metrics associated with these factors (amplitude, pitch and duration) have been measured.

The terms used to describe the various components of an interruption are summarised in the next figure.

Figure: Schematic illustration of terms used in connection with speaker interruption patterns

In this example, Speaker A is interrupted (unsuccessfully) by speaker B. The ordinate represents activity (which is happening when the speech is above the baseline).

Amplitude on each channel can be measured as described in CCITT Rec. P.56 (see also =1 (

; Carson 1984) ). ``Active speech level is measured by integrating a quantity proportional to instantaneous power over the aggregate of time during which the speech in question is present (called the active time), and then expressing the quotient, proportional to total energy divided by active time, in decibels relative to the appropriate reference.'' ( =1 (

; CCITT 1988 : 112 ) ). The method works by exponentially averaging the rectified signal values. Once these are obtained, a set of thresholds spaced 6 dB apart are applied to the envelope and each value of the envelope is compared against each of the thresholds in turn. A hangover function is defined for each of the thresholds and these allow a certain length of time for the speech to expire before it is judged inactive at the threshold level in question.

The true active level occurs when the threshold is such that the difference between the active level estimate and the threshold is equal to 15.9 dB. Where (as is usually the case) the true active level falls between two test threshold values, the true threshold is found by linear interpolation between these two. At the end of processing, for each speaker, the programme gives the threshold level calculated from pass 1, the percentage of the time that the speaker is active, the activity level, and the margin activity level-threshold value. This latter value is included as a check as it should be close to 15.9.

The new additions that need to be made to apply this to telephone dialogue are that once the threshold for active speech on each channel has been determined, these values are used to determine when the speakers are active. The data can then be passed through the same process as before, except this time, each processed signal value is only compared with the threshold that was set for its channel in the first pass.

The times when a channel becomes active and inactive can be recorded, and active levels calculated for the individual active periods. When one speaker interrupts another, the level for the already active speaker from the start of this stretch of speech up to this point is calculated. The level from the point of interruption to the point where one speaker stops is measured. The level of speech of the speaker who continues from the termination of simultaneous speech to the point where this speaker becomes inactive or the other speaker interrupts again (whichever is sooner) is also calculated. Finally, group statistics are calculated (number of occasions the various sequences occur, mean and standard deviation of level before interruption, for both speakers during interruption and for the speaker who continues and duration of the interrupted speech) for the various speaker-interruption combinations.

The occurrence of pauses is calculated by determining whether there is activity at the previous time step but not at the present. A variable is set to say who the last speaker was (both if they became inactive simultaneously), and for every sample where both channels are inactive, a pause counter is incremented. When activity begins again on either or both channels, if the current speaker (where A and B become active simultaneously, they are both the current speaker) matches the previous speaker (either where both became inactive simultaneously), the silence is recorded in that person's speech. A program can give the start time and duration of pauses made by each speaker, and start time, duration and average level for each period of activity. The definition of a ``pause'' requires further work and no analyses are reported.

Interruption parameters

All sections of speech where a speaker is interrupted can be located by the programme described in Appendix 2. These data can be partitioned with respect to which speaker was originally speaking and which speaker continued after the interruption. These are designated AA, AB, BB and BA where the first letter refers to the speaker speaking before the interruption and the second letter for the speaker speaking after the interruption. Two factors differentiate these data --- the initial speaker (factor 1) which differentiates the first and second from the third and fourth and whether the same or a different speaker continues (factor 2) which differentiates the first and third from the second and fourth.

Six measures are available for each of these combinations. These are:

Number of occurrences of each type.
Mean duration of the sections.
Level of the lead in for a speaker.
Level during interruption for that speaker.
Level during the interruption for the other speaker, and
Level of the continued speech.

The data are, necessarily incompletely crossed --- the sections before and after interruption mean that no level can be measured for the other speaker on these occasions whilst levels for both speakers can be measured during an interruption. This calls for a somewhat more complicated ANOVA model to determine what factors lead to differences in interruption patterns ( =1 (

; howell 1990) ) which illustrates some of the flexibility of this analysis technique mentioned previously.

Pitch analysis. Pitch movements (fall, rise, rise-fall, fall-rise and flat, designated ,--) cue turn-taking behaviour under different conditions. However, there appears to be no objective method for determining what category a particular pitch movement falls into. Moreover, it is pitch movements in nuclear syllables that should be of particular interest to us, despite there being no apparent economical way of determining the location of nuclear syllables. In the absence of a better method, what happens to the fundamental frequency in the 200ms (roughly the length of an average syllable) period of voiced utterance before the time specified by the classification routine (see previous section) can be ascertained? This can be done as follows: Starting at the specified offset time, fundamental frequency is estimated in 25ms windows, moving backwards 5ms at a step. This is done by first filtering the signal at 1000Hz, autocorrelating, and then examining autocorrelation peaks for periods between 3 and 12.4ms (c. 80--333 Hz). Taking the highest peak as the fundamental frequency estimate often leads to wildly fluctuating estimates, presumably because more energy was often present in the first harmonic than in the fundamental. Thus we would see successive windows returning fundamental frequency estimates of 160Hz, 320Hz, 320Hz, 160Hz, 160Hz, etc. To iron out such fluctuations, if the peak found corresponded to a frequency value within twenty Hertz of the last estimate obtained, the frequency estimate for the current window was halved. This gives far more consistent results. If the peak found by autocorrelation is high enough --- compared to a threshold which is higher if frequency is wildly fluctuating or the previous window is judged to be unvoiced --- the estimated frequency is returned, otherwise the window is judged to be unvoiced. Utterances often ended with fricatives or exhalations of air with no voicing. To analyse the portion of the utterance 200ms before the end would often yield no useful information about voicing changes. The method used, therefore, was to move back in 5ms steps, but only start recording fundamental frequency values for analysis after 2 successive windows had been determined by autocorrelation to be voiced. If this has not occurred within 300ms back from the point specified by the classification routine, the pitch movement is deemed to be indeterminate. Otherwise, 36 successive fundamental frequency values are recorded (or zeros if they are deemed to be unvoiced). These are subjected to a smoothing process in the reverse order from which they were recorded (ie ``forwards'' in time rather than backwards) which once again substitutes half of an estimated fundamental frequency value for the estimate if the half value is within 10Hz of the previous value, and passed to the frequency movement classification routine. This routine crudely splits the 36 values into three groups of 12. If any group of 12 has all zero values, the movement type is classified indeterminate, otherwise the means of each section are compared. A frequency ``change'' is noted if the mean of a section is more than 10% altered from the mean of a previous section. This is a totally arbitrary method, but is no less valid than applying a t-test or any complicated and time consuming statistical analysis. The pitch movement type is classified as ``rise-fall'' if the mean frequency is at least 10% higher in the second section than the first, and at least 10% lower in the third section than the second. ``fall-rise'' is returned when the second section is lower than the first and the third section is higher than the second. A ``fall'' is registered if there are no ``changes'' noted other than in a downwards direction: either the second section is lower than the first and the third is lower than the second, or the second does not ``change'' from the first, but the third is lower than either the second or the first, or the second is lower than the first, but the third does not change from the second. Similarly, ``rise'' is returned if there is at least one upward change noted and no downward ones. ``flat'' is returned otherwise. In the light of this, explanation of the relationship between pitch movements classified by the program should be seen as being only loosely related to the pitch movements a phonetician might perceptually categorise the speech that is analysed as having. As well as movement classification, this part of the program returns the first, last, and mean fundamental frequency values recorded in the analysed section of speech.

Next: Appendix 1 Up: Assessment methodologies and Previous: Assessing speaker verification

WWW Administrator
Fri May 19 11:53:36 MET DST 1995