
Appendix 2

This document describes the method we have used to analyse two-channel conversation data. The data has been digitised at 5 kHz and low-pass filtered at 2025 Hz. This sampling rate was used rather than the more usual 10 kHz for reasons of practicality: some of the conversations are over 10 minutes long, and it proved difficult to reserve the requisite amount of contiguous disc space for a direct disc transfer of this length. Each conversation is stored in a single file, with samples from the two channels interlaced.
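As an illustration of the storage layout described above, the following is a minimal sketch of separating the two channels of an interlaced file. The text does not give the sample format or say which channel comes first, so the even/odd ordering and the helper names `split_channels` and `samples_in_file` are assumptions made for this example only.

```python
SAMPLE_RATE_HZ = 5000  # sampling rate stated in the text


def split_channels(interlaced):
    """Split an interlaced two-channel sample sequence into two lists.

    Assumption: samples at even positions belong to one channel and
    samples at odd positions to the other. The source does not say
    which channel is stored first.
    """
    return list(interlaced[0::2]), list(interlaced[1::2])


def samples_in_file(minutes, rate_hz=SAMPLE_RATE_HZ, channels=2):
    """Number of samples needed to hold a recording of the given length."""
    return int(minutes * 60 * rate_hz * channels)
```

Under these assumptions a ten-minute conversation occupies 6,000,000 samples across both channels, half of what a 10 kHz recording of the same length would need.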

This analysis method may be broken up into two stages:

Locating sections of speech
The first task is to locate the beginning and end points of connected utterances. This is done using an algorithm based on Rabiner and Schafer's algorithm for the detection of the start and end points of isolated utterances. The signal on a channel is analysed in 10ms windows. The presence or absence of speech is determined not only by the absolute amplitude of the signal in the window, but also by the number of zero crossings. This latter measure is included so that low energy fricatives at the beginning and end of utterances are not ignored. Pauses of under 50ms are ignored by the algorithm. The algorithm takes one pass through the data. Start and end times of utterances are written to a file. There are two such files, one for each channel, each containing the start and end times of the utterances on that channel. An utterance which is ongoing in the last window of the file is deemed to end there. This algorithm appears to work quite well, but it cannot tell the difference between ``speech'' and other assorted noises made by the person recorded on a particular channel (sneezes, coughs, etc.). This does not appear to be a problem, as most of the ``non speech'' utterances (``mm's'', ``uh-huhs'', etc.) appear to be part of the dialogue. The files containing start and end times are used as input to the next phase of the analysis.
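The windowing scheme described above can be sketched roughly as follows. This is a simplified illustration and not the algorithm actually used: it marks a 10ms window as speech if either the mean absolute amplitude or the zero-crossing count reaches a threshold, and then merges utterances separated by pauses of under 50ms. The threshold values and the function names are hypothetical, and the pause merging is done in a second step over the utterance list rather than in a single pass over the data.

```python
WINDOW_MS = 10      # analysis window length, from the text
MIN_PAUSE_MS = 50   # shorter pauses are ignored, from the text


def window_is_speech(window, amp_threshold, zc_threshold):
    """Decide whether one window contains speech (simplified rule)."""
    mean_abs = sum(abs(s) for s in window) / len(window)
    # Count sign changes between neighbouring samples.
    crossings = sum(1 for a, b in zip(window, window[1:])
                    if (a < 0) != (b < 0))
    return mean_abs >= amp_threshold or crossings >= zc_threshold


def find_utterances(samples, rate_hz=5000,
                    amp_threshold=500, zc_threshold=20):
    """Return (start_ms, end_ms) pairs for one channel.

    The threshold defaults are hypothetical; in practice they would be
    derived from the background noise of the recording.
    """
    win = rate_hz * WINDOW_MS // 1000
    flags = [window_is_speech(samples[i:i + win],
                              amp_threshold, zc_threshold)
             for i in range(0, len(samples) - win + 1, win)]

    raw = []
    start = None
    for idx, speech in enumerate(flags):
        if speech and start is None:
            start = idx
        elif not speech and start is not None:
            raw.append([start * WINDOW_MS, idx * WINDOW_MS])
            start = None
    if start is not None:
        # An utterance still ongoing in the last window ends there.
        raw.append([start * WINDOW_MS, len(flags) * WINDOW_MS])

    merged = []
    for s, e in raw:
        if merged and s - merged[-1][1] < MIN_PAUSE_MS:
            merged[-1][1] = e   # pause under 50ms: treat as one utterance
        else:
            merged.append([s, e])
    return [tuple(u) for u in merged]
```

Note that a zero-crossing test on its own also fires on low-level background noise, so a practical implementation would need some safeguard that this sketch omits.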
Determining the status of an utterance
Using the start and end time data from the utterance detection module, a set of logical rules is applied to determine for each utterance whether it constitutes a turn taken (an utterance occurs on a channel after a pause, where the last utterance was on the other channel), or an interruption (an utterance on the other channel was ongoing when the speaker began). Interruptions are further subdivided into ``successful'' interruptions (the interrupter's utterance continues after the end of the utterance that was interrupted), ``unsuccessful'' interruptions (the interrupted utterance continues past the end of that of the interrupter) and interruptions where both the interrupting and interrupted speakers finish their utterances simultaneously. ``Simultaneous'', in this study, is only an approximate term due to the 10ms analysis window method used in computing start and end times. Cases where a speaker begins an utterance, and there has been no activity on the other channel since the end of their last utterance (``Speaker continues after pause''), were not of interest in this study, as it is assumed that they were not cued by the other speaker's intonation. Where two speakers begin an utterance simultaneously, this counts as continuation after a pause for the speaker whose last utterance ended most recently (``the last person to speak''), and as a turn take for the other speaker. If, in such a case, both speakers had also ended their last utterances simultaneously, a turn take would be recorded for both of them. If the beginning of a speaker's utterance is coincident with the end of the other speaker's utterance, this is deemed to be an interruption rather than a turn take. The first utterance in a conversation is not counted.
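The rules above might be expressed along the following lines. This is a sketch under stated assumptions rather than the program used in the study: utterances are taken to be (start, end) pairs in milliseconds, the function name `classify` and the label strings are invented for the example, and the case where both speakers last finished simultaneously but only one then starts is treated as a turn take, which the text does not specify.

```python
def classify(own, other):
    """Label each (start, end) utterance on one channel.

    `own` and `other` are lists of (start_ms, end_ms) pairs for the
    channel being labelled and for the opposite channel.
    """
    labels = []
    for start, end in own:
        # An utterance on the other channel is ongoing if it began
        # earlier and has not ended before this one starts. A start
        # coincident with the other's end counts as an interruption.
        ongoing = next(((s, e) for s, e in other if s < start <= e), None)
        if ongoing is not None:
            if end > ongoing[1]:
                label = "successful interruption"
            elif end < ongoing[1]:
                label = "unsuccessful interruption"
            else:
                label = "interruption, simultaneous finish"
        else:
            own_prev = max((e for s, e in own if s < start), default=None)
            other_prev = max((e for s, e in other if s < start), default=None)
            if other_prev is None:
                # Nothing has happened on the other channel yet.
                label = ("not counted" if own_prev is None
                         else "continues after pause")
            elif own_prev is None or other_prev >= own_prev:
                # Assumption: equal end times are treated as a turn take.
                label = "turn take"
            else:
                label = "continues after pause"
        labels.append(label)
    return labels
```

Because an utterance that starts at the same instant on the other channel is neither ongoing nor earlier, the simultaneous-start rule falls out of the comparison of previous end times: the speaker who finished most recently is labelled as continuing, and the other as taking a turn.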
One pass is made through the start and end times data. When an interesting utterance (turn or interruption) is found, the frequency analysis subroutine is called to analyse the part of the utterance on the other channel which immediately precedes the turn/interruption. In the case of a turn take, the offset passed to the frequency analysis is that of the end of the last utterance on the other channel. In the case of an interruption, the offset is the point where the interruption occurred.
The ``latency'' figure in the results has a different meaning depending on whether it applies to a turn take or an interruption. In the case of a turn take, it represents the perceived duration of the pause between the end of the last utterance on the other channel and the beginning of the turn taking utterance. In the case of an interruption, it represents the duration of the period where utterances are perceived to be occurring on both channels.
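The two meanings of the offset and the ``latency'' figure can be made concrete with a small sketch. The function names and the dictionary layout are hypothetical, times are assumed to be in milliseconds, and the overlap for an interruption is taken to end when either utterance ends.

```python
def turn_take_result(start, other_last_end):
    """Offset and latency for a turn take.

    The offset is the end of the last utterance on the other channel;
    the latency is the pause between that end and the new utterance.
    """
    return {"offset": other_last_end, "latency": start - other_last_end}


def interruption_result(start, end, other_end):
    """Offset and latency for an interruption.

    The offset is the point where the interruption began; the latency
    is the time during which both channels carry an utterance.
    """
    return {"offset": start, "latency": min(end, other_end) - start}
```

For example, a turn-taking utterance starting at 1500ms after the other speaker stopped at 1400ms has a latency of 100ms, while an interruption starting at 1800ms into an utterance that ends at 2000ms has a latency of 200ms if the interrupter carries on beyond that point.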



WWW Administrator
Fri May 19 11:53:36 MET DST 1995