
Storage and design of the data base

Data types for a speech data base

The previous chapter and section described the data collection and the representation of the speech. This part describes how speech, speaker information, information about data collection, and transcriptions may be stored in a data base, that is, a structured speech corpus. The design of the data base again depends very much on its future use.

Many additional sources of information can be gathered apart from the basic acoustic speech signal. Whatever choices of speakers, speech material, and recording conditions are made, it is crucial that the collection procedure is documented as thoroughly as possible. It is good practice to record all possible details about, for instance, sex and age of speakers, type of speech material (isolated words, sentences, discourse, etc.), place of recording (in a laboratory, on location, etc.), type of microphone, and recording medium. There are two reasons for this. First, although you may not be interested in specific information at the time, it can turn out to be important at a later stage, and by then it is often difficult or impossible to recover. Second, a well documented speech corpus may also be used for other research. The following list summarises the most common information sources that may be useful for a speech corpus:

All these information sources must be stored in such a way that potential users of the speech corpus can get access to the speech and the speech-related data in an efficient and easy-to-use manner.

Storage demands

The sampled speech signals form the core of the speech corpus. Since these data do not need to be changed, they can conveniently be stored on static media, such as optical disks. Fixed speech-related information, such as speaker descriptors, can also be stored on a static medium. However, speech-related information that is subject to change, such as transcriptions, should preferably be stored on a rewritable medium, for instance a high capacity computer disk. Analyses of the sampled speech data, such as FFT or formant extraction, could be done "on line", provided that suitable signal processing software is available. In this way permanent storage of analysis results can be avoided, leading to considerable savings in storage.

Storage medium

Nowadays, CD-ROMs are most frequently used for the distribution of speech corpora. A CD-ROM can hold 640 Mbyte. The amount of speech it actually contains depends, of course, on the sampling rate and compression techniques used. The file system needed for a CD-ROM depends on the computer system that one wants to use. This makes the use of CD-ROMs slightly problematic, since it is impossible to standardise the file system on them. For example, if one plans to use a CD-ROM on a Personal Computer running DOS, one must make sure that the file names have the appropriate 8.3 format. On-line access to speech corpora (often used for text corpora) is not advocated here, since it is still very time consuming at this moment. DAT tapes can be used to store the corpus, but these tapes are not safe enough. One can also use Exabyte tapes, but this medium is not very accessible.
In the table below we present an overview of storage media, together with their properties:

  
Table: Storage media classification:

In the near future, workstations are expected to be equipped with the same number of devices as today (i.e. still one hard disk and one CD), but of larger capacity or higher speed. In general, a 2- to 8-fold improvement can be expected. Optical media, including CD-ROM, will profit from the shorter wavelength lasers currently being developed in research labs, but it will take until 1998 for higher capacity devices to reach the market. Not much will change for tapes.

Sampling rates

If the original recordings have been made on DAT (which is strongly recommended), and the purpose of the corpus does not demand that the speech is stored with DAT quality at a sampling frequency of 48 kHz, the sampling frequency can easily be reduced to 16 kHz. Rate conversion to 20 kHz, which seems to be the de-facto UK/EC standard for distribution on CD-ROM, can also be done without difficulty.
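
As a rough illustration of how the sampling frequency determines the amount of speech that fits on one CD-ROM, the following Python sketch computes the recording time for uncompressed speech. It assumes 16-bit (2-byte) mono samples and takes 1 Mbyte to be 2^20 bytes; neither assumption is stated in the text, and the function name is purely illustrative.

```python
CD_ROM_BYTES = 640 * 2**20  # 640 Mbyte, assuming 1 Mbyte = 2**20 bytes


def hours_of_speech(sampling_rate_hz, bytes_per_sample=2, channels=1,
                    capacity_bytes=CD_ROM_BYTES):
    """Return the hours of uncompressed speech that fit in capacity_bytes."""
    bytes_per_second = sampling_rate_hz * bytes_per_sample * channels
    return capacity_bytes / bytes_per_second / 3600.0


# Compare the three sampling frequencies mentioned in the text.
for rate in (48000, 20000, 16000):
    print(f"{rate / 1000:g} kHz: {hours_of_speech(rate):.2f} h")
```

Under these assumptions a CD-ROM holds a little under 2 hours of speech at 48 kHz, about 4.7 hours at 20 kHz, and about 5.8 hours at 16 kHz, before any compression is applied.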

Compression

Sampled data files can be compressed, which saves on average 50% of the storage space. The amount of compression that can be achieved depends on the signal-to-noise ratio of the recordings: a higher signal-to-noise ratio allows better compression. Compression of telephone speech (A-law or μ-law) yields a reduction of about 40%, as was experienced during the storage of the Dutch POLYPHONE material. A compression tool that is often used and is recommended here is Shorten by T. Robinson. It is available through anonymous ftp (svr-ftp.eng.cam.ac.uk).

How to combine speech and speech-related data

The simplest way to combine speech and speech-related data is to code information in the name of the sampled data file. For example, the filename "w3_s7.sdf" could be used for the speech recording of the third word (w3) in a word list, uttered by the seventh speaker (s7). The extension ".sdf" indicates that the file contains sampled data. Clearly, the possibilities for coding information in a filename are very limited. Much more space for additional speech-related information is available in the header of a sampled data file. Various data about, for instance, the speaker (sex, age, weight, height, etc.), the microphone type, or the recording medium can be stored in the file header. Frequently used headers are NIST Sphere headers and SAM headers. Examples of these are presented in Appendix B. The choice of a specific header may depend on the future use of the speech corpus. For example, if one plans to use the speech analysis program X-waves, the ESPS header used by this program can be chosen.
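
The filename convention of the example can be decoded mechanically. The following Python sketch does so for names of the form "w3_s7.sdf"; the regular expression and the function name are assumptions made for this illustration and are not part of any standard.

```python
import re

# Hypothetical pattern for names such as "w3_s7.sdf":
# w<word index>_s<speaker index>.sdf
FILENAME_PATTERN = re.compile(r"^w(\d+)_s(\d+)\.sdf$")


def parse_filename(filename):
    """Return (word_index, speaker_index) coded in a sampled data file name."""
    match = FILENAME_PATTERN.match(filename)
    if match is None:
        raise ValueError(f"not a recognised sampled data file name: {filename!r}")
    return int(match.group(1)), int(match.group(2))


print(parse_filename("w3_s7.sdf"))  # prints (3, 7)
```

Any further item of information would require a change to the naming scheme and to every program that decodes it, which shows how quickly this approach reaches its limits.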

When the amount of speech-related information becomes very large, it is recommended to use a separate file (or perhaps even more than one) for the storage of these data. The drawback of this method is that the speech-related information is no longer physically connected to the speech data it refers to. But with a proper administration in the database structure, this should pose no problems. An example of a data description file, containing all kinds of additional information that could be gathered along with the speech recordings, can be found in Millar et al. (1994).

Database Management Systems

A Database Management System (DBMS) is a software system with the following properties:

In the past, many spoken language (SL) corpora have been stored in the file management system provided by the operating system. Data is then accessed via shell scripts, custom programs, and application packages, e.g. spreadsheets, statistics programs, etc.

However, there are many disadvantages to this approach:

Clearly, this approach should not be followed.

Data model

A data model is a formal description of entities and their relationships, and of operations allowed on the entities. A data model is independent of a specific DBMS implementation. An instance of a data model or database schema is a data definition based on the data types and the data manipulation commands provided by the DBMS.

Data modelling is the process of describing the world of interest in the terms of a data model.

The following data models can be discerned:

The development of the data models can be characterised as a) continued abstraction from physical storage, and b) increase in expressive power.

Hierarchical data model

  In the hierarchical data model, entities (for instance, speakers, recordings, or types of speech) are considered as record types. Record types are subdivided into fields. A record stores the information of one particular entity.

Record types are organised in hierarchical tree structures. Except for the topmost record type (the root), each record type in the tree, a node, has exactly one predecessor, and zero or more successor record types. A node with zero successors is a leaf. A record belongs to exactly one record type, and it may be linked to zero or more records in the successor record types.

The hierarchical data model thus considers all relationships between entities as 1:N relationships. Access to record types is possible by navigation through the tree, starting at the root; records are selected from the set of records of a record type.
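
The tree organisation of record types can be sketched in a few lines of Python. This is only an illustration of the structure, not of a real hierarchical DBMS; the class and the record type names (CORPUS, SPEAKER, RECORDING, TRANSCRIPTION) are hypothetical.

```python
class RecordType:
    """A node in the hierarchy: at most one predecessor, any number of successors."""

    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent      # None only for the root
        self.children = []        # successor record types
        if parent is not None:
            parent.children.append(self)

    def is_leaf(self):
        return not self.children

    def path_from_root(self):
        """Return the names visited when navigating from the root to this node."""
        node, path = self, []
        while node is not None:
            path.append(node.name)
            node = node.parent
        return list(reversed(path))


# Hypothetical hierarchy for a speech corpus.
corpus = RecordType("CORPUS")
speaker = RecordType("SPEAKER", parent=corpus)
recording = RecordType("RECORDING", parent=speaker)
transcription = RecordType("TRANSCRIPTION", parent=recording)

print(transcription.path_from_root())
```

Because every node has exactly one predecessor, a transcription can only be reached through one recording and one speaker; a relationship in which one recording involves several speakers cannot be expressed directly in this structure.
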
Summarising, the hierarchical data model has the following properties:

Network data model

In the network data model, record types are organised in a directed graph structure. A record type may have zero or more predecessor and successor record types. As in the hierarchical data model, the information on an entity is held in records. Unlike fields in the hierarchical data model, the fields of a record in the network data model may have multiple values.

A set type describes a 1:N relationship between two record types. Set types have names to distinguish the relationships of a record type. Since a record type may be linked to any other record type, complex relationships, e.g. N:M relationships, may be represented.

Access to record types is possible via special "entry point" record types from which dependent record types can be reached through navigation.

Summarising, the network data model has the following properties:

Relational data model

The relational model is a data model with a thorough mathematical foundation. An attribute is a name associated with a set of atomic values, its domain. A relation table is a subset of the cross-product of the domains of its attributes.
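
The definition of a relation as a subset of a cross-product can be made concrete with a small Python sketch. The two attribute domains and the tuples are invented for the illustration.

```python
from itertools import product

# Hypothetical attribute domains.
SEX = {"F", "M"}
MEDIUM = {"DAT", "CD-ROM"}

# The cross-product contains every possible combination of values.
cross_product = set(product(SEX, MEDIUM))

# A relation table holds only the combinations that actually occur.
relation = {("F", "DAT"), ("M", "DAT")}

print(len(cross_product))         # number of possible tuples
print(relation <= cross_product)  # True if the relation is a valid subset
```

Since a relation is a set, it contains no duplicate tuples and the order of the tuples carries no meaning.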

The relational data model separates the logical data definition from physical storage.

Two data manipulation languages have been developed for the relational data model:

The relational data model makes no assumptions about the relationships between entities. In general, an entity is mapped to a relation table, with the properties of an entity described through the relation attributes. Meaningful relationships between entities are expressed through relational algebra operations or calculus expressions. Hence, complex relationships between entities can be expressed, with the notable exception of recursive relationships.

SQL

Relational algebra and calculus are formal data manipulation languages. As such they lack data definition constructs, and because of their formal nature, they are not well suited for interactive data manipulation. The language SQL (Structured Query Language) is a relational database language which covers both data definition and manipulation and is close to English. It has become the de-facto language standard for relational DBMSs, and has been formally standardised by the ISO.

Example

The following example is a sample speaker and recording database schema implemented in SQL. First, relation tables are created using the data definition command CREATE with the appropriate arguments:

CREATE TABLE SPEAKER (
   ID CHAR(8) PRIMARY KEY,
   NAME CHAR(20),
   FNAME CHAR(20),
   SEX CHAR(1),
   DBIRTH DATE);

CREATE TABLE RECORDING (
   ID DECIMAL PRIMARY KEY,
   RECDATE DATE,
   MEDIUM CHAR(8),
   LOCATION CHAR(20),
   SPK CHAR(8),
   FOREIGN KEY (SPK) REFERENCES SPEAKER);

The data types available in SQL are restricted to very simple character or number types of fixed length. Bit-stream data and complex data structures are not supported.

Then queries are formulated using the data manipulation language of SQL:

SELECT S.ID, S.NAME, S.FNAME, R.RECDATE
FROM SPEAKER S, RECORDING R
WHERE S.DBIRTH > DATE '1960-12-27' AND R.SPK = S.ID;

Besides being a database language for interactive database access, SQL has become, through its standardisation, increasingly popular as a programming language interface. External applications generate SQL code, which is then transmitted to the DBMS and evaluated there. The result relation is returned to the calling application for further processing (either as a whole or one tuple after the other with a cursor mechanism).
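
The use of SQL as a programming language interface can be sketched with Python's standard sqlite3 module. The sketch builds the speaker and recording schema in an in-memory database, sends the query as SQL text, and reads the result one tuple at a time through a cursor. SQLite is used here only because it needs no installation; it is not mentioned in the text. The sample rows are invented, and the date is compared as an ISO-formatted string because SQLite stores dates as text.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite checks foreign keys only on request

conn.executescript("""
CREATE TABLE SPEAKER (
   ID CHAR(8) PRIMARY KEY,
   NAME CHAR(20),
   FNAME CHAR(20),
   SEX CHAR(1),
   DBIRTH DATE);
CREATE TABLE RECORDING (
   ID DECIMAL PRIMARY KEY,
   RECDATE DATE,
   MEDIUM CHAR(8),
   LOCATION CHAR(20),
   SPK CHAR(8),
   FOREIGN KEY (SPK) REFERENCES SPEAKER);
""")

# Invented sample data, for illustration only.
conn.executemany("INSERT INTO SPEAKER VALUES (?, ?, ?, ?, ?)", [
    ("spk00001", "Smith", "Ann", "F", "1965-03-14"),
    ("spk00002", "Jones", "Bob", "M", "1950-07-02"),
])
conn.executemany("INSERT INTO RECORDING VALUES (?, ?, ?, ?, ?)", [
    (1, "1995-01-10", "DAT", "laboratory", "spk00001"),
    (2, "1995-01-11", "DAT", "on location", "spk00002"),
])

# The application passes SQL text to the DBMS and receives a cursor.
cursor = conn.execute("""
SELECT S.ID, S.NAME, S.FNAME, R.RECDATE
FROM SPEAKER S, RECORDING R
WHERE S.DBIRTH > '1960-12-27' AND R.SPK = S.ID
""")

rows = []
for row in cursor:  # one tuple after the other
    rows.append(row)
    print(row)
```

Only the speaker born after 27 December 1960 appears in the result, together with the date of her recording.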

Most commercial relational DBMSs support SQL, but add their own extensions to overcome the limitations of the standard data types, e.g. binary large objects (BLOBs) for image, audio or video data, or complex data structures for graphics objects.
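
As an illustration of such an extension, the following Python sketch stores a short byte sequence in a BLOB column using the standard sqlite3 module. SQLite is chosen only for convenience; the table name SIGNAL and its columns are hypothetical, and the six bytes merely stand in for sampled speech data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE SIGNAL (REC_ID INTEGER PRIMARY KEY, SAMPLES BLOB)")

# Stand-in for the contents of a sampled data file.
samples = bytes([0, 1, 2, 3, 254, 255])
conn.execute("INSERT INTO SIGNAL VALUES (?, ?)", (1, samples))

# The bytes are returned unchanged when the row is read back.
stored = conn.execute(
    "SELECT SAMPLES FROM SIGNAL WHERE REC_ID = 1").fetchone()[0]
print(stored == samples)
```

The DBMS treats the BLOB as an opaque sequence of bytes: it can store and return the signal, but it cannot by itself select recordings on properties of the signal.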

Summary

Object-oriented data model

The object-oriented data model aims at bridging the semantic gap between relation tables and entities of the real world through objects that directly correspond to entities. An object has a unique and immutable object identifier, and it belongs to a class. Classes are object definitions; they comprise attributes and operations over the class or the attributes. Both attributes and operations may be private to a class, i.e. visible only to objects belonging to that class (encapsulation), or public, i.e. visible to other classes. Classes are organised in class hierarchies and may inherit attributes and operations from superclasses. Operations are invoked by sending messages to an object (message passing); a message is executed if the required operation is defined for the object, otherwise it is passed on to the superclass.
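
The notions of class, inheritance and encapsulation can be illustrated with a short Python sketch. Python is a programming language, not an object-oriented DBMS, so the objects below are not persistent, and privacy is indicated only by the naming convention of a leading underscore. The class names and the 8000 Hz sampling frequency for telephone speech are assumptions made for the example.

```python
class Recording:
    """Superclass: attributes and operations shared by all recordings."""

    def __init__(self, medium, sampling_rate_hz):
        self.medium = medium                       # public attribute
        self._sampling_rate_hz = sampling_rate_hz  # private by convention

    def describe(self):
        return f"{self.medium} recording at {self._sampling_rate_hz} Hz"


class TelephoneRecording(Recording):
    """Subclass: inherits describe() and adds a coding attribute."""

    def __init__(self, coding):
        super().__init__("telephone", 8000)
        self.coding = coding


rec = TelephoneRecording("A-law")
# describe() is not defined in TelephoneRecording, so the call is
# resolved in the superclass Recording.
print(rec.describe())
```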

In general, object-oriented DBMSs are integrated into an object-oriented programming language, giving them the full expressive power of the programming language and persistent object storage. Object-oriented DBMSs have just entered the marketplace, and they are successful in non-standard applications which require the full expressive power of programming languages, or complex and highly diverse data structures.

Summary

Deductive data model

  The deductive data model is a restricted first-order predicate logic extension of the relational data model. In the deductive data model, relations are either defined extensionally through facts, or intensionally through rules. Rules may be defined recursively.
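
The difference between extensional facts and an intensional, recursive rule can be sketched in Python. The facts state which recording was derived directly from which other recording; the rule defines "originates from" as the transitive closure of these facts. The relation and the recording names are invented for the illustration, and the simple fixpoint loop below only imitates what a deductive DBMS would do.

```python
# Extensional relation (facts): derived_from(X, Y) means that
# recording X was produced directly from recording Y.
derived_from = {
    ("rec_16k", "rec_48k"),
    ("rec_16k_compressed", "rec_16k"),
}


def transitive_closure(facts):
    """Intensional relation, defined by two rules:
         originates_from(X, Y) :- derived_from(X, Y).
         originates_from(X, Z) :- derived_from(X, Y), originates_from(Y, Z).
    The rules are applied repeatedly until no new tuples appear."""
    closure = set(facts)
    while True:
        new = {(x, z)
               for (x, y) in facts
               for (y2, z) in closure
               if y == y2}
        if new <= closure:
            return closure
        closure |= new


originates_from = transitive_closure(derived_from)
print(sorted(originates_from))
```

The tuple linking the compressed file to the original 48 kHz recording is not stored anywhere; it is derived by the recursive rule.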

Various formal languages have been developed for the deductive data model, e.g. Datalog, a function-free sublanguage of first-order predicate logic, or an extension of Datalog which includes negation.

Logic programming languages, e.g. Prolog, extend Datalog with complex data structures. Current Prolog systems feature access to external DBMSs or an internal database component for the persistent storage of large amounts of data.

Summary

Safe storage of data

DBMSs provide safety features that prevent the loss of data due to hardware failure. The basic mechanism is that of transactions. A transaction is a sequence of data definition or manipulation commands that is considered as atomic by the DBMS; a transaction either succeeds completely, or fails and undoes all data changes (rollback).

During a transaction, the DBMS works on a copy of the data; if the transaction succeeds, the copy is written to permanent storage, otherwise it is simply discarded.
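
The all-or-nothing behaviour of a transaction can be observed with Python's standard sqlite3 module. In this sketch the second insertion violates the primary key, so the whole transaction is rolled back, including the first insertion, which had succeeded. The table and the rows are invented, and the sketch shows only the visible effect of a rollback, not how a particular DBMS implements it internally.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE SPEAKER (ID CHAR(8) PRIMARY KEY, NAME CHAR(20))")
conn.commit()

try:
    # Used as a context manager, the connection commits if the block
    # succeeds and rolls back if an exception is raised inside it.
    with conn:
        conn.execute("INSERT INTO SPEAKER VALUES ('spk00001', 'Smith')")
        conn.execute("INSERT INTO SPEAKER VALUES ('spk00001', 'Jones')")  # duplicate key
except sqlite3.IntegrityError:
    pass  # the transaction has already been rolled back

count = conn.execute("SELECT COUNT(*) FROM SPEAKER").fetchone()[0]
print(count)  # prints 0: neither row was stored
```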

Transactions may be visible to the user. Other mechanisms to preserve the physical integrity of data are invisible:

Application-independent storage of data

Data held in a DBMS has a greater lifetime than the applications using the data. In fact, in most cases data will outlive the DBMS in which it is stored.

This means that data storage has to be organised independently of any application so that different applications (or generations of applications) can access the same data efficiently, and that import and export procedures have to be provided for the migration of data to new DBMSs.

In DBMSs, application independence is achieved by hiding the physical storage of data from the users. The DBMS automatically maps the storage requirements of a database schema to the file system of the operating system, creates a meta-schema which contains all information about the data stored in the DBMS and generates index files to speed up access to the data.

External applications cannot access data stored in a DBMS directly, e.g. by opening a file. On the one hand, this is a further security measure by which tampering with data is prevented, on the other hand this places the burden of determining efficient data storage organisation and optimising access to the data on the DBMS.

Controlled access to data

In general, DBMSs are multi-user systems, i.e. many users or applications may access the same data in parallel. In order to prevent the interference of user operations, access to the data is controlled by the DBMS through user-specific views and access privileges.

A user-specific view or sub-schema defines which entities and relationships are visible to the user. The entities and relationships visible to the user need not be identical to those of the global database schema.
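
A view can be sketched with Python's standard sqlite3 module. The view SPEAKER_ANON (a hypothetical name) exposes only the identifier and the sex of each speaker and hides the name and the date of birth. SQLite has no user accounts, so the sketch shows only the view itself and not the assignment of views to users.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE SPEAKER (
   ID CHAR(8) PRIMARY KEY,
   NAME CHAR(20),
   SEX CHAR(1),
   DBIRTH DATE);
INSERT INTO SPEAKER VALUES ('spk00001', 'Smith', 'F', '1965-03-14');

-- The view exposes only part of the global schema.
CREATE VIEW SPEAKER_ANON AS SELECT ID, SEX FROM SPEAKER;
""")

cursor = conn.execute("SELECT * FROM SPEAKER_ANON")
columns = [description[0] for description in cursor.description]
rows = cursor.fetchall()
print(columns, rows)
```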

Access privileges define what mode of access a user has to the data. Access privileges can be defined for the global database schema, the sub-schema, or individual entity sets (e.g. tuples in a relation table, objects in a class). The access mode is either read or write. In general, users can grant their access privileges to other users.

Summary

In SLP, DBMSs should be used for speech corpora of any size.

The special requirements of SLP, namely

cannot be met with the hierarchical and network data models.

Even the relational data model in its pure form is not well suited for SL-corpora because it lacks support for bit-stream data and complex data structures. However, all current commercial relational DBMS implementations support binary large objects and a richer type system, which makes them candidate DBMSs for SL-corpora. The major advantage of relational DBMSs is that they are now a standard and proven technology.

Object-oriented DBMSs, especially in conjunction with object-oriented programming languages, and persistent logic programming languages provide a rich inventory of data types, full computational power, and good support for the basic DBMS functionality. They are thus well suited for SL-corpora.

The major drawback of object-oriented DBMSs is that there is not yet a common standard object-oriented database language. The major drawback of persistent logic programming languages is that there exist only a few implementations of such systems. However, the logic based formalism is basically the same as in phonology and linguistics, and thus it is possible to use a single formalism on all levels from phonetics to linguistics.

The major advantages of a DBMS can thus be summarised as follows:

DBMSs are applied, for instance, in the GRECO project (cf. Carre 1992), in the Alvey STA project (described in Thomas and Winski 1987), and in the PhonDat-Verbmobil database of spoken German (cf. Draxler 1995).

It will be clear that only the basic characteristics of DBMSs are mentioned above. The interested reader should consult Elmasri and Navathe (1989), Ceri, Gottlob and Tanca (1990), or any other recent introductory database textbook.






WWW Administrator
Fri May 19 11:53:36 MET DST 1995