
Storage and design of the data base

Data types for a speech data base

The previous chapter and section described the data collection and the representation of the speech. This part describes how speech, speaker information, information about data collection, and transcriptions may be stored in a data base, that is, a structured speech corpus. The design of the data base again depends very much on its intended use.

Many additional sources of information can be gathered apart from the basic acoustic speech signal. Whatever choices of speakers, speech material, and recording conditions are made, it is always of crucial importance that the collection procedure is documented as thoroughly as possible. It is good practice to record all possible details about, for instance, sex and age of speakers, type of speech material (isolated words, sentences, discourse, etc.), place of recording (in a laboratory, on location, etc.), type of microphone, and recording medium. Although you may not be interested in specific information at the time, it can turn out to be important at a later stage, and by then it is often difficult or impossible to recover the information you need. Moreover, a well documented speech corpus may also be used for other research. The following list summarises the most common information sources that may be useful for a speech corpus:

Transduced signals
Examples: The acoustic speech signal, laryngograph signal, X-ray data.
Analysis results
Examples: FFT data, LPC data, filter bank data, pitch extraction, formant extraction.
Descriptors
Examples: Characteristics of the speakers, or the recording conditions.
Markers
Examples: Markers to indicate pitch periods, or the beginning of vowels.
Annotations/Labels
Examples: Orthographic, phonemic, or phonetic transcriptions.
Assessment parameters
Examples: Test material, assessment results.

All these information sources must be stored in such a way that potential users of the speech corpus can get access to the speech and the speech-related data in an efficient and easy-to-use manner.

Storage demands

The sampled speech signals form the core of the speech corpus. Since these data do not need to be changed, they can conveniently be stored on static media, such as optical disks. Fixed speech-related information, such as speaker descriptors, can also be stored on a static medium. However, speech-related information that is subject to change, such as transcriptions, should preferably be stored on a rewritable medium, for instance a high capacity computer disk. Analyses of the sampled speech data, such as FFT or formant extraction, can be carried out "on line", provided that suitable signal processing software is available. In this way permanent storage of analysis results can be avoided, which saves a considerable amount of storage.

Storage medium

Nowadays, CD-ROMs are most frequently used for the distribution of speech corpora. A CD-ROM can contain 640 Mbyte. The amount of speech it actually holds depends, of course, on the sampling rate and compression techniques used. The file system needed for a CD-ROM depends on the computer system on which it is to be read. This makes the use of CD-ROMs slightly problematic, since the file system cannot be standardised across platforms. For example, if a CD-ROM is to be used on a Personal Computer running DOS, the file names must conform to the eight.three (8.3) format. On-line access to speech corpora (often used for text corpora) is not recommended, since it is still very time consuming at present. DAT tapes can be used to store the corpus, but these tapes are not reliable enough. Exabyte tapes can also be used, but this medium is not very accessible.
The following table presents an overview of storage media, together with their properties:

Table: Storage media classification:

In the near future, workstations are expected to be equipped with the same number of devices as today (i.e. still one hard disk and one CD drive), but of larger capacity or higher speed. In general, a 2- to 8-fold improvement can be expected. Optical media, including CD-ROM, will profit from the shorter-wavelength lasers currently being developed in research labs, but it will take until 1998 for higher-capacity devices to reach the market. Not much will change for tapes.

Sampling rates

If the original recordings have been made on DAT (which is strongly recommended), and the purpose of the corpus does not demand that the speech is stored with DAT quality at a sampling frequency of 48 kHz, sampling frequency can easily be reduced to 16 kHz. Rate conversion to 20 kHz, which seems to be the de-facto UK/EC standard for distribution on CD-ROM, can also be done without difficulty.
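The effect of the sampling rate on the amount of speech that fits on a 640 Mbyte CD-ROM can be estimated with a few lines of arithmetic. The following is a rough sketch only: the sample width (16-bit linear PCM), the single channel, the interpretation of 1 Mbyte as 2**20 bytes, and the function name are assumptions made for illustration, and file-system and header overhead is ignored.

```python
# Rough capacity estimate for a 640 Mbyte CD-ROM.
# Assumptions (not taken from the text): 16-bit linear PCM, one channel,
# 1 Mbyte = 2**20 bytes, no file-system or header overhead.

CD_ROM_BYTES = 640 * 2**20
BYTES_PER_SAMPLE = 2  # 16-bit samples (assumed)


def hours_of_speech(sampling_rate_hz, saving=0.0):
    """Return the hours of mono speech that fit on the disc.

    `saving` is the fraction of storage saved by compression,
    e.g. 0.5 for a 50% saving (a hypothetical example value).
    """
    bytes_per_second = sampling_rate_hz * BYTES_PER_SAMPLE * (1.0 - saving)
    return CD_ROM_BYTES / bytes_per_second / 3600.0


for rate in (48000, 20000, 16000):
    print(f"{rate / 1000:g} kHz: {hours_of_speech(rate):.1f} h uncompressed, "
          f"{hours_of_speech(rate, saving=0.5):.1f} h with a 50% saving")
```

Under these assumptions a disc holds roughly 1.9 hours of speech at 48 kHz and roughly 5.8 hours at 16 kHz, which illustrates why down-sampling is attractive when DAT quality is not required.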

Compression

Sampled data files can be compressed, which saves on average 50% of the storage space. The amount of compression that can be achieved depends on the signal-to-noise ratio of the recordings: a higher signal-to-noise ratio allows better compression. Telephone speech (A-law or μ-law) can be compressed by about 40%, as was found during the storage of the Dutch POLYPHONE material. A frequently used compression tool, which is recommended here, is Shorten by T. Robinson. It is available through anonymous ftp (svr-ftp.eng.cam.ac.uk).

How to combine speech and speech-related data

The simplest way to combine speech and speech-related data is to code information in the name of the sampled data file. For example, the filename "w3_s7.sdf" could be used for the speech recording of the third word (w3) in a word list, uttered by the seventh speaker (s7). The extension ".sdf" indicates that the file contains sampled data. Clearly, the possibilities for coding information in a filename are very limited. Much more space for additional speech-related information is available in the header of a sampled data file. Data about, for instance, the speaker (sex, age, weight, height, etc.), the microphone type, or the recording medium can be stored in the file header. Frequently used headers are NIST SPHERE headers and SAM headers. Examples of these are presented in Appendix B. The choice of a specific header may depend on the future use of the speech corpus. For example, if one plans to use the speech analysis program X-waves, the ESPS header used by this program can be chosen.
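As an illustration of the filename convention described above, the following sketch decodes names of the form "w3_s7.sdf". The helper name and the keys of the returned dictionary are hypothetical and chosen for illustration only.

```python
import re

# Pattern for names such as "w3_s7.sdf": w<word number>_s<speaker number>.sdf
# (the helper name and the returned keys are illustrative assumptions).
FILENAME_PATTERN = re.compile(r"^w(\d+)_s(\d+)\.sdf$")


def parse_sdf_name(filename):
    """Return the word and speaker numbers encoded in a sampled data file name."""
    match = FILENAME_PATTERN.match(filename)
    if match is None:
        raise ValueError(f"unexpected file name: {filename!r}")
    return {"word": int(match.group(1)), "speaker": int(match.group(2))}


print(parse_sdf_name("w3_s7.sdf"))
```

Any further property (recording session, microphone, repetition number) would need its own field in the name, which shows how quickly such a scheme becomes unwieldy.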

When the amount of speech-related information becomes very large, it is recommended to use a separate file (or perhaps even more than one) for the storage of these data. The drawback of this method is that the speech-related information is no longer physically connected to the speech data it refers to. With proper administration in the database structure, however, this should pose no problems. An example of a data description file, containing all kinds of additional information that could be gathered along with the speech recordings, can be found in Millar et al. (1994).

Database Management Systems

A Database Management System (DBMS) is a software system with the following properties:

Data definition and manipulation in a formal data model,
Safe storage of large amounts of data,
Application independent storage of data,
Controlled access to the data.

In the past, many spoken language (SL) corpora have been stored in the file management system provided by the operating system. Data is accessed via shell scripts, custom programs, and application packages, e.g. spreadsheets, statistics programs, etc.

However, there are many disadvantages to this approach:

Data is organised in an ad-hoc way,
Data structures reflect the underlying, physical data structures and the file system organisation,
No automatic protection against data corruption,
No access control to the data.

Clearly, this approach should not be followed.

Data model

A data model is a formal description of entities and their relationships, and of operations allowed on the entities. A data model is independent of a specific DBMS implementation. A database schema, i.e. an instance of a data model, is a data definition based on the data types and the data manipulation commands provided by the DBMS.

Data modelling is the process of describing the world of interest in the terms of a data model.

The following data models can be discerned:

Hierarchical,
Network,
Relational,
Object-oriented,
Deductive.

The development of the data models can be characterised as a) continued abstraction from physical storage, and b) increase in expressive power.

Hierarchical data model

In the hierarchical data model, entities (for instance, speakers, recordings, or types of speech) are considered as record types. Record types are subdivided into fields. A record stores the information of one particular entity.

Record types are organised in hierarchical tree structures. Except for the topmost record type (the root), each record type in the tree, a node, has exactly one predecessor, and zero or more successor record types. A node with zero successors is a leaf. A record belongs to exactly one record type, and it may be linked to zero or more records in the successor record types.

The hierarchical data model thus considers all relationships between entities as 1:N relationships. Access to record types is possible by navigation through the tree, starting at the root; records are selected from the set of records of a record type.
Summarising, the hierarchical data model has the following properties:

Hierarchical organisation of entities in 1:N relationships.
Access through navigation starting at the root record type.
Obsolete data model.
Implementations: IMS.
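The navigation described above, from the root record type down through successor record types, can be sketched as follows. This is a toy illustration and not the interface of IMS or any other hierarchical DBMS; the class, the function, and the sample record types are invented for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Record:
    """One record: its record type, its fields, and its successor records."""
    record_type: str
    fields: dict
    children: list = field(default_factory=list)

    def add(self, child):
        self.children.append(child)
        return child


def select(root, path, predicate=lambda record: True):
    """Navigate from the root along a path of successor record types.

    `path` lists the record types below the root, e.g.
    ["SPEAKER", "RECORDING"]; the records of the last type that
    satisfy `predicate` are returned.
    """
    current = [root]
    for record_type in path:
        current = [child for parent in current for child in parent.children
                   if child.record_type == record_type]
    return [record for record in current if predicate(record)]


# Hypothetical sample data: one corpus, two speakers, three recordings.
corpus = Record("CORPUS", {"name": "demo"})
s1 = corpus.add(Record("SPEAKER", {"id": "s1", "sex": "F"}))
s2 = corpus.add(Record("SPEAKER", {"id": "s2", "sex": "M"}))
s1.add(Record("RECORDING", {"file": "w1_s1.sdf", "medium": "DAT"}))
s1.add(Record("RECORDING", {"file": "w2_s1.sdf", "medium": "DAT"}))
s2.add(Record("RECORDING", {"file": "w1_s2.sdf", "medium": "CD"}))

dat = select(corpus, ["SPEAKER", "RECORDING"],
             lambda r: r.fields["medium"] == "DAT")
print([r.fields["file"] for r in dat])
```

Because every recording hangs below exactly one speaker, a question such as "which speakers read word 1" cannot be answered without visiting every speaker, which is the practical cost of a pure 1:N organisation.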

Network data model

In the network data model record types are organised in a directed graph structure. A record type may have zero or more predecessor and successor record types. As in the hierarchical data model, the information on an entity is held in records. Unlike fields in the hierarchical data model, the fields of a record in the network data model may have multiple values.

A set type describes a 1:N relationship between two record types. Set types have names to distinguish the relationships of a record type. Since a record type may be linked to any other record type, complex relationships, e.g. N:M relationships, may be represented.

Access to record types is possible via special "entry point" record types from which dependent record types can be reached through navigation.

Summarising, the network data model has the following properties:

Directed graph organisation of entities with named 1:N relationships between any two record types.
Access through navigation starting at special "entry point" record types.
Obsolete data model.
Implementations: xxx.
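The way named 1:N set types combine into an N:M relationship can be sketched with a small in-memory structure. The sketch below is illustrative only and does not reproduce the interface of any actual network DBMS; the class, the set type names, and the sample records are assumptions.

```python
from collections import defaultdict


class NetworkDB:
    """Records linked by named set types; each set type is a 1:N relationship."""

    def __init__(self):
        self.records = {}
        self.members_of = defaultdict(list)  # (set name, owner key) -> member keys
        self.owner_of = {}                   # (set name, member key) -> owner key

    def add_record(self, key, record_type, **fields):
        self.records[key] = {"type": record_type, **fields}

    def connect(self, set_name, owner, member):
        # A member has at most one owner per set type (1:N).
        if (set_name, member) in self.owner_of:
            raise ValueError(f"{member} already has an owner in {set_name}")
        self.members_of[(set_name, owner)].append(member)
        self.owner_of[(set_name, member)] = owner

    def members(self, set_name, owner):
        return list(self.members_of.get((set_name, owner), []))

    def owner(self, set_name, member):
        return self.owner_of[(set_name, member)]


db = NetworkDB()
db.add_record("s1", "SPEAKER", sex="F")
db.add_record("s2", "SPEAKER", sex="M")
db.add_record("digits", "WORDLIST")
db.add_record("cities", "WORDLIST")

# READING link records turn the N:M relationship between speakers and
# word lists into two named 1:N set types.
for link, speaker, wordlist in [("r1", "s1", "digits"),
                                ("r2", "s1", "cities"),
                                ("r3", "s2", "digits")]:
    db.add_record(link, "READING")
    db.connect("SPEAKER_READS", speaker, link)
    db.connect("LIST_READ_IN", wordlist, link)


def wordlists_read_by(speaker):
    """Navigate SPEAKER -> READING -> WORDLIST from a speaker entry point."""
    return [db.owner("LIST_READ_IN", link)
            for link in db.members("SPEAKER_READS", speaker)]


print(wordlists_read_by("s1"))
```

Queries remain navigational: each new question requires a new traversal routine that knows which set types to follow.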

Relational data model

The relational model is a data model with a thorough mathematical foundation. An attribute is a name associated with a set of atomic values, its domain. A relation table is a subset of the cross-product of the domains of its attributes.

The relational data model separates the logical data definition from physical storage.

Two data manipulation languages have been developed for the relational data model:

Relational algebra is a procedural language which consists of the set operators "union", "intersection", and "difference" and the relational operators "selection", "projection", and "join". The selection operator selects rows from a relation table that meet a selection condition. The projection operator selects specific attributes from a relation table. The join operator merges relation tables according to a comparison condition over attributes from the original relation tables.
The result of a relational operation is again a relation, so that the operators can be nested.
Relational calculus is a declarative language based on first-order predicate logic.
An expression is a term of the form {x1, ..., xn | COND(x1, ..., xn)}, with x1, ..., xn variables over a domain and COND a formula that is either true or false. The variable bindings on the left side of the vertical bar are the values returned as the result of a query, and COND consists of atomic formulas which are connected through logical AND and OR operators.
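The three relational operators named above can be sketched in a few lines, with a relation represented as a list of rows and each row as a dictionary from attribute names to values. The function names, the attribute names, and the sample data are assumptions for illustration; a real DBMS implements these operators very differently.

```python
def selection(relation, condition):
    """Keep the rows that meet the selection condition."""
    return [row for row in relation if condition(row)]


def projection(relation, attributes):
    """Keep only the named attributes; duplicate rows are dropped."""
    result = []
    for row in relation:
        projected = {name: row[name] for name in attributes}
        if projected not in result:
            result.append(projected)
    return result


def join(left, right, condition):
    """Merge every pair of rows that meets the comparison condition."""
    return [{**l, **r} for l in left for r in right if condition(l, r)]


# Hypothetical sample relations.
speakers = [
    {"SPK_ID": "s1", "NAME": "Smith", "SEX": "F"},
    {"SPK_ID": "s2", "NAME": "Jones", "SEX": "M"},
]
recordings = [
    {"REC_ID": 1, "MEDIUM": "DAT", "SPK": "s1"},
    {"REC_ID": 2, "MEDIUM": "CD", "SPK": "s2"},
    {"REC_ID": 3, "MEDIUM": "DAT", "SPK": "s1"},
]

# Each operator returns a relation again, so the calls can be nested.
result = projection(
    selection(
        join(speakers, recordings, lambda s, r: s["SPK_ID"] == r["SPK"]),
        lambda row: row["MEDIUM"] == "DAT",
    ),
    ["NAME", "REC_ID"],
)
print(result)
```

In calculus notation the same query would read roughly {name, rec | speaker and recording rows exist with matching speaker identifier and medium "DAT"}: the calculus states what is wanted, the algebra states how to compute it.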

The relational data model makes no assumptions about the relationships between entities. In general, an entity is mapped to a relation table, with the properties of an entity described through the relation attributes. Meaningful relationships between entities are expressed through relational algebra operations or calculus expressions. Hence, complex relationships between entities can be expressed, with the notable exception of recursive relationships.

SQL

Relational algebra and calculus are formal data manipulation languages. As such they lack data definition constructs, and because of their formal nature, they are not well suited for interactive data manipulation. The language SQL (Structured Query Language) is a relational database language which covers both data definition and manipulation and is close to English. It has become the de-facto language standard for relational DBMSs, and has been formally standardised by the ISO.

Example

The following example is a sample speaker and recording database schema implemented in SQL. First, relation tables are created using the data definition command CREATE with the appropriate arguments:

-- One row per speaker
CREATE TABLE SPEAKER (
   ID     CHAR(8) PRIMARY KEY,
   NAME   CHAR(20),
   FNAME  CHAR(20),
   SEX    CHAR(1),
   DBIRTH DATE);

-- One row per recording; SPK refers to the speaker who was recorded
CREATE TABLE RECORDING (
   ID       DECIMAL PRIMARY KEY,
   RECDATE  DATE,
   MEDIUM   CHAR(8),
   LOCATION CHAR(20),
   SPK      CHAR(8),
   FOREIGN KEY (SPK) REFERENCES SPEAKER (ID));

The data types available in standard SQL are restricted to very simple character, number, and date types of fixed length. Bit-stream data and complex data structures are not supported.

Then queries are formulated using the data manipulation language of SQL:

-- Speakers born after 27 December 1960, with the dates of their recordings
SELECT S.ID, S.NAME, S.FNAME, R.RECDATE
FROM SPEAKER S, RECORDING R
WHERE S.DBIRTH > DATE '1960-12-27' AND R.SPK = S.ID;

Besides being a database language for interactive database access, SQL has become, through its standardisation, increasingly popular as a programming language interface. External applications generate SQL code, which is then transmitted to the DBMS and evaluated there. The result relation is returned to the calling application for further processing (either as a whole or one tuple after the other with a cursor mechanism).
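The use of SQL as a programming language interface can be sketched with Python's standard sqlite3 module. SQLite is used here only as a convenient stand-in for a relational DBMS and is not mentioned in the text; the sample rows are invented. SQLite stores dates as text, so the comparison date is passed as an ISO-formatted string parameter rather than as a date literal.

```python
import sqlite3

# In-memory SQLite database as a stand-in DBMS (an assumption of this sketch).
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
CREATE TABLE SPEAKER (
   ID     CHAR(8) PRIMARY KEY,
   NAME   CHAR(20),
   FNAME  CHAR(20),
   SEX    CHAR(1),
   DBIRTH DATE);
CREATE TABLE RECORDING (
   ID       DECIMAL PRIMARY KEY,
   RECDATE  DATE,
   MEDIUM   CHAR(8),
   LOCATION CHAR(20),
   SPK      CHAR(8),
   FOREIGN KEY (SPK) REFERENCES SPEAKER (ID));
""")

# Hypothetical sample data.
conn.executemany("INSERT INTO SPEAKER VALUES (?, ?, ?, ?, ?)", [
    ("s1", "Smith", "Anna", "F", "1958-03-02"),
    ("s2", "Jones", "Tom", "M", "1965-07-19"),
])
conn.executemany("INSERT INTO RECORDING VALUES (?, ?, ?, ?, ?)", [
    (1, "1994-05-01", "DAT", "laboratory", "s1"),
    (2, "1994-05-02", "DAT", "laboratory", "s2"),
    (3, "1994-06-10", "CD", "on location", "s2"),
])
conn.commit()

# The application sends SQL text to the DBMS and reads the result relation
# one tuple after the other through a cursor.
query = """
SELECT S.ID, S.NAME, S.FNAME, R.RECDATE
FROM SPEAKER S, RECORDING R
WHERE S.DBIRTH > ? AND R.SPK = S.ID
ORDER BY R.ID
"""
rows = []
for row in conn.execute(query, ("1960-12-27",)):
    rows.append(row)
    print(row)
```

Passing the date as a parameter rather than pasting it into the SQL text keeps the query string fixed and leaves value conversion to the database interface.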

Most commercial relational DBMSs support SQL, but add their own extensions to overcome the limitations of the standard data types, e.g. with binary large objects (BLOBs) for image, audio or video data, or complex data structures for graphics objects.

Summary

The relational model is a mathematical model.
Logical data types are completely independent of physical storage.
Data is manipulated using procedural relational algebra or declarative relational calculus.
Relational DBMSs are now standard DBMS technology.
SQL is the de-facto standard relational database language, but each DBMS implementation extends the standard in a proprietary way.
Implementations: DB2, Oracle, Sybase, Ingres, and many others; available for all platforms (Mainframe, UNIX, PCs, Mac).

Object-oriented data model

The object-oriented data model aims at bridging the semantic gap between relation tables and entities of the real world through objects that directly correspond to entities. An object has a unique and immutable object identifier, and it belongs to a class. Classes are object definitions; they comprise attributes and operations over the class or the attributes. Both attributes and operations may be private to a class, i.e. visible only to objects belonging to that class (encapsulation), or public, i.e. visible to other classes. Classes are organised in class hierarchies and may inherit attributes and operations from superclasses. Operations are invoked by sending messages to an object (message passing); a message is executed if the required operation is defined for the object, otherwise it is passed on to the superclass.

In general, object-oriented DBMSs are integrated into an object-oriented programming language, giving them the full expressive power of the programming language and persistent object storage. Object-oriented DBMSs have just entered the marketplace, and they are successful in non-standard applications which require the full expressive power of programming languages, or complex and highly diverse data structures.
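The notions of object identifier, class hierarchy, and inherited operations can be sketched in an object-oriented programming language. The class names and attributes below are invented for illustration, persistent storage is deliberately left out, and Python marks private attributes by naming convention only rather than enforcing encapsulation.

```python
import itertools


class CorpusObject:
    """Root class: every object receives a unique identifier on creation."""
    _next_id = itertools.count(1)

    def __init__(self):
        self._oid = next(CorpusObject._next_id)

    @property
    def oid(self):
        # Read-only property: no setter is defined, so the identifier
        # cannot be reassigned through the public interface.
        return self._oid

    def describe(self):
        return f"{type(self).__name__} #{self.oid}"


class Signal(CorpusObject):
    """A transduced signal with a sampling rate and private sample data."""

    def __init__(self, sampling_rate_hz, samples):
        super().__init__()
        self.sampling_rate_hz = sampling_rate_hz
        self._samples = list(samples)  # private by convention

    def duration(self):
        """Duration in seconds."""
        return len(self._samples) / self.sampling_rate_hz


class LaryngographSignal(Signal):
    """Subclass that inherits all attributes and operations of Signal."""


lx = LaryngographSignal(16000, [0] * 8000)
# duration() is found in Signal and describe() in CorpusObject: the request
# is passed up the class hierarchy until a definition is found.
print(lx.describe(), lx.duration())
```

An object-oriented DBMS would add what is missing here, namely storing such objects persistently and retrieving them by identifier or by query.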

Summary

The object-oriented data model is based upon the notion of object.
An object belongs to a class, and classes are organised in a hierarchy with inheritance of properties from super- to subclasses.
Object-oriented DBMSs are now entering the marketplace.
There is not yet a standard object-oriented database language.
Implementations: O2, ObjectStore, POSTGRES, Starburst, GemStone and many others.

Deductive data model

The deductive data model is a restricted first-order predicate logic extension of the relational data model. In the deductive data model, relations are either defined extensionally through facts, or intensionally through rules. Rules may be defined recursively.

Various formal languages have been developed for the deductive data model, e.g. Datalog, a function-free sublanguage of first-order predicate logic, or Datalog¬, which includes negation.

Logic programming languages, e.g. Prolog, extend Datalog with complex data structures. Current Prolog systems feature access to external DBMS or an internal database component for the persistent storage of large amounts of data.
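The distinction between extensionally defined facts and an intensionally defined, recursive rule can be sketched with a naive bottom-up evaluation. The predicate names and the sample facts are assumptions for illustration, and the sketch handles only this one pair of rules; it is not a general Datalog engine.

```python
# Facts (extensional): derived_from(X, Y) means file X was produced from Y.
# Rules (intensional, recursive):
#   origin(X, Y) :- derived_from(X, Y).
#   origin(X, Z) :- derived_from(X, Y), origin(Y, Z).

def evaluate_origin(derived_from):
    """Apply the two rules repeatedly until no new origin facts appear."""
    origin = set(derived_from)  # first rule
    while True:
        new_facts = {(x, z)
                     for (x, y) in derived_from
                     for (y2, z) in origin
                     if y == y2}  # second rule
        if new_facts <= origin:
            return origin
        origin |= new_facts


# Hypothetical sample facts.
facts = {
    ("seg_01.sdf", "rec_16k.sdf"),
    ("rec_16k.sdf", "rec_48k.sdf"),
    ("rec_48k.sdf", "dat_tape_3"),
}
origin = evaluate_origin(facts)
print(sorted(origin))
```

The derived fact that "seg_01.sdf" originates from "dat_tape_3" is stored nowhere; it follows from the rules, which is exactly the kind of recursive relationship that relational algebra alone cannot express.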

Summary

The deductive data model is a logic based extension of the relational data model.
Data is stored as facts and rules.
Persistent logic programming languages combine the declarative logic language with persistent data storage.
Implementations: LDL, Eclipse, Prolog.

Safe storage of data

DBMSs provide safety features that prevent the loss of data due to hardware failure. The basic mechanism is that of transactions. A transaction is a sequence of data definition or manipulation commands that is considered as atomic by the DBMS; a transaction either succeeds completely, or fails and undoes all data changes (rollback).

During a transaction, the DBMS works on a copy of the data; if the transaction succeeds, the copy is written to permanent storage, otherwise it is simply discarded.
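The all-or-nothing behaviour of a transaction can be observed with Python's standard sqlite3 module, used here only as a convenient stand-in for a DBMS; the table and the rows are invented for the example. The second transaction fails half way through, and its first insertion is undone along with it.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE SPEAKER (ID CHAR(8) PRIMARY KEY, SEX CHAR(1))")
conn.execute("INSERT INTO SPEAKER VALUES ('s1', 'F')")
conn.commit()  # first transaction succeeds

try:
    # Used as a context manager, the connection commits if the block
    # succeeds and rolls back if an exception escapes from it.
    with conn:
        conn.execute("INSERT INTO SPEAKER VALUES ('s2', 'M')")
        conn.execute("INSERT INTO SPEAKER VALUES ('s1', 'M')")  # duplicate key
except sqlite3.IntegrityError:
    pass  # the whole transaction, including the row for s2, was rolled back

ids = [row[0] for row in conn.execute("SELECT ID FROM SPEAKER ORDER BY ID")]
print(ids)
```

Only the row committed by the first transaction remains, so the database never shows a state in which half of a failed transaction is visible.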

Transactions may be visible to the user. Other mechanisms to preserve the physical integrity of data are invisible:

Commit points, i.e. points in time at which the database contents are written to a permanent storage medium in a consistent way.
Redundant data storage, e.g. through a RAID (Redundant Array of Inexpensive Disks) organisation of disks, where redundant data is distributed over several hard disks to allow the reconstruction of a corrupted hard disk, or mirroring, where exact duplicates of devices are used.

Application-independent storage of data

Data held in a DBMS has a greater lifetime than the applications using the data. In fact, in most cases data will outlive the DBMS in which it is stored.

This means that data storage has to be organised independently of any application so that different applications (or generations of applications) can access the same data efficiently, and that import and export procedures have to be provided for the migration of data to new DBMSs.

In DBMSs, application independence is achieved by hiding the physical storage of data from the users. The DBMS automatically maps the storage requirements of a database schema to the file system of the operating system, creates a meta-schema which contains all information about the data stored in the DBMS and generates index files to speed up access to the data.

External applications cannot access data stored in a DBMS directly, e.g. by opening a file. On the one hand, this is a further security measure by which tampering with data is prevented, on the other hand this places the burden of determining efficient data storage organisation and optimising access to the data on the DBMS.

Controlled access to data

In general, DBMSs are multi-user systems, i.e. many users or applications may access the same data in parallel. In order to prevent the interference of user operations, access to the data is controlled by the DBMS through user-specific views and access privileges.

A user-specific view or sub-schema defines which entities and relationships are visible to the user. The entities and relationships visible to the user need not be identical to those of the global database schema.

Access privileges define what mode of access a user has to the data. Access privileges can be defined for the global database schema, the sub-schema, or individual entity sets (e.g. tuples in a relation table, objects in a class). The access mode is either read or write. In general, a user can grant his access privileges to other users.
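A user-specific view can be sketched with Python's standard sqlite3 module, used here as a stand-in DBMS. The table, the view name, and the rows are invented for the example. SQLite has no user accounts, so the access privilege itself is only indicated in a comment; in a multi-user SQL DBMS it would be given with a GRANT statement.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE SPEAKER (
   ID     CHAR(8) PRIMARY KEY,
   NAME   CHAR(20),
   SEX    CHAR(1),
   DBIRTH DATE);
INSERT INTO SPEAKER VALUES ('s1', 'Smith', 'F', '1958-03-02');
INSERT INTO SPEAKER VALUES ('s2', 'Jones', 'M', '1965-07-19');

-- Sub-schema for users who may see speaker characteristics but not
-- names or dates of birth (hypothetical view name).
CREATE VIEW SPEAKER_ANON AS SELECT ID, SEX FROM SPEAKER;
""")

# In a multi-user SQL DBMS, read access would then be granted on the view
# only, along the lines of: GRANT SELECT ON SPEAKER_ANON TO <user>
# (not executed here, since SQLite has no user management).

cursor = conn.execute("SELECT * FROM SPEAKER_ANON ORDER BY ID")
columns = [description[0] for description in cursor.description]
rows = cursor.fetchall()
print(columns, rows)
```

The view exposes fewer attributes than the global schema, which is one way in which a user's sub-schema differs from the global database schema.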

Summary

In SLP, DBMSs should be used for speech corpora of any size.

The special requirements of SLP, namely

the size of signal data,
the complex data structures for an adequate representation of data on different levels of abstraction, and
the permanence of the data

cannot be met with the hierarchical and network data models.

Even the relational data model in its pure form is not well suited for SL-corpora because it lacks support for bit-stream data and complex data structures. However, all current commercial relational DBMS implementations support binary large objects and a richer type system, which makes them candidate DBMSs for SL-corpora. The major advantage of relational DBMSs is that they are now a standard and proven technology.

Object-oriented DBMSs, especially in conjunction with object-oriented programming languages, and persistent logic programming languages provide a rich inventory of data types, full computational power, and good support for the basic DBMS functionality. They are thus well suited for SL-corpora.

The major drawback of object-oriented DBMSs is that there is not yet a common standard object-oriented database language. The major drawback of persistent logic programming languages is that there exist only a few implementations of such systems. However, the logic based formalism is basically the same as in phonology and linguistics, and thus it is possible to use a single formalism on all levels from phonetics to linguistics.

The major advantages of a DBMS can thus be summarised as follows:

Logically coherent data definition in a formal data model.
Uniform access to data in a database schema.
Stable database schema, flexible amount of data.
Interactive and application program interfaces for ad-hoc querying and inter-application communication.

DBMSs are, for instance, applied in the GRECO project (cf. Carre 1992), the Alvey STA project described in Thomas & Winski (1987), and the PhonDat-Verbmobil database of spoken German (cf. Draxler 1995).

It will be clear that only the basic characteristics of DBMSs are mentioned above. The interested reader should consult Elmasri & Navathe (1989), Ceri, Gottlob & Tanca (1990), or any other recent introductory database textbook.


WWW Administrator
Fri May 19 11:53:36 MET DST 1995