The Multilingual Edge


Machine translation of human languages makes sense if you deal with large volumes of documents written in other languages

Peter M. Benton

In the past few years, the world has shrunk into a village, and age-old barriers to communication have fallen. The Iron Curtain has been dismantled, Germany unified, the Channel Tunnel connecting England and France built, and the European Economic Community born.

Glasnost, or openness, the watchword of the new Soviet order, is founded on improved communication--person to person, person to institution, business to business. With rapidly accelerating globalization, the economic necessity for people to do business in dozens of languages has spurred the demand for fast, accurate, and easy-to-use translation systems.

To conduct business globally, all types of documents must travel across the boundaries of countries. Many of these are good candidates for machine translation.

Language translation, natural or automated, meets two complementary needs: telling others what you have to offer (information dissemination) and keeping track of the outside world (information acquisition). Although for the foreseeable future people will still play an essential role in translation, machine translation has the potential to improve productivity and consistency dramatically.

A sampling of some business-oriented information dissemination includes sales and advertising literature, product operations instructions and service procedure data, and technical and academic literature. Many dissemination applications also have an information acquisition side--for example, daily correspondence; economic, commercial, and military news; and business and personal conversation.

Information dissemination and acquisition require different capabilities from a translation system. Dissemination, which is the more common application, requires smooth interaction with a publishing system. Acquisition, such as tracking technical advances and news, requires the ability to communicate with a variety of input devices. Because computers and other appliances have become more and more our partners, it has become vital that they communicate with us in our own natural languages.

Most of the commercial-grade automated translation systems started out on mainframes. But they are now, or soon will be, available or accessible from workstations such as the Sun SPARC workstation and server, as well as from many 386 and 486 platforms (see the photo). Workstation availability makes it easy to tie the dissemination and acquisition aspects of translation into publishing and input devices.

The Translation Process

A basic translation system consists of a workstation, translation software, and a substantial electronic bilingual dictionary. The software may be written in a variety of computer languages--C, Lisp, FORTRAN, or PL/I--depending on the system's history. The dictionary contains tens of thousands of words coded to show what parts of speech they represent and the semantic categories they occupy.
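As a rough illustration of what "coded" dictionary entries might look like, here is a minimal Python sketch. The field names, the `lookup` helper, and the sample English-to-Spanish entries are assumptions made for this example; they are not taken from any particular commercial system.

```python
from dataclasses import dataclass, field

@dataclass
class DictionaryEntry:
    """One record in a hypothetical bilingual dictionary."""
    source: str            # source-language word
    target: str            # target-language equivalent
    part_of_speech: str    # e.g. "noun", "verb", "adjective"
    semantic_categories: set = field(default_factory=set)  # e.g. {"machine-part"}

# Illustrative entries only; a real dictionary holds tens of thousands.
ENTRIES = [
    DictionaryEntry("valve", "válvula", "noun", {"machine-part"}),
    DictionaryEntry("open", "abrir", "verb", {"action"}),
    DictionaryEntry("open", "abierto", "adjective", {"state"}),
]

def lookup(word, part_of_speech):
    """Return every entry matching a word and its part of speech."""
    return [e for e in ENTRIES
            if e.source == word and e.part_of_speech == part_of_speech]
```

Keying the lookup on part of speech as well as spelling is what lets the same source word ("open") map to different target words depending on how the analysis stage has classified it.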

The translation systems used in information dissemination are integrated with or can exchange files with publishing systems such as those from Interleaf. The translation systems that are used for information acquisition are integrated with optical character recognition (OCR) scanners and other input devices.

Not every type of text is a suitable candidate for automated translation. Poor candidates include turgid technical and academic writing, transcripts of spoken conversation, advertisements, and creative literature. However, for the right types of texts, users typically experience several benefits. Automation reduces overall document translation time because raw translation is faster. Terminology in the target copy is more consistent because the machine refers to the database rather than to human experience. And composition costs in the target language decrease substantially when markup coding is used (more on that later).

Despite advances in the state of the art, no black box exists that reliably translates typical human language in a completely unattended manner. For that matter, it is rare to have a document professionally translated by only one person. Generally, in information dissemination, a translator does the bulk of the work, and a post-editor checks and polishes the text of the finished document.

In information acquisition, translators often use a two-tiered approach. An initial rough translation of a page or two is prepared and reviewed by a subject matter expert. If the text appears useful, the document is then translated and post-edited.

Whether performed by a person or a machine, though, translation undergoes a five-stage process: input, analysis, transfer, synthesis, and output. Depending on the kind of translation performed, automation improves the process in several areas.
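The five stages can be pictured as a chain of functions, each handing its result to the next. The sketch below is deliberately a toy: every function body is a stand-in (word-for-word substitution instead of real parsing and transfer), and the names and the four-word lexicon are assumptions chosen for illustration.

```python
def input_stage(raw):
    """Input: get the raw copy into a clean text form."""
    return raw.strip()

def analysis(text):
    """Analysis: stand-in for parsing; just split into lowercase words."""
    return text.rstrip(".").lower().split()

def transfer(tokens):
    """Transfer: stand-in for reordering; this toy keeps the source order."""
    return list(tokens)

def synthesis(tokens, lexicon):
    """Synthesis: replace each token with a target word when one is known."""
    return [lexicon.get(token, token) for token in tokens]

def output_stage(words):
    """Output: assemble the target words into a sentence."""
    return " ".join(words).capitalize() + "."

def translate(raw, lexicon):
    """Run the five stages in order."""
    return output_stage(synthesis(transfer(analysis(input_stage(raw))), lexicon))

# Hypothetical English-to-Spanish lexicon for one sentence.
LEXICON = {"the": "la", "valve": "válvula", "is": "está", "open": "abierta"}
```

Calling `translate("The valve is open.", LEXICON)` walks one sentence through all five stages. A production system would replace each stand-in with far more elaborate machinery, but the stage boundaries stay the same.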

Uno: Input

Input involves getting the raw copy into the appropriate form for processing. In natural translation, human translators read the copy and translate it sentence by sentence, and simultaneous interpreters listen to the spoken word and translate it thought by thought. Today, for machine translation, computers must be spoon-fed the copy in a digital form as ASCII text (although this limitation is giving way to technology such as OCR and voice recognition).

Another facet of the input stage is the collection and organization of the appropriate terminology. A translator of molecular biology articles, for instance, likely will need to gather technical literature in the target language to see how others in the field spell and use specialized terms. New terminology and new meanings for existing terminology are growing far more rapidly than new editions of dictionaries are published. Consequently, expanding the terminology database is the most important maintenance task a user of an automated language-translation system performs.

To translate properly, the system must have terminology in both the source and the target languages. It also must have the rules for applying the terminology correctly in the analysis and the synthesis stages.

With an automated language translation system, the user needs to add a new term or meaning (with associated rules) only once. By contrast, in human translation, that term has to be researched perhaps scores of times by individual translators working on different documents at different times.

The input stage for automated translation can be easy or difficult (i.e., inexpensive or expensive) depending on the form of the data to be translated. For accuracy, it's best to start with word processing or ASCII files instead of paper. Transferring language from hard copy into electronic form is costly and error prone.

In information dissemination applications, the translated copy often will be republished in the target language. Thus, carrying structural or typographic attributes from the source may be desirable so that markers for heads and subheads, table rows and columns, numbered lists, italics, underscores, and other emphasis marks can be reused in the target language.

Commercial-grade translation systems have table-driven tools that recognize markup codes and record their positions in the linear text so that they can be regenerated after transfer. Special rules are needed to handle markup that applies to a phrase, because the word order may change or the phrase may be broken up entirely in the translation process (see below).
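A minimal sketch of the record-and-regenerate idea follows. The angle-bracket tag syntax, the function names, and the choice to record positions as word indexes are all assumptions for this example; real systems use their own table-driven markup definitions.

```python
import re

# Hypothetical markup syntax: simple tags such as <b> ... </b>.
TAG = re.compile(r"</?[a-z]+>")
TOKEN = re.compile(r"</?[a-z]+>|[^\s<]+")

def strip_markup(text):
    """Separate plain words from markup, recording where each code sat."""
    words, codes = [], []
    for token in TOKEN.findall(text):
        if TAG.fullmatch(token):
            # The code belongs immediately before the word at this index.
            codes.append((len(words), token))
        else:
            words.append(token)
    return words, codes

def restore_markup(words, codes):
    """Reinsert the recorded codes at their word positions."""
    out, pending = [], list(codes)
    for index, word in enumerate(words):
        while pending and pending[0][0] == index:
            out.append(pending.pop(0)[1])
        out.append(word)
    out.extend(code for _, code in pending)  # codes that followed the last word
    return out
```

This version restores codes correctly only when the translated text keeps the same word count and order. Once transfer reorders or splits a marked phrase, the recorded indexes no longer line up, which is why extra rules are needed in practice.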

Deux: Analysis

Analysis consists of parsing (simple or sophisticated) and, in some systems, semantic disambiguation to clarify the syntax of a sentence. The disambiguation process decides what is meant when multiple interpretations are possible. Automated translation systems record only enough of the semantics to reduce the chances of getting a wrong parse.

Simple parsing yields a grammatical representation of a sentence (see figure 1a). On a treelike diagram showing the grammatical relationships, every word is positioned according to its part of speech and its relationship to other words in the sentence.

Sophisticated parsing yields all possible representations of the sentence's syntax (see figure 1b). Beyond that, a more elaborate parse can identify the role of subjects and objects in the sentence, and describe their actions and attributes. The result of parsing and disambiguation, the coded "interlingua representation," is a series of complex records--typically one record per original input sentence (see figure 1c).

You can think of the interlingua as the essence of the sentence in a logical structure. Attributes of the interlingua are stored in a standardized form. Research has been under way for many years now on the development of systems that possess natural language understanding. Such systems can identify the role of subjects and objects in the sentence and describe their actions and attributes. If these research efforts are successful, natural language understanding capabilities will be added to translation systems.

A system intended to work with many languages must have a rich interlingua representation so that it can record all the classes of distinctions used in any of the languages. A great deal of variation exists in the data structures that different translation systems use. Each data structure reflects the linguistic expertise of the system's architect, as well as the intended use of the system. For instance, the Distributed Language Translation system being developed in the Netherlands (see the text box "Translation Technology Alternatives") uses Esperanto as its interlingua. Esperanto was created in the late 1800s as an international auxiliary language.

Drei: Transfer

During the transfer process, systematic changes are made to the interlingua representation so that it can be used to generate copy. In essence, the transfer process moves markers for all linguistic characteristics to the new positions needed for the next stage, synthesis.

In human beings, the transfer process is automatic and hidden from view. But in computers, the interlingua representation is highly formalized and does not resemble linear copy. The system needs one transfer algorithm for every target language. Transfer algorithms are tightly integrated with the interlingua and play an important role in handling complex sentences accurately.

The system performs many operations during the transfer process, among them the conversion of the treelike representation into a linear series of tokens (e.g., verbs, nouns, pronouns, and adjectives). The token sequence reflects the appropriate ordering of sentence parts (e.g., subject, verb, and object) in the target language; for example, verbs appear in different characteristic positions depending on the language.
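The conversion from a tree to an ordered token sequence can be sketched with a small lookup table of constituent orders. The dictionary-based clause, the token labels such as `N:operator`, and the function name are assumptions for illustration; only the three word orders themselves (English subject-verb-object, Japanese subject-object-verb, Welsh verb-subject-object) are standard facts about those languages.

```python
# Basic constituent order per target language.
WORD_ORDER = {
    "english": ("subject", "verb", "object"),
    "japanese": ("subject", "object", "verb"),
    "welsh": ("verb", "subject", "object"),
}

def linearize(clause, language):
    """Flatten a parsed clause into tokens in the target language's order."""
    tokens = []
    for role in WORD_ORDER[language]:
        tokens.extend(clause.get(role, []))  # a missing role contributes nothing
    return tokens

# Hypothetical parsed clause: each role holds a list of abstract tokens.
clause = {
    "subject": ["N:operator"],
    "verb": ["V:close"],
    "object": ["N:valve"],
}
```

The same clause yields a different token sequence for each target language, which is one reason a separate transfer algorithm is needed per language.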

Selection of the appropriate substructures for clauses and phrases also occurs during the transfer process. Acceptable clause and phrase structures differ markedly; for example, Spanish compound-noun clauses tend to include many linking words, while German noun clauses simply string nouns together.

Associating linguistic markers for tense, number, aspect, gender, and so forth with the tokens is another transfer operation. For example, languages that use gender and plurality (e.g., French) reflect these characteristics in nouns, pronouns, prepositions, and verbs. The French word for "the" can be "le," "la," or "les," depending on the gender and number of the related noun.
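The article example reduces to a small function once gender and number markers are attached to the noun token. This sketch uses assumed marker names ("masculine", "plural", and so on) and ignores elision (the form l' before a vowel sound), so it is a simplification rather than a full rule.

```python
def french_definite_article(gender, number):
    """Choose le, la, or les from the noun's gender and number markers."""
    if number == "plural":
        return "les"          # the plural form is the same for both genders
    if gender == "masculine":
        return "le"
    if gender == "feminine":
        return "la"
    raise ValueError(f"unknown gender marker: {gender!r}")
```

For instance, `french_definite_article("feminine", "singular")` gives the article for a noun such as "maison," while any plural noun takes "les."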

Four: Synthesis

In the synthesis stage of machine translation, the ordered sequence of linguistic tokens is converted into language. The result of synthesis is sentences (perhaps with typographic markup) in the target language.

Once again, in humans, the synthesis process is automatic and hidden from view. In computers, however, much of synthesis is simple lookup and replacement, while other parts of the process are more elaborate.

Synthesis of prepositions (e.g., at, in, on, and by) and pronouns (e.g., this, that, who, what, I, and you) is straightforward. Strings of tokens can be immediately replaced by words. Synthesis of nouns and verbs, though, often requires intelligent selection among a range of candidates, and the choice depends on the appropriate jargon for that translation subject area.

Many problems can arise during synthesis. Synthesis is especially difficult in fields such as law, where underlying philosophies may vary substantially from country to country. But problems can also occur in other fields. For example, in an automatic translation of a medical text from English to Spanish, the English word "nostril" was translated into a Spanish word equivalent to the English word "vent" instead of the phrase "orificios de la nariz." Since the lexical database did not contain "orificios de la nariz," it used a "next best" term, which was suitable for inanimate objects only.
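One way to picture how such a "next best" choice comes about is a candidate list in which each term carries a semantic restriction. In the sketch below, the function name, the category labels, and the use of "respiradero" as a stand-in for the unnamed "vent" word are all assumptions; the source does not say which Spanish word the system actually produced.

```python
def choose_term(candidates, category):
    """Pick the candidate whose restriction fits; otherwise fall back and flag.

    candidates: list of (term, set_of_categories) pairs, in preference order.
    Returns (term, needs_review).
    """
    for term, categories in candidates:
        if category in categories:
            return term, False
    # No candidate fits: use the first one and flag it for the post-editor.
    return candidates[0][0], True

# Hypothetical lexical database entries for the English word "nostril".
COMPLETE = [
    ("respiradero", {"inanimate"}),
    ("orificios de la nariz", {"anatomy"}),
]
INCOMPLETE = COMPLETE[:1]   # the anatomical phrase is missing
```

With the complete list, an anatomical context selects the right phrase. With the incomplete one, the function still returns something, but the `needs_review` flag marks it as a fallback, which is the kind of signal a post-editor would want to see.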

Cinque: Output

Human-performed translation yields written documents (often produced using desktop publishing) or a spoken utterance. In automated translation, however, the user typically has several choices of output, depending on the application for the translation. Some systems provide an editing environment, and others produce files for subsequent editing in a word processor.

Usually, automated translation output is formatted as side-by-side reports, single language reports, or word processing files. The user also has the option of seeing error flags.

A side-by-side report or word processing file would show the source copy on one side and the target copy on the other. If present, error flags would be shown on the line where the error was detected. With help from the side-by-side report, the user can check and adjust the translation. Error codes draw the eye to areas needing attention. In production translation operations, the principal users of the side-by-side report or word processing file are post-editors, who polish the translation into final form. Secondary users are terminology experts, who examine what the post-editors have done and adjust the terminology database accordingly.
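A side-by-side report with error flags can be sketched in a few lines. The column layout, the bracketed flag style, and the flag text "UNKNOWN-TERM" are assumptions for this example, not the format of any real product.

```python
def side_by_side(rows, width=30):
    """Format (source, target, error) rows as aligned two-column lines.

    error is None when nothing was detected; otherwise a short code that is
    appended to the line where the problem occurred.
    """
    lines = []
    for source, target, error in rows:
        flag = f"  [{error}]" if error else ""
        lines.append(f"{source:<{width}} | {target}{flag}")
    return lines

# Hypothetical report rows.
ROWS = [
    ("Open the valve.", "Abra la válvula.", None),
    ("Check the nostril.", "Check the nostril.", "UNKNOWN-TERM"),
]
```

Printing `"\n".join(side_by_side(ROWS))` puts each source sentence next to its translation, with the flag at the end of the affected line so that it draws the post-editor's eye.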

Special Features

With some automated translation systems, users can choose special features that are useful in particular applications. These features include repetitions processing, microglossaries, and stylistic personalization.

Repetitions processing saves time in translation of product documentation. Documentation for equipment, systems, and software usually changes in only minor ways when new versions of the product come out. Repetitions processing works by keeping a database of all sentences processed by the system and retranslating only new and changed sentences.
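In outline, repetitions processing is a sentence-level cache placed in front of the translator. The class below is a minimal sketch under that assumption: the class name, the in-memory dictionary standing in for the database, and the `fresh` counter are all illustrative choices.

```python
class RepetitionStore:
    """Remember translated sentences so unchanged ones are never redone."""

    def __init__(self, translate):
        self._translate = translate   # callable: source sentence -> translation
        self._seen = {}               # stands in for the sentence database
        self.fresh = 0                # how many sentences were really translated

    def translate_document(self, sentences):
        results = []
        for sentence in sentences:
            if sentence not in self._seen:
                # New or changed sentence: translate it and remember the result.
                self._seen[sentence] = self._translate(sentence)
                self.fresh += 1
            results.append(self._seen[sentence])
        return results
```

Feeding a revised manual through the same store translates only the sentences that differ from earlier versions; the `fresh` counter shows how much work was actually done. Matching here is by exact text, so even a one-character change counts as a new sentence.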

Most commercial-grade translation systems have a microglossary feature. The user specifies the subject area the source text comes from, and the microglossary contains equivalent terms that apply to that field. This capability makes it possible to translate text without large knowledge bases.
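A microglossary can be pictured as a small field-specific table layered over the general dictionary. The sketch below uses Python's standard `collections.ChainMap` for the layering; the subject names and the handful of English-to-Spanish terms are assumptions chosen for illustration.

```python
from collections import ChainMap

# General-purpose equivalents (illustrative).
GENERAL = {"nut": "nuez", "key": "llave", "water": "agua"}

# Field-specific overrides, one small table per subject area (illustrative).
MICROGLOSSARIES = {
    "mechanical": {"nut": "tuerca"},
    "computing": {"key": "tecla"},
}

def glossary_for(subject):
    """Return a lookup that prefers the subject's terms over general ones."""
    return ChainMap(MICROGLOSSARIES.get(subject, {}), GENERAL)
```

With `glossary_for("mechanical")`, "nut" resolves to the hardware term while "key" still falls through to the general dictionary. An unknown subject simply gets the general terms.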

The next step beyond a microglossary is stylistic personalization. With this feature, the system watches what the post-editor does and infers rules for use in future translations. Stylistic personalization is useful when large volumes of material are being translated for a specific purpose, such as operations instructions or loan contracts.

Pursuing Machine Translation

You should consider implementing machine translation in your company if any of the following applies: your business plan is going global and your products require lots of documentation, the documentation needs to be translated more rapidly and consistently, or you need to track new developments in other countries to stay competitive.

Fortunately, because of the increasing availability of translation systems on personal computers and workstations, you can run a test project and monitor results in a controlled environment. Thus, on a small scale, you can assess whether machine translation can help you meet these needs.

More than with other applications, though, you need to test the systems thoroughly. To test a system, use the following procedure: (1) run representative samples of the materials to be translated through the system; (2) give the translations to experienced post-editors; and (3) analyze the overall final results to see if the system meets your particular needs.

Adopting machine translation requires a great deal of learning and dedication. Work flow, job descriptions, and habits must change. The fact that raw translation is now performed by a machine, rather than people, transforms the fundamental nature of the process.
