State of the Art

How MT Works

MT systems tackle translation in several ways--each has its own benefits and trade-offs

Eduard Hovy

Languages are complicated, and this makes translation a formidable challenge. Each language is unique, with elements that make it colorful and special. But that's precisely what makes languages difficult to translate, whether a human or a machine does the job.

For example, the Dani people of New Guinea have only two words to describe color: mili for dark, cool shades and mola for light, warm ones. How would you tell a Dani store owner, "I prefer the strawberry-red shirt over the pink one"? Fortunately, automated translation is feasible, because some applications (e.g., technical documents) are straightforward. As MT (machine translation) systems evolve, they will take on the challenge of general translation.

MT systems become more complex as you move from the simplest direct systems to transfer systems to the most complex interlingual systems. You can also categorize MT systems another way: those whose knowledge bases are built by humans and those that collect their rules statistically.

Direct-Translation Systems

Software that translates languages by replacing source-language words with target-language words is called a direct-translation system. Such a translation system is appropriate for applications where you translate text that has a limited vocabulary and a defined style. Direct MT systems contain correspondence lexicons, or lists of typical patterns of words and phrases in a source language and the corresponding target-language words and phrases. Depending on the size of the system's correspondence lexicon and on how cleverly the replacement patterns are defined, the resulting text is more or less readable in the target language.

Let's say you want to translate the following sentence into German: Washington announced yesterday that each home will contain an MT system by the end of the decade. Assume that the system builder hasn't defined Washington, home, or MT system in the lexicon. When the MT system translates this sentence, it places these words unaltered into the target-language sentence. The lexicon produces the German equivalent for the words and phrases that have been defined, substituting ankündigte for announced, gestern for yesterday, and bis zum Ende des Jahrzehnts for by the end of the decade. Untranslated words are bracketed by the system.

The result looks like this: Washington ankündigte gestern dass jeder home wird haben ein MT system bis zum Ende des Jahrzehnts. Although the result is horrible German, the sentence is understandable. Most German-speaking people would not have a problem rearranging the words and changing the inflection to come up with a better version: Washington kündigte gestern an, dass bis zum Ende des Jahrzehnts jeder Haushalt ein MÜ-System haben würde.
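The substitution step can be sketched in a few lines of code. This is an illustrative toy, not how any production direct system works: the correspondence lexicon holds only the entries discussed above, matching prefers the longest phrase, and unknown words pass through in brackets, as described earlier.

```python
# Toy direct translation: longest-match phrase substitution from a
# correspondence lexicon. Words absent from the lexicon pass through
# unaltered, in brackets.

LEXICON = {
    "announced": "ankündigte",
    "yesterday": "gestern",
    "that": "dass",
    "each": "jeder",
    "will": "wird",
    "contain": "haben",
    "an": "ein",
    "by the end of the decade": "bis zum Ende des Jahrzehnts",
}

def direct_translate(tokens):
    result, i = [], 0
    while i < len(tokens):
        # Try the longest phrase starting at position i first.
        for n in range(min(6, len(tokens) - i), 0, -1):
            phrase = " ".join(tokens[i:i + n])
            if phrase in LEXICON:
                result.extend(LEXICON[phrase].split())
                i += n
                break
        else:
            result.append("[" + tokens[i] + "]")  # not in the lexicon
            i += 1
    return " ".join(result)

sentence = ("Washington announced yesterday that each home will "
            "contain an MT system by the end of the decade")
print(direct_translate(sentence.split()))
```

Running this reproduces the bracketed word-for-word output discussed above; everything beyond the substitution step is left to the reader.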

Direct MT systems handle substitution from English to German relatively well, but they have a problem handling other languages. For instance, Japanese requires the addition of several preposition-like particles that indicate the role of each part of the sentence (e.g., subject, object, and cause). Speakers of some Australian Aboriginal languages combine all the parts of a sentence, including separate markers for tense and number, into one long sentence-word.

Another problem with direct MT systems is the need for massive lexicons of specific words and phrases. As the systems grow, the lexicons become more cumbersome. It's redundant, for example, to store separate entries for announced, announces, and announce. This problem plagued early direct MT systems, such as Georgetown University's system in the 1950s and the first versions of Systran from Systran Translation Systems (La Jolla, CA) in the 1960s.

Usually, system builders try to factor out commonalities by creating a root form and rules for variations, but then they must create additional routines to handle the inflection. Most direct MT systems include some analysis of word form and structure. By doing so, they take the first step toward the more sophisticated technology of transfer systems.
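Factoring out a root form might look like the following sketch. The suffix table and root list here are invented for illustration and vastly simplify real English morphology, but they show how one lexicon entry can cover announce, announces, and announced.

```python
# Sketch of factoring a lexicon into roots plus inflection rules, so
# that announce, announces, and announced share one entry. The suffix
# table and root list are toy examples, not a real morphological system.

ROOTS = {"announce", "contain", "translate"}
SUFFIXES = [("ed", "past"), ("es", "3sg"), ("ing", "prog"),
            ("s", "3sg"), ("", "base")]

def analyze(word):
    """Reduce an inflected word to (root, feature) if possible."""
    for suffix, feature in SUFFIXES:
        if suffix and not word.endswith(suffix):
            continue
        stem = word[:len(word) - len(suffix)] if suffix else word
        if stem in ROOTS:
            return stem, feature
        if stem + "e" in ROOTS:  # announce + ed -> announced
            return stem + "e", feature
    return word, "unknown"

print(analyze("announced"))  # ('announce', 'past')
print(analyze("announces"))  # ('announce', '3sg')
print(analyze("contains"))   # ('contain', '3sg')
```

A word the routine cannot reduce falls through with the feature "unknown", mirroring how a direct system leaves unrecognized words untranslated.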

Syntactic Transfer Systems

Transfer systems use software to analyze the input sentence and then apply linguistic and lexical rules, called transfer rules, to map grammatical information from one language to another. The simplest transfer rules specify only the syntactic structure of the sentence (i.e., how it's constructed of nouns, verbs, and other grammatical objects). To identify the structure of the input sentence, transfer systems use software called parsers. Although there are hundreds of different parsers and dozens of theories regarding syntax, most parsers would come up with the type of sentence analysis that you see in figure 1.

After creating a parsing tree, the system uses its transfer rules to rewrite the tree so that it obeys the syntax of the target language. In figure 1, this mapping is straightforward (it's often more complex), and it produces the output that you see in figure 2. Once the target-language tree has been built, the system's sentence generator builds the sentence, making sure the words are the appropriate tense and number.
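A drastically simplified version of such a rewrite can be sketched in code. The flat constituent list and the single rule below are toy stand-ins for the trees in figures 1 and 2; a real transfer system operates on full parse trees with many rules.

```python
# Toy syntactic transfer: the embedded clause is a flat list of
# (category, words) constituents, and one transfer rule moves the verb
# group to clause-final position, as German subordinate clauses require.

def transfer_embedded_clause(clause):
    verbs = [c for c in clause if c[0] == "V"]
    rest = [c for c in clause if c[0] != "V"]
    # The finite auxiliary ends up last in a German subordinate clause.
    return rest + verbs[::-1]

# Constituents of "each home will contain an MT system by the end of
# the decade", already word-substituted into German.
clause = [("NP", "jeder Haushalt"), ("V", "wird"), ("V", "haben"),
          ("NP", "ein MÜ-System"), ("PP", "bis zum Ende des Jahrzehnts")]

print(" ".join(words for _, words in transfer_embedded_clause(clause)))
```

The output approximates the verb-final order of the improved German translation shown earlier; getting the rest of the word order right would take further rules.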

It's difficult to delineate where direct MT systems end and where transfer systems begin. Systran, one of the most successful systems produced, began as a direct system and evolved into a syntactic transfer system. Most commercial systems and many prototype systems use transfer MT at varying depths of analysis (e.g., IBM Japan's Shalt and Mitsubishi's Meltran, both of which translate between Japanese and English).

Shallow Semantic Transfer

Although syntactic transfer works well in simple cases, you usually need deeper analysis for better translation. Shallow semantic transfer systems analyze sentences for their meaning.

Researchers in computational linguistics and AI are adept at building software representations called shallow semantic frames, which capture the main aspects of the sentence's meaning without going into too much detail. By using representation terms that are tied closely to the words of the source and target languages, system builders construct lexicons of shallow semantic frames.

In the example, after semantic analysis of the English sentence is complete, the system applies its transfer rules to rewrite the resulting frame into one suitable for German. The rules specify appropriate substitutions, operating on frame items instead of grammatical classes. Even though the resulting German-oriented frame looks similar to the English one, in general, shallow semantic frames differ between languages.

System developers also build programs called analyzers, which identify the appropriate representation terms for each word or phrase and assemble the terms into a coherent structure. The analyzer often contains a parser, and sometimes it uses the parser as a front end.

Next, the system must generate natural language from the computer's internal representation. Semantic transfer demands a more sophisticated generator: It has to find appropriate target-language words for each semantic frame it encounters.

In simple systems, each frame has just one lexical item, but in more complex systems, a semantic primitive may be linked to several alternative lexical items. For example, announce-act may be expressed with the word state, proclaim, or say. To make a decision in such cases, the generator requires more information on the formality of the text, the relationship between the author and audience, and the style.
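The generator's choice among alternatives can be sketched as a lookup keyed by register. The alternatives table and formality labels below are invented for illustration; a real generator would consult a richer model of style and audience.

```python
# Sketch of a generator choosing among lexical alternatives for one
# semantic primitive. The register labels are illustrative stand-ins
# for a real model of text formality and style.

ALTERNATIVES = {
    "announce-act": [("proclaim", "formal"),
                     ("state", "neutral"),
                     ("say", "informal")],
}

def realize(primitive, formality="neutral"):
    for word, register in ALTERNATIVES.get(primitive, []):
        if register == formality:
            return word
    return primitive  # no alternative fits: fall back to the primitive

print(realize("announce-act", "formal"))    # proclaim
print(realize("announce-act", "informal"))  # say
```

In a simple one-word-per-frame system, the table would hold a single entry per primitive and the formality argument would be ignored.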

Many recent systems perform semantic transfer and are implemented on single or multiple workstations, although the goal in many cases is to eventually shift them to personal computers. These systems include Logos (Logos, Mt. Arlington, NJ), Metal (Siemens Nixdorf, Munich, Germany), and Astransac (Toshiba, Kawasaki, Japan). Astransac operates on Toshiba's personal workstations, and Metal operates on multiuser workstations, using a Lisp environment server in the background.

In recent years, groups at the University of Kyoto (Japan) and the University of Manchester (U.K.) have started developing example-based MT systems. This software blends the two approaches, using the direct approach for stereotypical phrases (e.g., greetings) and a variant of the transfer approach in other cases.

Semantic Representation

Constructing valid representations of meaning is difficult--terms can't be combined willy-nilly. Each term must be chosen with care, because it limits what other terms can be used. A wrong choice can mean that an aspect of the sentence can't be represented. In this case, the sentence would have to be analyzed again.

Consider the clause "each home will have an MT system." In what sense will each home have an MT system--the way one has an arm, a child, an idea, or a car? Languages such as Hungarian, which make more delicate distinctions than English and have different words for these senses of have, need more precise information.

In a simplified shallow semantic notation, the earlier sample sentence would be represented as follows:

announce-act
  actor: Washington
  act-time: yesterday
  announcement: possess-state
    actor: home
      quant-determination: every
    object: MT system
      quant-determination: single
    state-time: decade
      meas-determination: end

Notice how the verb have in the sample sentence is represented as possess-state. The semantic lexicon entry for have contains a pointer to several possible meanings, including possess-state, think-act (i.e., the act of having an idea), and parent-state (i.e., the act of having a child). Each of these meanings is a frame with empty slots labeled actor, object, and time.

Each slot contains information that specifies the kinds of frames that can fill the slot. For example, the object of a possess-state must be a nonhuman physical object. Thus, neither an idea, which is defined as a nonphysical object, nor a child, which is a human, can be objects of possess-state. Conversely, the verb have must be represented as possess-state when it is applied to a physical object. Although the matter is not as simple as I have described it here (see the text box "The Five Layers of Ambiguity" on page 174), most analysis programs work along these lines.
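The slot-constraint check described above can be sketched as follows. The type labels and frame table are illustrative stand-ins for a real semantic lexicon, but the selection logic mirrors the possess-state/think-act/parent-state distinction in the text.

```python
# Sketch of sense selection for "have" via slot constraints: each
# candidate frame accepts only objects of a certain semantic type.
# The type assignments below are toy examples.

TYPES = {                       # object -> semantic type
    "MT system": "physical-object",
    "car": "physical-object",
    "idea": "nonphysical-object",
    "child": "human",
}

HAVE_SENSES = [                 # (frame, type its object slot accepts)
    ("possess-state", "physical-object"),
    ("think-act", "nonphysical-object"),
    ("parent-state", "human"),
]

def sense_of_have(obj):
    obj_type = TYPES[obj]
    for frame, accepted in HAVE_SENSES:
        if obj_type == accepted:
            return frame
    raise ValueError("no frame accepts " + obj)

print(sense_of_have("MT system"))  # possess-state
print(sense_of_have("idea"))       # think-act
print(sense_of_have("child"))      # parent-state
```

Because an MT system is typed as a nonhuman physical object, only possess-state accepts it, which is exactly how the analyzer rules out the "having an idea" and "having a child" readings.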

As analysis of the source language goes deeper, it becomes more semantic, but generation back to the surface becomes more difficult. Fortunately, at least one function becomes easier for the system to perform: The amount of transfer required by the transfer rules decreases--the end point of analysis approximates the starting point of generation. This diminishing distance is illustrated in figure 3, which shows an arrangement of MT approaches along a gradient of increasing depth of analysis. More and more, the internal representations approach the ultimate goal: deep semantics, or language neutrality.

Interlingual Systems

Interlingual systems are software programs that translate text using a central data-representation notation called an interlingua. These systems have been experimental prototypes, usually written in Lisp and run on workstations. The East Asian consortium CICC (Center of the International Cooperation for Computerization) has been building an interlingual system since 1987, working with groups in Japan, China, Thailand, Malaysia, and Indonesia.

In the U.S., recently developed interlingual systems include KBMT at Carnegie Mellon University (Pittsburgh, PA), Ultra at New Mexico State University (Las Cruces, NM), and Pangloss, which is being jointly developed by the aforementioned universities and the University of Southern California's Information Sciences Institute (Marina Del Rey, CA). CICC and Pangloss systems run on multiple chained workstations. Systems like these should be available on personal computers by the end of this decade.

Constructing a set of transfer rules is tedious, but constructing an interlingua powerful enough to represent all the information every language may require, with the appropriate analysis and generation rules, is much more difficult. Consider the sample sentence again. Humans immediately realize that the word Washington is shorthand for "an announcer for the U.S. government" (this type of shorthand is called metonymy). A deep semantic representation must explicitly reflect this fact, because some languages may use different verbs for announce, depending on whether the announcer is a person or an official notice or bulletin.

Interlinguas must handle many phenomena similar to metonymy. In the example, the past tense reflected in the word announced means that the announcement took place before the author wrote the text. In general, in an interlingua, what is expressed by verb tense in English becomes a complex theory of temporal relationships involving the event, the text production, and the perspective the author takes on the event. Similarly, the meanings of words such as this, that, all, every, the, and a become a theory of how definite, unique, and near the object is to the author.

The meanings of these microtheories must be worked out and represented, and the values for the meanings must be associated with the words of the languages that express them. Although no analyzer has yet incorporated many microtheories of phenomena, most semantic analyzers contain rudimentary versions of them for common phenomena (e.g., time and focus).

The same increased complexity carries over to the generator. Given the semantic nature of an interlingual representation, a sufficiently powerful generator can usually produce several paraphrases of the input sentence.

Interlingual vs. Transfer

A debate is raging over which is the better approach--interlingual systems or transfer systems. Interlingual systems are criticized because they require more detailed analysis than is necessary for any language pair. Why bother resolving the "Washington" metonymy when English, German, and French all say it the same way? Depending on the flexibility of the interlingua itself, however, this complaint is not always valid. In most systems, the analyzer is free to fill in less detail, and as long as the generator doesn't fail, the translation proceeds satisfactorily.

A major advantage that interlingual systems have over transfer systems lies in the number of rules used. A three-language transfer system, for example, requires transfer rules between languages A and B, A and C, and B and C in both directions--six sets of rules altogether. Adding another language involves creating rules to and from the new language to each of the first three languages, bringing the total to 12 sets of rules. In general, for n languages, you need n (n-1) sets of rules. That number can grow prohibitively large, because modern commercial MT systems have lexicons of over 150,000 words and grammars of several hundred rules.

Interlingual systems deal with this challenge by creating one representation midway between all the languages. Analyzers and generators have to work harder to reach this interlingual representation, but less human effort is required to construct the transfer rules.

Adding a new language to an interlingual system involves just two sets of rules: the analyzer's rules for going from sentences in the new language to interlingual representations and the generator's rules for going from interlingual representations to sentences in the new language. Because of the language neutrality of the interlingua, the analyzer does not require information about what the target language will be, nor does the generator have to have information about what the source language was.
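The rule-set arithmetic is easy to check. The two formulas below restate the counts given above: every ordered pair of distinct languages for a transfer system, versus one analyzer and one generator per language for an interlingual system.

```python
# Rule-set counts for n languages: a transfer system needs a set of
# rules for every ordered pair of distinct languages, while an
# interlingual system needs one analyzer and one generator per language.

def transfer_rule_sets(n):
    return n * (n - 1)

def interlingual_rule_sets(n):
    return 2 * n

for n in (3, 4, 10):
    print(n, transfer_rule_sets(n), interlingual_rule_sets(n))
```

At three languages the approaches tie at six rule sets each; at four it is 12 versus 8, and at ten languages 90 versus 20. The transfer count grows quadratically, the interlingual count only linearly.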

Statistical Systems

Whether the MT system is interlingual or transfer, it requires large lexicons and rule sets to ensure that it will be robust when handling new text. In direct MT, you can encode information in dictionaries, phrases, and words; in transfer systems, you can place the data in grammars, lexicons, and transfer rules; in interlingual systems, you can locate the information in representation ontologies and lexicons.

The work required to create these resources by hand is daunting. As a result, in the past few years, there has been a resurgence of attempts to acquire the requisite information automatically.

What makes statistical systems different isn't so much how they perform translation but how their correspondence lexicons are constructed--automatically, not by humans. For a computer to create a correspondence lexicon, there must be two duplicate sets of a large amount of text--one set in each language. One such body of parallel text is the parliamentary record of the Canadian government. It contains several years' worth of representatives' debates in both English and French. Approximately 3 million sentences of this text are on-line.

Given such a mass of information, you can build programs to line up, as accurately as possible, each word in each sentence with its foreign counterpart. Recent alignment algorithms achieve over 90 percent accuracy. The result of such an alignment can be thought of as a bilingual correspondence lexicon of words and phrases.

Once the alignment algorithm has constructed the correspondence lexicon for the system, translation is effected by direct substitution, followed by a process of reordering the words to achieve good grammar. The reordering is performed using statistically derived rules regarding the probable order of words in given contexts.
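A toy version of lexicon acquisition from parallel text can be sketched by counting co-occurrences across aligned sentence pairs and picking each word's most frequent partner. The three sentence pairs are invented, and real alignment models (such as those behind statistical systems) are far more elaborate, but the principle is the same.

```python
# Toy bilingual lexicon acquisition: count how often each English word
# co-occurs with each French word in aligned sentence pairs, then take
# the most frequent partner as the translation candidate.

from collections import Counter, defaultdict

parallel = [
    ("the house", "la maison"),
    ("the house is blue", "la maison est bleue"),
    ("my house", "ma maison"),
]

cooc = defaultdict(Counter)
for english, french in parallel:
    for e in english.split():
        for f in french.split():
            cooc[e][f] += 1

def best_match(word):
    return cooc[word].most_common(1)[0][0]

print(best_match("house"))  # maison
```

Because "house" appears with "maison" in all three pairs but with "la" in only two, raw counts already pick the right partner here; on real corpora, more careful statistics are needed to keep common function words from dominating.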

Breeding Hybrids for Strength

The various approaches to MT theory have crystallized into a smooth continuum from direct systems to interlingual systems. The strengths and shortcomings of each approach are well understood. It's unlikely that new developments will uncover a magic formula that will make MT easy.

Recent developments in the Candide system from IBM (Hawthorne, NY) show that it's possible to create hybrid statistically based transfer systems. In such hybrid systems, the correspondence lexicon contains not only words and phrases but syntactic terms that represent time, number, and part of speech.

Statistically based systems require several mainframes to build and store their correspondence lexicons. They also need millions of sentences of parallel text containing the words to be translated. Given their capabilities and resources, however, hybrid systems are one way to minimize the human effort required in lexicon, grammar, and rule construction.

In the future, all major MT systems will be hybrids of one kind or another. Statistical lexicons and rule acquisition will provide the raw material that will be incorporated into MT systems, using increasingly powerful interlingual theories of meaning, pragmatics, and style.

Figure 1: Part of the syntactic structure produced by the parser for the sample sentence. In the tree, all nonleaf nodes are syntactic categories.

Figure 2: Part of the syntactic structure produced by the transfer rules from the tree in figure 1. Notice the movement of the embedded verb to the end of the sentence.

Figure 3: The approach taken in an MT system is correlated with the amount of analysis performed on the input sentences before they are translated. Direct systems perform little analysis, while interlingual systems construct complex internal representations of the meaning of the input.

Four MT Approaches

-- direct translation
-- syntactic transfer
-- statistical
-- interlingual

Eduard Hovy is a project leader at the Information Sciences Institute of the University of Southern California. You can reach him on BIX c/o "editors" or on the Internet at hovy@isi.edu.

