Maryland Research, University of Maryland
Spring 2002, Vol. 11, No. 2

Bridging the Language Divide

"Traduttore, traditore," runs the old Italian aphorism, referring to the perils of translation. But a team of University of Maryland scientists is attempting to disprove it. Their computer translations into English from Chinese, French, Spanish and Arabic are intended to transfer information between languages without losing original meaning.

Researchers from the Computational Linguistics and Information Processing Lab, or CLIP Lab, are expanding efforts in two directions: building search engine techniques that find relevant documents in multiple languages, and finding ways to provide users with sensible translations of those documents.

Faculty collaborators on the project include Bonnie Dorr, associate professor of computer science and Presidential Faculty Fellow; Amy Weinberg, associate professor of linguistics; Philip Resnik, associate professor of linguistics; and Douglas Oard, assistant professor in the College of Information Studies. Working together in the University of Maryland Institute for Advanced Computer Studies, or UMIACS, these scholars envision a world where the typical user can query a Web search engine in his or her own language and get back the needed information regardless of a Web page's language of origin.

To do a good job of cross-language search or automatic translation, the computer needs a good model of the relationships between languages. Resnik has been focusing on techniques that automatically learn such models by analyzing large quantities of text in parallel translation, known as "parallel corpora." The problem is that such parallel corpora are hard to find. They exist in such domains as Canadian parliamentary proceedings, where both English and French are official languages, and the proceedings of the United Nations. "But it's a very skewed sample," laments Resnik.

For example, in the early days "one system decided that the translation of the English word 'hear' in French was 'bravo,' because that word was so frequently translated by the phrase 'Hear! Hear!'" To help solve this problem, Resnik has developed a software tool "that goes out onto the Web and automatically finds pairs of pages that are translations of each other." He reports that the technique has a 90 to 100 percent accuracy rate. He has also adapted the software so that it efficiently locates translations on the Internet Archive (www.archive.org), an enormous centralized repository of Web pages. The archive has proved very fruitful for supplying parallel corpora, including thousands of translated documents in English and French, English and Chinese, and English and Arabic.
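The "hear"/"bravo" mistake is easy to reproduce on a small scale. The sketch below is a toy illustration, not the system Resnik describes: it assumes a simple co-occurrence statistic (the Dice coefficient) and four invented sentence pairs, and the names `learn_lexicon` and `PARLIAMENT` are hypothetical.

```python
from collections import Counter, defaultdict


def learn_lexicon(sentence_pairs):
    """Guess one translation per source word from sentence-level co-occurrence.

    Assumption: words are scored with the Dice coefficient, a simple
    stand-in for the statistical models used in real systems.
    """
    src_count = Counter()              # sentences containing each source word
    tgt_count = Counter()              # sentences containing each target word
    pair_count = defaultdict(Counter)  # how often the two words co-occur

    for src, tgt in sentence_pairs:
        src_words = set(src.lower().split())
        tgt_words = set(tgt.lower().split())
        src_count.update(src_words)
        tgt_count.update(tgt_words)
        for s in src_words:
            pair_count[s].update(tgt_words)

    lexicon = {}
    for s, counts in pair_count.items():
        def dice(t):
            return 2 * counts[t] / (src_count[s] + tgt_count[t])
        # sorted() only makes tie-breaking deterministic
        lexicon[s] = max(sorted(counts), key=dice)
    return lexicon


# Invented toy corpus: parliamentary applause dominates the sample.
PARLIAMENT = [
    ("hear hear", "bravo"),
    ("hear hear", "bravo"),
    ("hear hear", "bravo"),
    ("we hear the minister", "nous entendons le ministre"),
]

lexicon = learn_lexicon(PARLIAMENT)
print(lexicon["hear"])  # bravo
```

Because three of the four sentence pairs are applause, "bravo" outscores "entendons" as the translation of "hear." Drawing on a broader mix of parallel text, such as pages gathered from the Web, is one way to dilute that kind of skew.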

Having obtained pages that are parallel translations of each other, Resnik and Dorr break up the matching pages into matching sentences and even matching words. This process, called "alignment," relies on algorithms that automatically help solve a problem common to traditional machine translation techniques: inadequate bilingual dictionaries. "It's very important to have a good dictionary," says Resnik, "but no dictionary can ever be complete. They all have gaps ... Many things that are important are changing every day. New words come into use--new places, new names, new technical terms."
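The article does not say which alignment algorithm the team uses. As a rough sketch of the sentence-level step, the code below assumes a simplified length-based approach of the kind used by classic sentence aligners: sentences that are translations of each other tend to have similar lengths, so dynamic programming can find the pairing with the smallest total length mismatch. The function name `align_sentences` and the `skip_penalty` parameter are hypothetical.

```python
def align_sentences(src, tgt, skip_penalty=10):
    """Align two lists of sentences by character length.

    Returns a list of (source_indices, target_indices) groups. A group may
    pair one sentence with one, one with two, two with one, or leave a
    sentence unmatched (at an extra cost of skip_penalty).
    """
    inf = float("inf")
    n, m = len(src), len(tgt)
    # (source sentences consumed, target sentences consumed) per group
    moves = [(1, 1), (1, 0), (0, 1), (2, 1), (1, 2)]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0

    for i in range(n + 1):
        for j in range(m + 1):
            if cost[i][j] == inf:
                continue
            for di, dj in moves:
                ni, nj = i + di, j + dj
                if ni > n or nj > m:
                    continue
                src_len = sum(len(s) for s in src[i:ni])
                tgt_len = sum(len(t) for t in tgt[j:nj])
                step = abs(src_len - tgt_len)
                if di == 0 or dj == 0:
                    step += skip_penalty  # discourage unmatched sentences
                if cost[i][j] + step < cost[ni][nj]:
                    cost[ni][nj] = cost[i][j] + step
                    back[ni][nj] = (i, j)

    # Walk the back-pointers from the end to recover the groups.
    groups = []
    i, j = n, m
    while (i, j) != (0, 0):
        pi, pj = back[i][j]
        groups.append((list(range(pi, i)), list(range(pj, j))))
        i, j = pi, pj
    groups.reverse()
    return groups


english = [
    "Good morning.",
    "The committee met on Monday and approved the budget.",
    "Thank you.",
]
french = [
    "Bonjour.",
    "Le comité s'est réuni lundi.",
    "Il a approuvé le budget.",
    "Merci.",
]
alignment = align_sentences(english, french)
print(alignment)  # [([0], [0]), ([1], [1, 2]), ([2], [3])]
```

Here the long English sentence is matched to two shorter French ones. Word-level alignment, which is what fills dictionary gaps, would then operate inside each matched group.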

Better word-to-word dictionary entries make it possible to build better bilingual lexicons, containing detailed linguistic information, by "porting" the linguistic information known about one language to the other using the word-level dictionary entries as a bridge. "To build a Chinese lexicon, we can translate the knowledge about the English to the Chinese and use the result for machine translation of Chinese," explains Dorr. Armed with this knowledge, her translation techniques can use words and grammatical structure to determine the intended meaning of a sentence.
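The idea of using dictionary entries as a bridge can be sketched in a few lines. This is only an illustration under stated assumptions: the lexicon format, the feature names ("pos", "roles"), the function `port_lexicon`, and the two-word toy dictionary are all invented here and are not taken from Dorr's system.

```python
def port_lexicon(source_lexicon, bilingual_dict):
    """Copy linguistic information from source-language words to their
    dictionary translations, recording where each entry came from."""
    ported = {}
    for src_word, translations in bilingual_dict.items():
        info = source_lexicon.get(src_word)
        if info is None:
            continue  # nothing is known about this source word
        for tgt_word in translations:
            entry = dict(info, ported_from=src_word)
            ported.setdefault(tgt_word, []).append(entry)
    return ported


# Toy English lexicon (hypothetical feature names).
english_lexicon = {
    "give": {"pos": "verb", "roles": ["giver", "thing", "recipient"]},
    "book": {"pos": "noun", "roles": []},
}

# Toy word-level dictionary of the kind alignment could produce.
english_to_chinese = {
    "give": ["给"],
    "book": ["书"],
    "email": ["电子邮件"],  # no English lexicon entry, so nothing is ported
}

chinese_lexicon = port_lexicon(english_lexicon, english_to_chinese)
print(chinese_lexicon["给"][0]["pos"])  # verb
```

A real lexicon would need to handle words whose translations behave differently in the two languages. This sketch simply copies the information across and keeps a `ported_from` field so that each entry can be checked later.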

These scientists are also taking advantage of a parallel corpus that significantly predates the World Wide Web--the Bible. Because the Bible has been so carefully and widely translated, it provides a unique opportunity for studying how the same meanings are expressed in multiple languages.
--PS

