Are you worried about not understanding foreign languages, or about the high cost of translation? Do you ever wonder how much information you miss because you understand only one, or maybe a few, languages? There is a lot of positive buzz surrounding Google’s recent introduction of its new machine translation system. Many people I have spoken with recently have been pleasantly surprised by the quality of the translations.
Google’s new algorithms are based on Statistical Machine Translation (SMT) methods in combination with translation memory. This approach differs greatly from the more traditional grammar-based methods. For many years, the field of computational linguistics consisted, on the one hand, of research based on Chomsky’s theories of generative grammar and, on the other, of more statistical approaches. Because of the complexity, non-robustness, and slow processing of the grammatical approach, I have personally always favored the statistical approach. Twenty years ago, I even wrote my Ph.D. thesis on this topic, applying many statistical principles to the processing of natural language in an information retrieval context (still available in the long tail of Amazon: http://www.amazon.com/networks-language-processing-Information-Retrieval/dp/B0006PA9ZU/ref=sr_1_1?ie=UTF8&s=books&qid=1270038283&sr=8-1).
Over the years, the statistical and grammatical methods have more or less merged: the best-performing approaches are now based on statistical algorithms in combination with large corpora of natural language that are tagged with (simple) linguistic properties and structures (for instance, the Brown Corpus: http://en.wikipedia.org/wiki/Brown_Corpus). Linguistic probabilities are calculated automatically from these large collections of data.
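To make this concrete, here is a minimal sketch of how such linguistic probabilities can be estimated by simple counting. The tiny tagged corpus below is purely illustrative (not real Brown Corpus data), and the tag set is a hypothetical simplification; the point is only the counting principle.

```python
from collections import Counter

# A tiny stand-in for a tagged corpus in the style of the Brown Corpus:
# each sentence is a list of (word, part-of-speech) pairs.
# DT = determiner, NN = noun, VB = verb (illustrative tags only).
tagged_corpus = [
    [("the", "DT"), ("dog", "NN"), ("barks", "VB")],
    [("the", "DT"), ("cat", "NN"), ("sleeps", "VB")],
    [("a", "DT"), ("dog", "NN"), ("sleeps", "VB")],
]

# Count how often each tag follows each tag, then turn the counts
# into conditional probabilities P(next_tag | previous_tag).
bigrams = Counter()
unigrams = Counter()
for sentence in tagged_corpus:
    tags = [tag for _, tag in sentence]
    for prev, nxt in zip(tags, tags[1:]):
        bigrams[(prev, nxt)] += 1
        unigrams[prev] += 1

def p_next(prev, nxt):
    """P(nxt | prev), estimated by relative frequency."""
    if unigrams[prev] == 0:
        return 0.0
    return bigrams[(prev, nxt)] / unigrams[prev]

# In this toy corpus, a determiner is always followed by a noun.
print(p_next("DT", "NN"))  # 1.0
```

Real systems refine this basic idea with smoothing and far larger corpora, but the principle is the same: the probabilities fall out of the data automatically, with no hand-written grammar rules.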
Statistical Machine Translation works on the same principle: from a large collection of sentence pairs in the source and target languages, an SMT algorithm can derive the most probable translation for a particular sentence, phrase, or word in a specific context. This innovative approach overcomes many of the problems of traditional automated translation. While the translations may not be admissible in court, they do provide great insight into the content of large document and e-mail collections.
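The core learning step can be sketched with the classic IBM Model 1, one of the original SMT word-alignment models: given nothing but sentence pairs, expectation-maximization gradually discovers which source words translate to which target words. The three French–English pairs below are purely illustrative; real systems train on millions of pairs.

```python
from collections import defaultdict

# A toy parallel corpus of (source, target) sentence pairs.
pairs = [
    ("la maison".split(), "the house".split()),
    ("la fleur".split(), "the flower".split()),
    ("une maison".split(), "a house".split()),
]

# IBM Model 1: estimate word-translation probabilities t(target | source)
# via expectation-maximization over hidden word alignments.
t = defaultdict(lambda: 1.0)  # uniform initialization
for _ in range(10):  # EM iterations
    count = defaultdict(float)
    total = defaultdict(float)
    for src, tgt in pairs:
        for e in tgt:
            norm = sum(t[(e, f)] for f in src)
            for f in src:
                frac = t[(e, f)] / norm  # expected alignment count
                count[(e, f)] += frac
                total[f] += frac
    for (e, f), c in count.items():
        t[(e, f)] = c / total[f]  # re-estimate probabilities

# After training, "maison" aligns far more strongly with "house"
# than with "the", even though both co-occur with it.
print(t[("house", "maison")], t[("the", "maison")])
```

Note how the model disambiguates without any dictionary: because "la" also appears alongside "fleur", the probability mass for "the" drifts to "la", leaving "house" as the dominant translation of "maison".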
Now, why are SMT systems suddenly so good? I believe there are two main drivers:
(i) After 9/11, the US intelligence agencies were in great need of translations for languages such as Urdu, Pashto, and Farsi. There were not enough screened translators, and it was impossible to train enough existing and newly screened employees to translate all the available data. Machine Translation was the only option. The problem with existing (grammar-based) translation tools was that training the system for a specific domain required the vendor’s involvement, which was problematic given the highly confidential nature of the data. And last but not least, understanding the training process required deep knowledge of computational linguistics, a talent that is also in limited supply. Statistical Machine Translation can learn automatically from sets of examples, a process that can be done in-house. SMT was also able to process the often corrupted data much more comprehensively than the traditional approaches. So, basically, SMT did a better job on all requirements, and US intelligence agencies invested heavily in this new technology, making it even better.
(ii) Thanks to the availability of large translation databases from, for instance, the United Nations and the European Union, training SMT algorithms is much easier than it was in the past. Finally, the data equivalent of Moore’s law (the amount of available data doubling roughly every 18 months) works to our advantage!
There is one golden rule in all statistical linguistic algorithms: THE ONLY GOOD DATA IS MORE DATA. For that reason, I expect these algorithms only to get better and better, because the one thing we can be sure about is that we will have MUCH more data a few years from now.
Here are the advantages of Statistical Machine Translation in more detail.
Volume: SMT technology has the unique ability to handle a large volume of translations, and do so quickly. It is an ideal solution for companies that have continuous publishing/translation cycles, large volumes of digital content, and growing demands to distribute more multilingual information.
Speed: SMT delivers the highest throughput commercially available for statistically based, automated translation, as well as unprecedented speeds for translating digital content. Additionally, a company can get up and running with an SMT solution significantly faster than with other options, from evaluation to integration to deployment.
Accuracy & Training: SMT offers the ability to train translation systems on a specific domain or subject area in-house, radically increasing translation accuracy. This process uses existing translated content to teach the software the terminology and style of the domain in question. This is especially interesting for intelligence and security organizations dealing with highly confidential data: there is no need to disclose your data to a third party.
Robustness: No other approach can process incomplete, misspelled, inconsistent, and otherwise corrupted data better than Statistical Machine Translation. Where other approaches fail dramatically, SMT can easily work with such data and still produce meaningful results.
The applications of Machine Translation are endless: apart from the obvious ones in intelligence, law enforcement, and investigations, there are many other applications in the fields of eDiscovery, compliance, information governance, auditing, and knowledge management. Machine translation solutions accelerate the way the world communicates by “unlocking” large volumes of digital content that would otherwise never be translated.