Tech Thoughts

Good news for all Telugu-speaking folks and those who have a special interest in the language! Telugu is now one of the languages supported by Google Translation. In fact, a few other Indian languages, like Bengali and Tamil, are now supported too, apart from Hindi.

Now for the bad news. The translation is really horrible! For those of you who can read and understand Telugu, here are a few translations (have a good laugh! :-)):

How are you? - ఎలా మీరు?
ఎలా ఉన్నారు? - How do I have? (also sometimes How have?)
Google's translation is horrible - Google యొక్క అనువాద భయంకరమైన ఉంది

From the first two simple examples it is very clear that the translation is neither a word-for-word translation nor one that follows a grammar. So what technique does Google Translation use?

Before going into that, a couple more observations:
One - for words and short phrases (which do not involve many grammatical forms), the translation is fairly OK. Even a word like "ధాన్యము" is translated correctly as "Cereals". And I was surprised to see "గ్రహణము" translated as "Comprehension"! Even well-known phrases like "పుట్టిన రోజు శుభాకాంక్షలు" are translated correctly as "Happy Birthday".
Two - for longer lines and passages, though the translation is not grammatically correct, we can to some extent get an overall idea of what it is all about. For example:

"Good news for all Telugu speaking folks and those who have special interest in that language! Telugu is now one of the languages supported by Google Translation." is translated to

"అన్ని తెలుగు మాట్లాడే చేసారో మరియు ఆ భాషలో ప్రత్యేక ఆసక్తి ఉన్నవారి కోసం శుభవార్త! తెలుగు ఇప్పుడు Google Translation మద్దతు భాషలు ఒకటి."

So what's the technique?

We first need to understand one point. Google Translation is a tool developed to support a large number of languages, not just one or two. And the translation is free of cost (at least as of now!). So it makes sense to look for a technique that works for any language, i.e., one that is not language-specific. That is precisely what Google has chosen. It uses a technique called "Statistical Machine Translation" (SMT). SMT, as opposed to what is called "Rule-Based Translation", depends purely on data. No language-specific rules of grammar are used in SMT. Since it is based purely on statistical methods, it requires a considerable volume of data to give a reasonable level of accuracy.
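To make "purely statistical" a little more concrete, here is the classic textbook (noisy-channel) way of writing down SMT. This is an assumption for illustration only; I do not know the exact model Google uses. Here e is the English source sentence and t is a candidate Telugu sentence (both symbols are mine):

```latex
\hat{t} \;=\; \arg\max_{t} P(t \mid e)
        \;=\; \arg\max_{t} \frac{P(e \mid t)\, P(t)}{P(e)}
        \;=\; \arg\max_{t} P(e \mid t)\, P(t)
```

The second step is just Bayes' rule, and P(e) can be dropped because it is the same for every candidate t. P(e | t) is usually called the translation model and P(t) the language model. Both are estimated from data, with no grammar rules involved.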

So what data is required? SMT needs two kinds of data sets: a bilingual data set and a monolingual data set.
A bilingual data set is a collection of text in one language together with its (human) translation into another language. Let us say we need to develop translation between English and Telugu. We would need a collection of text in Telugu or English and its translation into the other language. Using this as training data, the SMT program tries to map words between the two languages. Various algorithms can be used to obtain this mapping. Essentially, they look for the mappings that are most probable given the data. One can map individual words or groups of words. "Happy Birthday" is a very good example, where the translation program has correctly mapped the entire phrase to "పుట్టినరోజు శుభాకాంక్షలు" in Telugu.
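Here is a minimal sketch of how such a word mapping could be learned from bilingual data. It uses an IBM Model 1 style expectation-maximization loop, which is one well-known textbook algorithm and not necessarily what Google runs. The three-pair corpus and the function names train_word_table and best_translation are made up for illustration.

```python
from collections import defaultdict

# Toy parallel corpus, invented for illustration: (English, Telugu).
PARALLEL = [
    ("big house", "పెద్ద ఇల్లు"),
    ("small house", "చిన్న ఇల్లు"),
    ("big book", "పెద్ద పుస్తకం"),
]

def train_word_table(pairs, iterations=20):
    """Estimate P(telugu_word | english_word) with IBM Model 1 style EM."""
    corpus = [(en.split(), te.split()) for en, te in pairs]
    te_vocab = {w for _, te_words in corpus for w in te_words}
    # Start with no preference: every Telugu word is equally likely.
    uniform = 1.0 / len(te_vocab)
    prob = defaultdict(lambda: defaultdict(lambda: uniform))
    for _ in range(iterations):
        count = defaultdict(lambda: defaultdict(float))
        total = defaultdict(float)
        # E-step: share each Telugu word among the English words of its
        # sentence pair, in proportion to the current probabilities.
        for en_words, te_words in corpus:
            for te in te_words:
                norm = sum(prob[en][te] for en in en_words)
                for en in en_words:
                    share = prob[en][te] / norm
                    count[en][te] += share
                    total[en] += share
        # M-step: turn the fractional counts back into probabilities.
        prob = defaultdict(lambda: defaultdict(float))
        for en, row in count.items():
            for te, c in row.items():
                prob[en][te] = c / total[en]
    return prob

def best_translation(prob, english_word):
    """Return the Telugu word with the highest learned probability."""
    row = prob[english_word]
    return max(row, key=row.get)

table = train_word_table(PARALLEL)
for word in ["big", "small", "house", "book"]:
    print(word, "->", best_translation(table, word))
```

Nobody tells the program that "house" means "ఇల్లు". It works this out only because the two words keep showing up in the same sentence pairs, which is also why a tiny or noisy corpus gives poor mappings.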
A monolingual data set is a collection of text in the target language. For example, when we translate from English to Telugu, Telugu is the target language. We would need a collection of Telugu text for the SMT program to learn how Telugu words and sentences are constructed. This is primarily used for identifying the correct sequence of words in a sentence. For example, in the sentence "Good news for all Telugu speaking folks..." above, the phrase "Good news", which is at the beginning of the English sentence, has gone to the end of the sentence in Telugu. And that is correct! How did the program know? It did not know; it merely guessed, based on the monolingual data. Again, there are various algorithms for arriving at the best possible sequencing. They could be based on just the position of a word or phrase in the training data. They could also be based on the co-occurrence of two or more words.
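The co-occurrence idea can be sketched with a tiny bigram language model: count which word follows which in the monolingual text, then pick the ordering of the translated words that scores best. This is only a toy under my own assumptions. The three Telugu sentences and the function names are invented, add-one smoothing is just the simplest choice, and a real system would not try every permutation.

```python
import math
from collections import Counter
from itertools import permutations

# Toy monolingual (Telugu) corpus, invented for illustration.
MONOLINGUAL = [
    "ఇది పెద్ద ఇల్లు",
    "అది చిన్న ఇల్లు",
    "ఇది పెద్ద పుస్తకం",
]

def train_bigram_counts(sentences):
    """Count adjacent word pairs, with sentence start/end markers."""
    bigrams, contexts, vocab = Counter(), Counter(), set()
    for sentence in sentences:
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        vocab.update(tokens)
        for prev, cur in zip(tokens, tokens[1:]):
            bigrams[(prev, cur)] += 1
            contexts[prev] += 1
    return bigrams, contexts, len(vocab)

def log_score(words, model):
    """Log-probability of a word order under the add-one smoothed model."""
    bigrams, contexts, vocab_size = model
    tokens = ["<s>"] + list(words) + ["</s>"]
    return sum(
        math.log((bigrams[(prev, cur)] + 1) / (contexts[prev] + vocab_size))
        for prev, cur in zip(tokens, tokens[1:])
    )

def best_order(words, model):
    """Try every ordering of the words and keep the most probable one."""
    return max(permutations(words), key=lambda order: log_score(order, model))

model = train_bigram_counts(MONOLINGUAL)
print(" ".join(best_order(["ఇల్లు", "ఇది", "చిన్న"], model)))
```

Note that the winning sentence never appears in the training text. The model pieces it together from word pairs it has seen, which is exactly the sense in which the program "guesses" the order without knowing any grammar.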

Fine. So how can the translation be improved?

The simple answer is - create more and more data! The larger the volume of data, the better the accuracy. It looks like one would need data sets of 300-600 million words to get even moderate accuracy. I do not know where the current data set for Telugu comes from, nor its size. Google provides a Translator Toolkit that lets translators translate various types of documents using a combination of automatic and human translation. For example, one can automatically translate an English Wiki page into Telugu using Google Translator, then manually correct it and publish it back to the Wiki. Google keeps all the translations in a database. Creating more and more translations this way grows the data set, resulting in better quality from Google Translator. An online community can be formed for doing English - Telugu translations of online information sources (like Wikipedia). This would serve two purposes: more and more information would be available in both languages, and Google Translation would become more and more accurate.

Another possibility is to create a Telugu-specific translator that combines rule-based translation and SMT. Of course, Google would not be interested in taking that up. It will have to come from Telugu people. This hybrid translator can use Google's SMT for the statistical part through the Translator API published by Google. Rule-based translation would require fewer computational resources than SMT.

I think there are a couple of things that Google should also work on to improve the translation quality for Telugu.

One - better use of a dictionary. I am not sure whether any English-Telugu dictionary is being used during the translation process. If one is indeed used, it is not working properly, because even basic words like "అమ్మ" are translated incorrectly as "I".

Two - improving the mechanisms that handle the high degree of agglutination and inflection in a language like Telugu. Agglutination is the attaching of affixes to a root word. Inflection is changing the form of a word for different contexts. "విభక్తులు" in Telugu are a good example of this. For example, "Sita is wife of Rama" translates to "సీత రాముని భార్య." Here the preposition "of", which is a separate word in English, becomes part of the noun "రామ", changing its form to "రాముని". Both agglutination and inflection pose a challenge for the automatic mapping of words from English to Telugu. Though Google claims that its experience with German, Turkish and Russian has helped it deal with the agglutination problem, that looks dubious, because most of the problems in English - Telugu translation seem to be caused by agglutination.

On the whole, I think Google Translation's support for Telugu is indeed a good step towards creating more digital content in Telugu and making Telugu content available to a wider audience across the globe.

A couple of good links:
How Google Translate Works
Statistical Machine Translation: Foundations and Recent Advances

Google Translation for Telugu