Q17 – Is your machine translation good with Czech language?

This is a typical question from some of our Slavic-speaking clients: Is your machine translation good with Czech language? Is your machine translation good with Russian? Is your machine translation good with Croatian?

Slavic languages have many grammatical cases (word inflections). This made statistical machine translation work quite poorly, because the probability of any particular string occurring in the training data was quite low. These types of languages are also called “morphologically rich languages” because of the number of word-form combinations that are possible.
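To get a feel for why this matters, consider how many surface forms a single Czech noun produces. The snippet below is purely illustrative (the word list is the standard textbook declension of “žena”, not data taken from any of our engines) and simply counts the distinct forms of one lemma.

```python
# Illustrative only: case forms of the Czech noun "žena" (woman).
# One lemma yields many distinct surface forms, so each individual form
# appears rarely in a corpus -- the data sparsity that hurt statistical MT.
zena_forms = {
    "singular": ["žena", "ženy", "ženě", "ženu", "ženo", "ženě", "ženou"],
    "plural":   ["ženy", "žen", "ženám", "ženy", "ženy", "ženách", "ženami"],
}

distinct = set(zena_forms["singular"]) | set(zena_forms["plural"])
print(f"{len(distinct)} distinct forms for a single lemma: {sorted(distinct)}")
# An English noun typically has two forms (woman / women); a Czech noun has
# far more, and adjectives and verbs multiply the combinations further.
```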

Neural networks changed the approach completely. A neural network works both below and above the word level, capturing how each word is formed and how it relates to the words around it. This means that neural network-based machine translation understands the relations between the different words in a sentence far better. By learning the dependencies between words from the training data provided, neural network-based machine translation produces output with a near-human flow, what is often described as human-quality machine translation.

One of our clients asked us:

I thought that PangeaMT provided only generic engines and that we could customize these engines with our own TMs to create in-domain specific “mirrors” (using the “OnlineTraining” module). I also know that our language combinations (EN <-> CS and DE <-> CS, both ways) are not well supported by other MT providers (Czech is really complicated for MT solutions). So I had to ask whether PangeaMT provides these two combinations as well.

Well, indeed, you can customize your engine with our online tool on your own server. This provides a lot of freedom and independence when setting up a machine translation environment for a translation agency. As language consultants, linguists tackle texts and documents of very different natures, often with conflicting terminology. Mixing everything into a single engine would be detrimental to performance and accuracy.

Take the following Czech-English TMX file.

[Image: text view of an English-Czech Translation Memory]

Translators are very familiar with this format. It is the text (exchange) version of a Translation Memory database. Every time a translator saves a segment, they create an equivalent of the source sentence in the target language. This is wonderful for machine learning, as translators create parallel data. It is the basis of many developments at PangeaMT.
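As a rough idea of how such a file becomes training data, here is a minimal Python sketch that reads the translation units from a TMX file and yields source/target sentence pairs. The file name and language codes are assumptions for illustration only; this is not PangeaMT’s actual ingestion code.

```python
# Minimal sketch: extract English-Czech sentence pairs from a TMX file.
# File name and language codes below are assumptions for illustration.
import xml.etree.ElementTree as ET

XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

def read_tmx_pairs(path, src_lang="en", tgt_lang="cs"):
    """Yield (source, target) segment pairs from a TMX translation memory."""
    tree = ET.parse(path)
    for tu in tree.getroot().iter("tu"):
        segs = {}
        for tuv in tu.iter("tuv"):
            lang = (tuv.get(XML_LANG) or tuv.get("lang") or "").lower()
            seg = tuv.find("seg")
            if seg is not None and seg.text:
                segs[lang.split("-")[0]] = seg.text.strip()
        if src_lang in segs and tgt_lang in segs:
            yield segs[src_lang], segs[tgt_lang]

# Each yielded pair is one line of parallel data for engine training.
for en, cs in read_tmx_pairs("memory_en_cs.tmx"):
    print(en, "\t", cs)
```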

A neural network will find the relations and similarities between the sentences, down to the syllable and letter level if necessary (this is a very useful feature in neural training called BPE, byte pair encoding). It is also responsible for neural machine translation’s success and higher acceptability compared with the previous n-gram based “Statistical Machine Translation”, which is still successful with short sentences because of its higher “memorizing” capabilities, as explained in our first neural machine translation publications back in 2017.

Our findings at the time showed that a short sentence of fewer than 9-10 words could often be translated more accurately with a statistical system than with a neural system. As systems have improved over the years, the gap between the two has narrowed. However, it is true that when ecommerce websites only need to translate a couple of words, and those words have been part of the training data, a statistical system will recall them more quickly and efficiently. A neural system, however, will reconstruct the sentence with a more human fluency.
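For readers curious about what BPE does in practice, below is a compact, self-contained Python sketch of the classic byte-pair-encoding merge loop (in the spirit of the subword approach used in neural MT, not our production tokenizer; the toy vocabulary and frequencies are invented for illustration). It repeatedly merges the most frequent adjacent symbol pair, so related Czech inflections end up sharing subword pieces such as “žen”.

```python
# Toy byte-pair-encoding (BPE) sketch -- not production code.
# It learns the most frequent adjacent symbol pairs and merges them,
# so inflected forms like "ženou" and "ženami" end up sharing pieces.
from collections import Counter

def get_pair_counts(vocab):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in vocab.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(pair, vocab):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = {}
    for symbols, freq in vocab.items():
        out, i = [], 0
        while i < len(symbols):
            if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = freq
    return merged

# Toy corpus counts of a few Czech word forms (illustrative numbers).
vocab = {
    tuple("žena") + ("</w>",): 5,
    tuple("ženy") + ("</w>",): 4,
    tuple("ženou") + ("</w>",): 2,
    tuple("ženami") + ("</w>",): 1,
}

for step in range(6):
    pairs = get_pair_counts(vocab)
    if not pairs:
        break
    best = max(pairs, key=pairs.get)
    vocab = merge_pair(best, vocab)
    print(f"merge {step + 1}: {best}")

print("final segmentations:", list(vocab))
```

Running the sketch shows the frequent stem being merged first, which is exactly why subword models generalize well across the many inflected forms of a morphologically rich language.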

Therefore, if you ask whether our machine translation is good with the Czech language, the answer is YES! We have the team, technology and data to make your MT engine run smoothly and produce high-quality translations, millions of words at a time!