How does phrase-based machine translation work?

This approach, also called statistical machine translation (SMT), is based on large parallel and monolingual corpora and on statistical models. The translation is produced by assembling separate translations of phrases found in the source sentence. The term phrase here designates a sequence of words occurring in the training data for which another sequence of words is known to be its translation in the target language. The selection and arrangement of the phrase translations is governed by several statistical models, which estimate in particular the probability that a given source phrase is translated in a given way, and the probability that a given translation of that phrase occurs in a given context in the target sentence. Because some words have multiple meanings (and are translated differently in different contexts) while others form set phrases, phrasal verbs or collocations, the more multilingual text is available, the higher the probability that the selected translation will be correct.
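
To make the idea more concrete, here is a minimal, heavily simplified Python sketch of phrase-based scoring: a toy phrase table gives the probability of each candidate translation, a toy bigram language model scores how well it fits the target context, and the candidate with the best combined score is chosen. The phrases, probabilities and the greedy left-to-right search are illustrative assumptions only; a real SMT decoder searches over many segmentations and reorderings with beam search.

```python
import math

# Toy phrase table: source phrase -> list of (target phrase, translation probability).
# All phrases and probabilities below are invented purely for illustration.
PHRASE_TABLE = {
    "the house": [("das Haus", 0.7), ("das Gebäude", 0.3)],
    "is small": [("ist klein", 0.8), ("ist gering", 0.2)],
}

# Toy bigram language model for the target language: P(word | previous word).
BIGRAM_LM = {
    ("<s>", "das"): 0.4, ("das", "Haus"): 0.5, ("das", "Gebäude"): 0.1,
    ("Haus", "ist"): 0.6, ("Gebäude", "ist"): 0.3,
    ("ist", "klein"): 0.5, ("ist", "gering"): 0.1,
}

def lm_log_prob(words):
    """Log probability of a target word sequence under the toy bigram model."""
    logp, prev = 0.0, "<s>"
    for w in words:
        logp += math.log(BIGRAM_LM.get((prev, w), 1e-6))  # back off to a tiny constant
        prev = w
    return logp

def translate(source_phrases):
    """Greedily pick, phrase by phrase, the candidate with the best combined score."""
    output = []
    for src in source_phrases:
        best = max(
            PHRASE_TABLE[src],
            key=lambda cand: math.log(cand[1]) + lm_log_prob(output + cand[0].split()),
        )
        output.extend(best[0].split())
    return " ".join(output)

print(translate(["the house", "is small"]))  # -> "das Haus ist klein"
```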

How does neural machine translation (NMT) work?

The text in the source language is split into sentences, the sentences are split into words and the words are further divided into “subwords”. The sequence of subwords corresponding to each source sentence is fed into a neural network, which processes it to create a representation of the entire sentence and then converts this representation into a sequence of subwords in the target language. These subwords are joined back into words, the words are assembled into sentences, and the sentences form the resulting translated text. Additional steps may take place during this process, for example to preserve the formatting of the target document. A neural machine translation engine learns from large amounts of parallel text (text in one language together with its translation in another language), so that its translations correspond as closely as possible to the training data.
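
As a rough illustration of the subword step, the following Python sketch splits words into pieces from a toy vocabulary (in the spirit of BPE/WordPiece), passes them through a placeholder “translation” function and joins the resulting pieces back into words. The vocabulary, the `@@` continuation marker and the lookup-table stand-in for the neural encoder-decoder are assumptions made for this example, not components of any real engine.

```python
# Toy subword vocabulary; real systems learn these pieces from the training data.
TOY_MERGES = {"trans", "lat", "ion", "work", "s", "well"}

def split_word(word):
    """Greedy longest-match split of a word into known subword pieces."""
    pieces, start = [], 0
    while start < len(word):
        for end in range(len(word), start, -1):
            if word[start:end] in TOY_MERGES or end - start == 1:
                pieces.append(word[start:end])
                start = end
                break
    # mark every piece except the last one with a continuation symbol, BPE-style
    return [p + "@@" for p in pieces[:-1]] + pieces[-1:]

def to_subwords(sentence):
    return [piece for word in sentence.split() for piece in split_word(word)]

def merge_subwords(pieces):
    """Undo the subword split: join pieces marked with the continuation symbol."""
    return " ".join(pieces).replace("@@ ", "")

def translate_subwords(src_pieces):
    # Placeholder for the neural encoder-decoder: a simple lookup table keeps
    # the example self-contained; a real network reads the whole sequence at once.
    TABLE = {"trans@@": "Über@@", "lat@@": "setz@@", "ion": "ung",
             "work@@": "funktion@@", "s": "iert", "well": "gut"}
    return [TABLE.get(p, p) for p in src_pieces]

pieces = to_subwords("translation works well")
print(pieces)                                      # ['trans@@', 'lat@@', 'ion', 'work@@', 's', 'well']
print(merge_subwords(translate_subwords(pieces)))  # Übersetzung funktioniert gut
```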

What is a neural network?

This is one of the computational models used in machine learning (ML) and artificial intelligence (AI). A single neuron is a computational unit that takes multiple inputs and produces one output, which can in turn serve as an input to many other neurons. A neural network is a set of such artificial neurons interconnected by a huge number of these input–output links.
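
A minimal sketch of a single artificial neuron in Python (using NumPy): a weighted sum of the inputs plus a bias, passed through a non-linear activation. The input values and weights below are arbitrary illustrations.

```python
import numpy as np

def neuron(inputs, weights, bias):
    """One artificial neuron: a weighted sum of its inputs passed through
    a non-linear activation (here the logistic sigmoid)."""
    return 1.0 / (1.0 + np.exp(-(np.dot(weights, inputs) + bias)))

x = np.array([0.5, -1.0, 2.0])   # three inputs
w = np.array([0.8, 0.2, -0.4])   # one weight per input (adjusted during training)
print(neuron(x, w, bias=0.1))    # single output, usable as an input to other neurons
```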

By presenting inputs and the desired outputs to the neural network, it can be “taught” the general relations that exist between them. A network trained in this way can then also be used to estimate correct outputs for inputs that were not part of its training data.
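
The following toy example sketches this “teaching” process: a tiny two-layer network is trained with gradient descent to reproduce the XOR relation between its inputs and outputs. The network size, learning rate and number of steps are arbitrary choices for the illustration; real MT networks are vastly larger and trained with more sophisticated methods.

```python
import numpy as np

# Toy task: learn XOR from four input/output examples.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
Y = np.array([[0], [1], [1], [0]], dtype=float)

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(2, 8)), np.zeros(8)   # input -> hidden layer
W2, b2 = rng.normal(size=(8, 1)), np.zeros(1)   # hidden -> output

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
lr = 0.5

for step in range(5000):
    # forward pass: each layer is a set of neurons as described above
    H = sigmoid(X @ W1 + b1)
    out = sigmoid(H @ W2 + b2)

    # backward pass: nudge the weights to reduce the squared error
    d_out = (out - Y) * out * (1 - out)
    d_H = (d_out @ W2.T) * H * (1 - H)
    W2 -= lr * H.T @ d_out; b2 -= lr * d_out.sum(axis=0)
    W1 -= lr * X.T @ d_H;   b1 -= lr * d_H.sum(axis=0)

print(out.round(3))   # should be close to the desired outputs 0, 1, 1, 0
```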

What is a graphics card good for?

In machine translation, graphics cards are used for the neural network computations. A neural network allows a large number of calculations to be performed in parallel, so in this task a high-performance graphics card can match the performance of dozens of processors. Graphics cards can therefore make neural MT engines many times faster, which shows up above all in quicker response times, the capacity to translate larger volumes of text in the same time span, and lower hardware costs for the same computing performance.
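
A simple way to see this effect is to time a large matrix multiplication (the operation that dominates neural network computations) on the CPU and on a graphics card. The sketch below uses PyTorch purely as a convenient example framework; it is not necessarily what any particular MT engine uses.

```python
import time
import torch

def time_matmul(device, n=2048, repeats=5):
    """Multiply two n x n matrices several times and return the average time."""
    a = torch.randn(n, n, device=device)
    b = torch.randn(n, n, device=device)
    if device == "cuda":
        torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(repeats):
        _ = a @ b
    if device == "cuda":
        torch.cuda.synchronize()   # wait until the GPU has actually finished
    return (time.perf_counter() - start) / repeats

print(f"CPU: {time_matmul('cpu'):.4f} s per multiplication")
if torch.cuda.is_available():
    print(f"GPU: {time_matmul('cuda'):.4f} s per multiplication")
```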

What are the advantages of your solutions compared to the competition?

Compared to other available machine translation engines, Lingea Translator has the following advantages:

  • To achieve better results, we use our own language data and technologies: parallel and monolingual corpora, dictionaries, morphological analysers and other data and tools
  • Our MT engines can be deployed off-line, on site on the customer’s own server, to ensure the security of sensitive data. The data is not disclosed to any other entity during translation.
  • Minor languages used in Central Europe are just as important to us as major languages spoken by hundreds of millions of people. That is why they get our full attention and effort, while e.g. Google focuses primarily on West European and Asian languages.
  • Our MT engines can be trained for specific technical fields and domains (e.g. automotive, engineering, banking, pharmaceuticals etc.) – an MT engine specialized in this way usually provides much better results than a general MT engine.
  • We are able to preserve the formatting of the original (source) document, so the resulting translation can have the same formatting (headings, titles, paragraphs, bold text, references etc.) as the original text. Our settings support various formats. For example, in XML documents it is possible to translate only specific parts of the text and leave the rest unchanged, or to use different MT engines for different parts of the text (see the sketch after this list).
  • We can combine MT engines with a dictionary and other language tools – e.g. automatically complete the diacritics in a text before translating it (useful, for example, for e-mails and discussion posts), and after translation let the user look up individual words in the dictionary just by clicking on them.
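
As an illustration of the selective XML translation mentioned above, the sketch below translates only the text inside selected elements and leaves everything else untouched, and shows where different engines could be chosen for different elements. The tag names, the `translate()` placeholder and the engine names are assumptions made for this example, not our actual API.

```python
import xml.etree.ElementTree as ET

TRANSLATABLE_TAGS = {"title", "para"}   # which elements to translate (example assumption)

def translate(text, engine="general"):
    """Placeholder for a call to an MT engine; here it just marks the text."""
    return f"[{engine}:{text}]"

def translate_xml(xml_string):
    root = ET.fromstring(xml_string)
    for elem in root.iter():
        if elem.tag in TRANSLATABLE_TAGS and elem.text and elem.text.strip():
            # different elements could be routed to different engines here
            engine = "legal" if elem.tag == "para" else "general"
            elem.text = translate(elem.text, engine=engine)
    return ET.tostring(root, encoding="unicode")

doc = "<doc><title>Contract</title><para>Some clause.</para><code>do_not_touch()</code></doc>"
print(translate_xml(doc))   # <code> content is passed through unchanged
```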

What happens if the resulting machine translation is incorrect?

An MT engine can never get everything a hundred per cent right, simply because not all source sentences are entirely unambiguous. Every language contains ambiguous or equivocal terms, and many sentences require additional knowledge of the context, found for example in the previous paragraph or elsewhere in the preceding text, or simply general knowledge. The engine may swap the object and the subject or the active and passive voice, misread the structure of the sentence or misunderstand the meaning of a word. Many sentences are hard to understand and translate even for an experienced human translator with a good command of both the source and target language, because the source sentence does not necessarily contain all the information needed for a correct and precise translation. Moreover, most sentences can be translated correctly in several ways, and some of these translations are simply less suitable in a given context or in terms of style.

A machine translator is not meant to entirely replace a certified human translator. It is a tool that performs the tasks for which it is designed and trained. It can, for example, make a human translator’s work easier (by saving their time) or enable someone unfamiliar with the source language to extract information from a text. An integrated electronic dictionary is a great help for these purposes: the user can use it to verify important sections of the translation. A user who does not know the source language can thus use the machine translator and the dictionary to quickly obtain the relevant information from a text with a reasonable degree of confidence, without having to send it to a human translator and wait for the translation.

Can it be deployed off-line?

If you work with sensitive data (such as clients’ e-mails or documents), you are surely concerned about their security, and copying confidential texts into on-line MT tools is probably out of the question for you. For these cases we also offer the option of deploying our MT engines off-line, directly in the customer’s infrastructure – this way, sensitive data never leaves your internal network. This solution, however, requires an additional investment in a translation server. Hardware requirements differ depending on the technologies deployed, the translation direction (the source and target languages) and the required translation speed. Generally speaking, prices of suitable hardware start at about EUR 1,000, but they depend heavily on the technologies deployed, the required translation speed and the expected load, and can therefore be several times higher. Your particular configuration therefore needs to be specified in discussion with us. It can differ significantly depending on whether you choose the traditional statistical translation solution, which above all requires a large amount of memory, or the neural translation solution, which does not need much memory but requires high-performance graphics cards.

How can a neural translation engine be “trained”?

By selecting suitable training data and using them in different training stages, we can adapt our machine translation engines to specific fields of translation. This method was used, for example, for the field of public health within the EU project HimL (http://www.himl.eu/) and for translating tourism-related texts for our internal localization of tourist guides. In this way we can prepare MT engines that provide better output in a given thematic and technical field (a so-called domain) than general MT engines and that also preserve semantic precision better. The quality of the result depends on the complexity of the domain and on the amount of domain-specific data available. The most useful data are so-called parallel data, i.e. original texts together with their translations. Domain-specific texts in the target language are also very useful, and texts in the source language can be used as well. For all of these types of data, the more we have, the better. Domain-specific glossaries are also helpful, although here the quality of the data matters more than the quantity. The quality of the resulting MT engine therefore often depends very much on the client’s ability and willingness to provide (by agreement, of course) data suitable for training these specific models, or at least a description of them that can be used to obtain such training data from other sources.
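
A minimal sketch of the data-preparation idea behind such domain adaptation: a large general parallel corpus is used for the first training stage, and the scarcer domain-specific parallel data is oversampled for a second, fine-tuning stage. The file names and the oversampling factor are illustrative assumptions; a real training pipeline also involves data filtering, subword training and other steps.

```python
import random

def read_parallel(src_path, tgt_path):
    """Read a parallel corpus as a list of (source sentence, target sentence) pairs."""
    with open(src_path, encoding="utf-8") as fs, open(tgt_path, encoding="utf-8") as ft:
        return list(zip((l.rstrip("\n") for l in fs), (l.rstrip("\n") for l in ft)))

general = read_parallel("general.en", "general.cs")   # large general-purpose corpus (placeholder paths)
domain = read_parallel("health.en", "health.cs")      # smaller in-domain corpus, e.g. public health

# Stage 1: train on the general data; stage 2: continue training ("fine-tune")
# on a mix in which the scarce domain data is oversampled so it is not drowned out.
OVERSAMPLE = 5
fine_tune_data = general + domain * OVERSAMPLE
random.shuffle(fine_tune_data)

print(f"{len(general)} general pairs, {len(domain)} domain pairs,"
      f" {len(fine_tune_data)} pairs in the fine-tuning mix")
```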