Those seeking to apply machine translation solutions within their companies or institutions often only begin to understand the effort required in data cleaning when they start exporting bilingual (parallel) data for MT engine training. Owing to the limitations and features of some CAT tools, noise can enter a sentence in the shape of unwanted code, but the concept of data cleaning goes beyond removing in-lines, as explained in Q14. This is why using CAT tools from companies that base their philosophy on open standards and clean data processing is so necessary.
Some typical examples of necessary data cleaning were presented at the Japan Translation Federation 2011 as part of our Japanese Syntax-Based Machine Translation Hybrid.
Anybody with some translation experience will have come across some kind of “bad” TM. A “bad” TM can come in many shapes, from simply being a bad translation to being terminologically inaccurate, or even having split sentences, etc. Fortunately for our users, this kind of data cleaning has become part of PangeaMT’s standard cleaning procedure.
Some of the basic cleaning cycles are described below. They take into account procedures that have been automated for system owners, so they can rest assured that:
- their initial training data is clean before engine training in order to achieve the best results possible;
- any future post-edited material they want to use as re-training material also goes through a virtuous checking and cleaning cycle, eliminating any noise that might be introduced into the system and thus affect re-trainings.
PangeaMT first needs to ensure that the client's initial training set has passed all cleaning checks before training. This results in a lean bitext (parallel corpus) and aids computer learning. Together with Pangeanic's own processes, from language-specific rules to syntax or POS tagging, the data then enters the engine training cycle.
This is not a comprehensive list of all cleaning steps, but it gives users an idea of the kind of material that will be extracted for human approval before re-entering the training cycle. All segments detected as “suspect” are taken out of the training set for human revision and approval in TMX format and then re-entered in the system.
- Segments with a significant difference in length between the source and target. Generally, we consider a sentence “suspect” when source and target differ in length by more than 50%, but this threshold can be varied according to your particular needs (Czech, for example, is usually shorter than English, and French being 25% or 30% longer than English is not in itself an indication that anything is wrong).
- Segments where the source or target contains typographical symbols missing in the other, such as [ ], *, +, =.
- Segments where source and target are identical.
- “Empty segments”, i.e. segments with source but no target.
- Segments containing particular names or expressions which are part of the client’s preferred terminology.
All these are candidates for human revision that must be checked before entering the system as “proper” training material.
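The checks above can be sketched in code. The following is a minimal illustrative filter, not PangeaMT's actual implementation: the 50% length threshold and the symbol set are the examples mentioned above, and the function names are assumptions for the sake of the sketch.

```python
# Illustrative "suspect segment" filter for parallel data.
# Thresholds and helper names are hypothetical, chosen to mirror
# the checks described in the text above.

SYMBOLS = set("[]*+=")  # typographical symbols that should match across sides

def length_ratio_ok(source: str, target: str, max_diff: float = 0.5) -> bool:
    """True if source and target lengths differ by no more than max_diff (50%)."""
    longer = max(len(source), len(target))
    shorter = min(len(source), len(target))
    return shorter > 0 and (longer - shorter) / longer <= max_diff

def symbols_match(source: str, target: str) -> bool:
    """Symbols present on one side must also appear on the other."""
    return (SYMBOLS & set(source)) == (SYMBOLS & set(target))

def is_suspect(source: str, target: str) -> bool:
    """Flag a segment pair for human revision before it enters training."""
    source, target = source.strip(), target.strip()
    if not target:                      # "empty segment": source but no target
        return True
    if source == target:                # source and target are identical
        return True
    if not length_ratio_ok(source, target):
        return True                     # significant length difference
    if not symbols_match(source, target):
        return True                     # e.g. a [ ] or * on one side only
    return False
```

Segments for which `is_suspect` returns `True` would be set aside in TMX format for human revision, as described above; a real pipeline would also apply the client's terminology checks and language-specific thresholds.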
This is one of the things that sets us apart from other offerings: we will train you and provide you with the tools so you become your own master in future re-trainings. Clean data is the route to quality input and thus improved engine performance. The old translation saying applies: garbage in, garbage out. Thanks to our cleaning routines, you can rest assured that you will own a system which strips out any “dubious” material for your consideration.

And even after installation, please remember you have a full year of free support. Any odd results you see or experience, any patterns you would like to apply or correct, we are here to help. This is not a black-box system. We are not a company selling machine-translated words at a very cheap rate or gathering data to build and hire out engines. Our model is “user empowerment”, i.e. providing you with the technology that will enable you to grow the MT solution most appropriate to your needs.