Q16 – How about data cleaning? What is your approach?

Companies often underestimate (and only begin to understand) the effort required in data cleaning when they start exporting bilingual (parallel) data for machine learning. Owing to CAT tool limitations and features, noise can enter a sentence in the shape of unwanted code, but the concept of data cleaning goes beyond removing in-line codes, as explained in Q14.
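As a purely illustrative example, the short Python sketch below shows one way such unwanted in-line codes could be stripped from exported segments before they reach a training corpus. The tag names follow the TMX 1.4 inline elements; the routine itself is an assumption made for the example, not PangeaMT's actual cleaning code.

import re

# Inline elements whose content is native formatting code (<bpt>, <ept>,
# <ph>, <it>, <ut>): drop the element together with its content.
CODE_ELEMENTS = re.compile(
    r"<(bpt|ept|ph|it|ut)\b[^>]*>.*?</\1>|<(?:bpt|ept|ph|it|ut)\b[^>]*/>",
    re.IGNORECASE | re.DOTALL,
)
# <hi> merely highlights translatable text: drop the tags, keep the content.
HI_TAGS = re.compile(r"</?hi\b[^>]*>", re.IGNORECASE)

def strip_inline(segment: str) -> str:
    """Remove in-line codes and collapse the whitespace they leave behind."""
    text = CODE_ELEMENTS.sub("", segment)
    text = HI_TAGS.sub("", text)
    return re.sub(r"\s+", " ", text).strip()

print(strip_inline('Press <ph x="1">&lt;br/&gt;</ph> <hi type="bold">OK</hi> to continue.'))
# -> 'Press OK to continue.'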

Some typical examples of the data cleaning that is necessary were presented at the Japan Translation Federation 2011 conference as part of our Japanese Syntax-Based Machine Translation Hybrid.

Anybody who has been in the translation industry long enough has come across some kind of “bad” TM. This can come in many shapes, from a simply poor translation to terminologically inaccurate entries, and so on. Fortunately for our users, this kind of data cleaning has become part of PangeaMT’s standard cleaning procedure.

Some of the basic cleaning cycles are described below. They include procedures that have been automated for system owners, so they can rest assured that

  • their initial training data is clean before engine training in order to achieve the best results possible
  • any future post-edited material also goes through a virtuous cleaning cycle in order to catch any noise that may be introduced into the system and thus affect re-trainings.

Data cleaning cycle for machine translation engine (re)-training

PangeaMT first needs to ensure that the initial training set from the client has passed all cleaning checks before training. This results in a lean bitext (parallel corpus) and aids machine learning. Together with PangeaMT’s own processes, from language-specific rules to syntax or POS tagging, the data then enters the engine training cycle.
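To make the idea concrete, here is a minimal sketch of the kind of preprocessing a segment might go through on its way into training. It uses NLTK for tokenisation and English POS tagging purely as an example toolkit; it is an assumption for illustration, not a statement about PangeaMT’s internal pipeline.

import nltk

# One-off model downloads for the tokenizer and the English POS tagger.
nltk.download("punkt", quiet=True)
nltk.download("averaged_perceptron_tagger", quiet=True)

def preprocess(segment: str):
    """Tokenise an English segment and attach part-of-speech tags."""
    tokens = nltk.word_tokenize(segment)
    return nltk.pos_tag(tokens)

print(preprocess("The pump must be primed before start-up."))
# e.g. [('The', 'DT'), ('pump', 'NN'), ('must', 'MD'), ('be', 'VB'), ...]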

This is not a comprehensive list of all cleaning steps, but it will give users an idea of the kind of material that is extracted for human approval before re-entering the training cycle. All segments detected as “suspect” are stripped out of the training set for human approval, revision or editing in TMX format and then re-entered into the system.
1. Segments with a significant difference in length between source and target
Generally, we consider a sentence pair “suspect” when one side is more than 50% longer than the other, but this threshold can be adjusted to your particular needs (Czech, for example, is usually shorter than English, and French being 25% or 30% longer than English is not in itself a sign that anything is wrong).

2. Segments where the source or target contains typographical symbols missing in the other, such as [ ], *, +, =.

3. Segments where source and target are identical.

4. “Empty segments”, i.e. segments with a source but no target.

5. Segments containing particular names or expressions that are part of the client’s preferred terminology.

All these are candidates for human revision.
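As a rough illustration of how checks 1–5 can be automated, here is a short Python sketch. The 50% length threshold, the symbol set and the terminology list ("ACME") are assumptions chosen for the example, not PangeaMT’s exact parameters; flagged pairs would then be exported (e.g. as TMX) for human revision.

SYMBOLS = set("[]*+=")        # check 2: typographical symbols to compare
PREFERRED_TERMS = {"ACME"}    # check 5: client's preferred terminology (example term)

def find_suspects(pairs, max_len_ratio=1.5):
    """Return (source, target, reason) triples for segment pairs needing human review."""
    suspects = []
    for src, tgt in pairs:
        if not tgt.strip():                                             # 4: empty target
            suspects.append((src, tgt, "empty target"))
        elif src.strip() == tgt.strip():                                # 3: identical
            suspects.append((src, tgt, "source equals target"))
        elif max(len(src), len(tgt)) > max_len_ratio * min(len(src), len(tgt)):
            suspects.append((src, tgt, "length difference over 50%"))   # 1: length
        elif {c for c in src if c in SYMBOLS} != {c for c in tgt if c in SYMBOLS}:
            suspects.append((src, tgt, "mismatched symbols"))           # 2: symbols
        elif any(term in src or term in tgt for term in PREFERRED_TERMS):
            suspects.append((src, tgt, "contains preferred terminology"))  # 5: terms
    return suspects

pairs = [("Press OK.", ""), ("Error [E42]", "Fehler E42"), ("Start", "Start")]
for src, tgt, reason in find_suspects(pairs):
    print(f"{reason}: {src!r} -> {tgt!r}")

In a production cycle, each flagged pair would carry its reason with it, so reviewers can approve, edit or discard segments before the data re-enters training.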

This is one of the things that sets PangeaMT apart from other offerings: we will train you and provide you with the tools so that you become your own master in future re-trainings.
Clean data is the route to quality input and thus improved engine performance. The old translation saying applies: garbage in, garbage out. Thanks to our cleaning routines, you can rest assured that you will own a system that strips out any “dubious” material for your consideration.

And even after installation, please remember you have a full year of free support. If you see any odd results, or there are patterns you would like to apply or correct, we are here to help. This is not a black-box system, nor a company selling words or engines. Our model is “user empowerment”, i.e. technology transfer.