Q1 – How many words do I need to build a good engine?

Most people will tell you that 2 million words is the bare minimum for a “bare bones” engine and some kind of automation within a domain. Do not expect great results at that size if you are dealing with texts that may include a lot of new, unexpected words, as in economics or journalism. If you are dealing with a highly controlled language and have little variation on your theme (technical manuals, set documentation packages, etc.), feed in as much text as you can.

Typical PangeaMT developments within domains (software, electronics, automotive, engineering, tourism) have started at 5M words. There are several ways to increase the number of words by gathering reliable parallel texts, and PangeaMT offers consultancy and guidance so you can start an engine with as many words as possible. We call an engine with 15M or 20M words “mature” within a domain, because it is likely to have most of the terminology, vocabulary and expressions required for that language domain. Do not despair if you do not have that much data. The important thing is to get the engine started. You can add post-edited material and other texts that you gather with experience in later re-trainings.
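As a rough way to see where an existing corpus stands against the figures quoted above, the source-side word count can be checked with a few lines of Python. This is only a sketch: the function names and tier labels are hypothetical, words are counted by simple whitespace splitting (which will not suit languages written without spaces), and the thresholds are the 2M, 5M and 15M figures mentioned in this answer, not a PangeaMT tool or rule.

```python
from typing import Iterable

# Thresholds taken from the figures quoted in this answer.
# The tier labels below are illustrative, not official terminology.
BARE_MINIMUM = 2_000_000
TYPICAL_START = 5_000_000
MATURE = 15_000_000


def count_words(segments: Iterable[str]) -> int:
    """Count whitespace-separated tokens across source-side segments."""
    return sum(len(segment.split()) for segment in segments)


def corpus_tier(word_count: int) -> str:
    """Map a word count to a rough readiness tier (assumed labels)."""
    if word_count < BARE_MINIMUM:
        return "below bare minimum"
    if word_count < TYPICAL_START:
        return "bare bones"
    if word_count < MATURE:
        return "typical starting point"
    return "mature"
```

To use it, pass the source-language segments of your translation memories or aligned files to `count_words` and feed the result to `corpus_tier`. The count says nothing about quality or domain fit, so treat the tier as a first indication only.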

There has been much argument about the “unreasonable effectiveness of massive amounts of data” versus “smaller amounts of well-selected data”. Many people considering their first MT development are unsure whether to put in as much text as possible (massive amounts of data) or to select the most accurate bilingual texts possible, even if that means dealing with smaller sets of data. Our experience points in two directions:

a) If you are trying to build a generalist type of engine, capable of translating the unexpected (from news articles to economics papers and literature), gather as much data as possible. You are trying to build a system that caters for sunny days and rainy days alike. No number will ever be enough. Sooner or later, you will need to build some kind of syntactical aids into it.

b) If you are trying to build an engine that will fit your particular language field and needs (or even if you want an engine that understands your products and services, but also some financial information and legal matters), you do not need trillions of words of literature. In this case, gathering as much data as possible from your organization (or similar ones) seems more reasonable and worth the effort.

Either way, do not underestimate the effort and teamwork required during the data-gathering stages. They are essential for training the engine well, and thus for its results. Data gathering also marks the beginning of change in your adoption of MT technologies and is a good chance to involve stakeholders in the process.