Your engines will be built with material that you provide to PangeaMT for the training. Otherwise, we can use generic material that we have in most language combinations. As of September 2019, we have 4.5Bn aligned sentences in over 80 languages – that is 3Bn more sentences for machine learning than in 2018, as reported in Slator.
PangeaMT will use this material to refine a language model for your particular case (i.e. an engine that speaks pharma like a bilingual EN/FR speaker, or an engine that speaks like a bilingual German engineer, etc.). Depending on the particular field and the size of your bilingual data, more content may be required or it may need to be generated. Thus, the first engine, good as it will be, is at what we call “Stage 1” (really, we call it teenage). Once you provide us with more information (typically a TMX file with previous translations or post-edited content), we re-train the engine with more of the material it is intended to translate. This means that the engine gives more and more preference to certain expressions and word combinations.
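To show what feeding a TMX file into re-training involves, here is a minimal sketch that extracts aligned sentence pairs from TMX content using only the Python standard library. The function name and the sample snippet are illustrative assumptions, not PangeaMT's actual pipeline.

```python
import xml.etree.ElementTree as ET

def extract_pairs(tmx_text, src_lang="en", tgt_lang="fr"):
    """Extract aligned (source, target) sentence pairs from TMX content.
    Illustrative helper only; not PangeaMT's internal tooling."""
    root = ET.fromstring(tmx_text)
    pairs = []
    for tu in root.iter("tu"):
        segs = {}
        for tuv in tu.iter("tuv"):
            # TMX stores the language code in the xml:lang attribute
            lang = tuv.get("{http://www.w3.org/XML/1998/namespace}lang", "").lower()
            seg = tuv.find("seg")
            if seg is not None:
                segs[lang] = "".join(seg.itertext()).strip()
        if src_lang in segs and tgt_lang in segs:
            pairs.append((segs[src_lang], segs[tgt_lang]))
    return pairs

# A tiny hypothetical TMX fragment, e.g. from a pharma translation memory
sample = """<tmx version="1.4"><header/><body>
  <tu>
    <tuv xml:lang="en"><seg>Take one tablet daily.</seg></tuv>
    <tuv xml:lang="fr"><seg>Prendre un comprime par jour.</seg></tuv>
  </tu>
</body></tmx>"""

print(extract_pairs(sample))
# → [('Take one tablet daily.', 'Prendre un comprime par jour.')]
```

Pairs extracted this way become additional bilingual training material for the re-training round.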
PangeaMT reached 1.2Bn aligned sentences for machine learning in 2018 and 4.5Bn in 2019. Gathering huge resources for machine learning helps it create near-human quality machine translation engines with little client text input.
In-domain material is usually added at the beginning and at the end of the neural engine training cycle. This ensures that the algorithm picks up the nuances and characteristics of the domain, language and field it will translate. This is particularly true for material added at the end of the training cycle (the last epoch), which is highly prioritized and thus serves as a “domain and style filter”.
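The scheduling idea above can be sketched as a simple per-epoch data plan: in-domain material joins the generic corpus in the first epoch and is oversampled in the last one. This is an illustrative sketch only; the function, the oversampling factor and the epoch counts are assumptions, not PangeaMT's actual training configuration.

```python
def build_epoch_schedule(generic, in_domain, epochs=5):
    """Return the training data for each epoch, adding in-domain
    material at the start and oversampling it at the end so the
    final epoch acts as a "domain and style filter".
    Illustrative only; factor of 3 is an assumed value."""
    schedule = []
    for epoch in range(epochs):
        if epoch == 0:
            data = generic + in_domain        # seed the domain early
        elif epoch == epochs - 1:
            data = generic + in_domain * 3    # heavily weight the last epoch
        else:
            data = list(generic)              # generic material in between
        schedule.append(data)
    return schedule

generic = ["generic sentence pair"] * 4
in_domain = ["pharma sentence pair"] * 2
sched = build_epoch_schedule(generic, in_domain, epochs=3)
print([len(epoch_data) for epoch_data in sched])  # → [6, 4, 10]
```

Because the last epoch's data dominates the final weight updates, the oversampled in-domain sentences pull the engine toward the client's terminology and style.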