Ideally, your customized engine(s) should contain only your own data, so that no noisy material perturbs your writing or company style. In reality, few organizations have enough data available. Data gathering, and consultancy on how to obtain more relevant data, have become a favorite sport among SMT developers.
As part of our consultancy services, PangeaMT can add more muscle to your initial set of data so that a large linguistic corpus enters the training (we most probably have quite a bit of material to build a Language Model, or to tune one of our Language Models closer to your style). All the data we add will be relevant to your subject field, and the engines will be tested with and without it so you can check the effect of the additional data on your development. (You can find an abridged version of what a test can look like in our October 2009 news. This was part of a free test for several organizations.)
Generally speaking, it is assumed that the more data, the better. There has been some controversy as to whether smaller and cleaner sets of data provide higher accuracy. This will depend largely on your application: whether "world awareness" is required by your system, or whether you are running an engine for a very specific domain. 2M words of civil engineering data will probably have little impact if you are building a system for a software company fighting viruses, or a medical engine fighting a very different kind of virus. It is a common mistake to add more and more data on the assumption that it will be useful at some point; our studies conclude that if that data is not likely to be needed or recalled, it is better to leave it as part of your Language Model.
In short, there is no way to guarantee that statistics will work one way or another (that is precisely the point of statistics: they analyze the chances of something happening). If the system is too broad, pre- and post-processing systems can be built (in a kind of hybridization) to "fix" or "force" certain expressions. There are other ways of working towards higher chances, as can be done with the combined engine method or combined hypotheses (i.e. combining parts of likely outputs with high certainty to remake sentences, which the engine then reprocesses). So far, we have heard of good experiences from post-editors using the same terminology tools as with CAT tools to check terminology consistency.
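The "fix or force" idea above can be illustrated with a simple post-processing pass over engine output. The sketch below assumes a glossary-based replacement rule; the glossary entries and function name are illustrative only, not PangeaMT's actual implementation:

```python
import re

# Hypothetical client glossary mapping raw engine output terms
# to the terms the client's style guide requires.
GLOSSARY = {
    "virus": "malware",
    "screen": "display",
}

def force_terminology(sentence, glossary):
    """Post-process MT output, replacing whole-word glossary matches."""
    for source, target in glossary.items():
        # \b anchors keep the match to whole words only.
        sentence = re.sub(rf"\b{re.escape(source)}\b", target, sentence)
    return sentence

print(force_terminology("The screen shows a virus alert.", GLOSSARY))
# → The display shows a malware alert.
```

A real post-processing layer would of course need to handle inflection, casing, and context, which is why such rules are usually combined with the terminology-checking tools mentioned above rather than used alone.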