Data cleaning
Do more with less data
We process and prepare your noisy data to improve MT engine performance.

More data is good, but clean data is always better. Cleaned and correctly processed data is what makes the difference. Clean data can mean different things: removing data bias, ensuring better linguistic quality, or filtering data for specific customized training.

Improve language quality
Data cleaning has an immediate and measurable impact on the output quality of MT engines from a purely linguistic perspective. TAUS applies automatic and supervised cleaning steps to optimize data for MT training, including segment deduplication, language identification in mixed-language datasets, advanced sentence embeddings, alignment checks and heuristic rules. These cleaning steps are well suited to transforming legacy datasets, such as translation memories.
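To make two of these steps concrete, here is a minimal Python sketch of deduplication combined with a few simple heuristic rules. It is an illustration only, not the TAUS pipeline: the function names `normalize` and `clean_pairs`, the `max_length_ratio` threshold of 3.0 and the "target copies source" rule are all assumptions chosen for the example.

```python
import unicodedata


def normalize(text: str) -> str:
    """Normalize Unicode, collapse whitespace and ignore case for comparison."""
    return " ".join(unicodedata.normalize("NFKC", text).split()).casefold()


def clean_pairs(pairs, max_length_ratio=3.0):
    """Drop empty, untranslated, length-mismatched and duplicate segment pairs.

    `pairs` is an iterable of (source, target) strings. The threshold is an
    illustrative assumption and would need tuning per language pair.
    """
    seen = set()
    kept = []
    for source, target in pairs:
        src, tgt = source.strip(), target.strip()
        # Heuristic rule 1: both sides must contain text.
        if not src or not tgt:
            continue
        key = (normalize(src), normalize(tgt))
        # Heuristic rule 2: a target identical to its source is likely untranslated.
        if key[0] == key[1]:
            continue
        # Heuristic rule 3: very different lengths suggest a misaligned pair.
        ratio = max(len(src), len(tgt)) / min(len(src), len(tgt))
        if ratio > max_length_ratio:
            continue
        # Deduplication: keep only the first occurrence of each normalized pair.
        if key in seen:
            continue
        seen.add(key)
        kept.append((src, tgt))
    return kept


sample = [
    ("Hello world", "Hallo Welt"),
    ("hello  world", "Hallo Welt"),   # duplicate after normalization
    ("", "Leerer Quelltext"),         # empty source
    ("OK", "This is a very long and unrelated target sentence"),
    ("Save file", "Save file"),       # target copies the source
]
cleaned = clean_pairs(sample)
```

With this sample, only the first pair survives. A character-length ratio is a rough check; language identification and embedding-based alignment scoring would need dedicated models and are left out of this sketch.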
Customized training
MT engines may need to be adapted for special use cases, like customer support or product upgrades. For these customized training tasks, data cleaning can filter data by grammatical category or tag and remove out-of-domain text. Custom corpora can also be built by selecting shorter sentences or segments that contain specific keywords and vocabulary.
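As a rough illustration of corpus selection by sentence length and keywords, the Python sketch below keeps only short segments that mention at least one domain term. The function name `build_custom_corpus`, the `max_tokens` limit and the sample keywords are hypothetical, and the simple word tokenization is an assumption that will not suit every language.

```python
import re


def build_custom_corpus(segments, keywords, max_tokens=20):
    """Select segments that are short enough and contain a domain keyword."""
    keyword_set = {keyword.casefold() for keyword in keywords}
    selected = []
    for segment in segments:
        # Naive word tokenization; real pipelines use language-aware tokenizers.
        tokens = re.findall(r"\w+", segment.casefold())
        if not tokens or len(tokens) > max_tokens:
            continue
        if keyword_set.intersection(tokens):
            selected.append(segment)
    return selected


segments = [
    "Please restart the router to restore your connection.",
    "The quarterly report was published yesterday.",
    "Refund requests are handled by our support team within two days after "
    "the original order has been reviewed and approved by a manager.",
]
support_corpus = build_custom_corpus(segments, ["router", "refund"], max_tokens=12)
```

Here only the first segment is selected: the second contains no keyword and the third exceeds the token limit. Filtering by grammatical category or tag would follow the same pattern, with a tagger supplying the labels to match against.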
Remove data bias
Data cleaning may also be applied to remove data bias and to filter outdated cultural annotations or salutations out of legacy data.