Lisa is a Data Curator on the NLP Team at TAUS. Drawing on her background in linguistics and her experience in the translation industry, she helps TAUS optimize its data offering and create new data solutions.
Acquiring high-quality parallel corpora is essential for training high-performing MT engines. Several multilingual corpora are publicly available, such as the proceedings of the European Parliament (Europarl) or transcribed TED Talks (OPUS). Owing to their size and confirmed high quality, researchers have used them as sources of large-scale parallel language data.