Post-editing corpus English to Dutch, French, Portuguese, legal domain
Training data for building quality estimation and automatic post-editing models (Ive et al. 2020). The data consist of tuples (source sentence, machine translation output, manual post-edit, independent reference translation) for three European language pairs. The data cover the following domains: online dispute resolution, procurement and justice. Number of tuples per language pair:
- English-Dutch: 11249
- English-French: 9989
- English-Portuguese: 10165
The machine translation output was produced with neural MT systems. All data were anonymized by replacing person names and contact information. The tuples were randomized. The data are distributed as a set of four files per language pair (one for each element in the tuple).
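As an illustration of the four-file layout, the following Python sketch reassembles the tuples for one language pair. It rests on assumptions not stated in this description: that each file is UTF-8 text with one segment per line, and that the four files are aligned by line number. The function name, the field names and the path arguments are hypothetical; substitute the actual file names from the download.

```python
from pathlib import Path
from typing import List, NamedTuple


class PostEditTuple(NamedTuple):
    """One corpus entry; the field names are illustrative, not official."""
    source: str
    mt_output: str
    post_edit: str
    reference: str


def read_lines(path) -> List[str]:
    # Assumption: UTF-8 text, one segment per line.
    with Path(path).open(encoding="utf-8") as handle:
        return [line.rstrip("\n") for line in handle]


def read_tuples(source_path, mt_path, post_edit_path, reference_path) -> List[PostEditTuple]:
    """Combine four line-aligned files into a list of tuples."""
    columns = [
        read_lines(path)
        for path in (source_path, mt_path, post_edit_path, reference_path)
    ]
    # Assumption: line i of every file belongs to the same tuple,
    # so all four files must have the same number of lines.
    lengths = [len(column) for column in columns]
    if len(set(lengths)) != 1:
        raise ValueError(f"files are not line-aligned: {lengths}")
    return [PostEditTuple(*row) for row in zip(*columns)]
```

If the assumptions hold, the length of the returned list for a language pair should match the tuple count given above, which is a quick check that the files were read correctly.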
If you use any part of the corpus in your own work, please cite the following paper:
Julia Ive, Lucia Specia, Sara Szoc, Tom Vanallemeersch, Joachim Van den Bogaert, Eduardo Farah, Christine Maroti, Artur Ventura and Maxim Khalilov, 2020, “A Post-Editing Dataset in the Legal Domain: Do we Underestimate Neural Machine Translation Quality?” Proceedings of the 12th Language Resources and Evaluation Conference, pp. 3692–3697.
Also available via ELRC-SHARE under the short name APE-QUEST postedition tuples.
The Quality Gate’s user interface is an adaptation by the project consortium of the open-source tool MateCat. It lets the user request the MT output for a sentence and shows the quality estimation score and, if applicable, the automatic post-editing output. The adaptation can be downloaded from https://github.com/CrossLangNV/MateCat.