Summary of the paper

Title Data Collection and IPR in Multilingual Parallel Corpora. Dutch Parallel Corpus
Authors Orphée De Clercq and Maribel Montero Perez
Abstract After three years of work the Dutch Parallel Corpus (DPC) project has reached an end. The finalized corpus is a ten-million-word high-quality sentence-aligned bidirectional parallel corpus of Dutch, English and French, with Dutch as central language. In this paper we present the corpus and try to formulate some basic data collection principles, based on the work that was carried out for the project. Building a corpus is a difficult and time-consuming task, especially when every text sample included has to be cleared from copyrights. The DPC is balanced according to five text types (literature, journalistic texts, instructive texts, administrative texts and texts treating external communication) and four translation directions (Dutch-English, English-Dutch, Dutch-French and French-Dutch). All the text material was cleared from copyrights. The data collection process necessitated the involvement of different text providers, which resulted in drawing up four different licence agreements. Problems such as an unknown source language, copyright issues and changes to the corpus design are discussed in close detail and illustrated with examples so as to be of help to future corpus compilers.
Language Other
Topics Acquisition, Corpus (creation, annotation, etc.), Other
Full paper Data Collection and IPR in Multilingual Parallel Corpora. Dutch Parallel Corpus
Bibtex @InProceedings{DECLERCQ10.204,
  author = {Orphée De Clercq and Maribel Montero Perez},
  title = {Data Collection and IPR in Multilingual Parallel Corpora. Dutch Parallel Corpus},
  booktitle = {Proceedings of the Seventh conference on International Language Resources and Evaluation (LREC'10)},
  year = {2010},
  month = {may},
  date = {19-21},
  address = {Valletta, Malta},
  editor = {Nicoletta Calzolari (Conference Chair), Khalid Choukri, Bente Maegaard, Joseph Mariani, Jan Odijk, Stelios Piperidis, Mike Rosner, Daniel Tapias},
  publisher = {European Language Resources Association (ELRA)},
  isbn = {2-9517408-6-7},
  language = {english}
 }