Summary of the paper

Title Belgisch Staatsblad Corpus: Retrieving French-Dutch Sentences from Official Documents
Authors Tom Vanallemeersch
Abstract We describe the compilation of a large corpus of French-Dutch sentence pairs from official Belgian documents which are available in the online version of the publication Belgisch Staatsblad/Moniteur belge, and which have been published between 1997 and 2006. After downloading files in batch, we filtered out documents which have no translation in the other language, documents which contain several languages (by checking on discriminating words), and pairs of documents with a substantial difference in length. We segmented the documents into sentences and aligned the latter, which resulted in 5 million sentence pairs (only one-to-one links were included in the parallel corpus); there are 2.4 million unique pairs. Sample-based evaluation of the sentence alignment results indicates a near 100% accuracy, which can be explained by the text genre, the procedure filtering out weakly parallel articles and the restriction to one-to-one links. The corpus is larger than a number of well-known French-Dutch resources. It is made available to the community. Further investigation is needed in order to determine the original language in which documents were written.
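The document filtering described in the abstract can be illustrated with a minimal Python sketch. It is not the author's implementation: the word lists, the thresholds (`max_other_share`, `max_ratio`) and all function names are hypothetical, chosen only to show how a missing translation, a mixed-language document, or a large length difference could each cause a document pair to be dropped.

```python
# Hypothetical discriminating words: frequent in one language, absent in the other.
# The paper's actual word lists and thresholds are not given in this summary.
FRENCH_WORDS = {"le", "la", "les", "et", "du", "est", "dans", "pour"}
DUTCH_WORDS = {"het", "een", "van", "niet", "voor", "wordt", "door"}


def count_hits(text, words):
    """Count whitespace-separated tokens of `text` that occur in `words`."""
    return sum(1 for token in text.lower().split() if token in words)


def is_monolingual(text, expected, other, max_other_share=0.2):
    """True if few of the discriminating words found belong to the other language."""
    expected_hits = count_hits(text, expected)
    other_hits = count_hits(text, other)
    total = expected_hits + other_hits
    if total == 0:
        return False  # no evidence either way; treat as unusable
    return other_hits / total <= max_other_share


def similar_length(fr_text, nl_text, max_ratio=1.5):
    """True if the character lengths differ by at most a factor of `max_ratio`."""
    shorter, longer = sorted((len(fr_text), len(nl_text)))
    if shorter == 0:
        return False
    return longer / shorter <= max_ratio


def keep_pair(fr_text, nl_text):
    """Apply the three filters: missing translation, mixed languages, length gap."""
    if fr_text is None or nl_text is None:
        return False
    return (
        is_monolingual(fr_text, FRENCH_WORDS, DUTCH_WORDS)
        and is_monolingual(nl_text, DUTCH_WORDS, FRENCH_WORDS)
        and similar_length(fr_text, nl_text)
    )
```

In this sketch a pair survives only if both sides pass all checks; sentence segmentation and alignment would then run on the surviving pairs.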
Language Language Identification
Topics Corpus (creation, annotation, etc.), Multilinguality, Language Identification
Full paper Belgisch Staatsblad Corpus: Retrieving French-Dutch Sentences from Official Documents
Bibtex @InProceedings{VANALLEMEERSCH10.758,
  author = {Tom Vanallemeersch},
  title = {Belgisch Staatsblad Corpus: Retrieving French-Dutch Sentences from Official Documents},
  booktitle = {Proceedings of the Seventh conference on International Language Resources and Evaluation (LREC'10)},
  year = {2010},
  month = {may},
  date = {19-21},
  address = {Valletta, Malta},
  editor = {Nicoletta Calzolari (Conference Chair), Khalid Choukri, Bente Maegaard, Joseph Mariani, Jan Odijk, Stelios Piperidis, Mike Rosner, Daniel Tapias},
  publisher = {European Language Resources Association (ELRA)},
  isbn = {2-9517408-6-7},
  language = {english}
 }