Summary of the paper

Title Evaluating Utility of Data Sources in a Large Parallel Czech-English Corpus CzEng 0.9
Authors Ondřej Bojar, Adam Liška and Zdeněk Žabokrtský
Abstract CzEng 0.9 is the third release of a large parallel corpus of Czech and English. For the current release, CzEng was extended by a significant amount of texts from various types of sources, including parallel web pages, electronically available books and subtitles. This paper describes and evaluates filtering techniques employed in the process in order to avoid misaligned or otherwise damaged parallel sentences in the collection. We estimate the precision and recall of two sets of filters. The first set was used to process the data before their inclusion into CzEng. The filters from the second set were newly created to improve the filtering process for future releases of CzEng. Given the overall amount and variance of sources of the data, our experiments illustrate the utility of parallel data sources with respect to extractable parallel segments. As a similar behaviour can be expected for other language pairs, our results can be interpreted as guidelines indicating which sources other researchers should exploit first.
Language Machine Translation, SpeechToSpeech Translation
Topics Corpus (creation, annotation, etc.), Evaluation methodologies, Machine Translation, SpeechToSpeech Translation
Full paper Evaluating Utility of Data Sources in a Large Parallel Czech-English Corpus CzEng 0.9
Bibtex @InProceedings{BOJAR10.642,
  author = {Ondřej Bojar and Adam Liška and Zdeněk Žabokrtský},
  title = {Evaluating Utility of Data Sources in a Large Parallel Czech-English Corpus CzEng 0.9},
  booktitle = {Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)},
  year = {2010},
  month = {may},
  date = {19-21},
  address = {Valletta, Malta},
  editor = {Nicoletta Calzolari (Conference Chair) and Khalid Choukri and Bente Maegaard and Joseph Mariani and Jan Odijk and Stelios Piperidis and Mike Rosner and Daniel Tapias},
  publisher = {European Language Resources Association (ELRA)},
  isbn = {2-9517408-6-7},
  language = {english}
 }