Summary of the paper

Title Data Issues in English-to-Hindi Machine Translation
Authors Ondřej Bojar, Pavel Straňák and Daniel Zeman
Abstract Statistical machine translation to morphologically richer languages is a challenging task and more so if the source and target languages differ in word order. Current state-of-the-art MT systems thus deliver mediocre results. Adding more parallel data often helps improve the results; if it doesn't, it may be caused by various problems such as different domains, bad alignment or noise in the new data. In this paper we evaluate the English-to-Hindi MT task from this data perspective. We discuss several available parallel data sources and provide cross-evaluation results on their combinations using two freely available statistical MT systems. We demonstrate various problems encountered in the data and describe automatic methods of data cleaning and normalization. We also show that the contents of two independently distributed data sets can unexpectedly overlap, which negatively affects translation quality. Together with the error analysis, we also present a new tool for viewing aligned corpora, which makes it easier to detect difficult parts in the data even for a developer not speaking the target language.
Language Corpus (creation, annotation, etc.)
Topics Machine Translation, SpeechToSpeech Translation, Evaluation methodologies, Corpus (creation, annotation, etc.)
Full paper Data Issues in English-to-Hindi Machine Translation
Bibtex @InProceedings{BOJAR10.756,
  author = {Ondřej Bojar and Pavel Straňák and Daniel Zeman},
  title = {Data Issues in English-to-Hindi Machine Translation},
  booktitle = {Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)},
  year = {2010},
  month = {may},
  date = {19-21},
  address = {Valletta, Malta},
  editor = {Nicoletta Calzolari (Conference Chair) and Khalid Choukri and Bente Maegaard and Joseph Mariani and Jan Odijk and Stelios Piperidis and Mike Rosner and Daniel Tapias},
  publisher = {European Language Resources Association (ELRA)},
  isbn = {2-9517408-6-7},
  language = {english}
 }