Summary of the paper

Title Using Comparable Corpora to Adapt a Translation Model to Domains
Authors Hiroyuki Kaji, Takashi Tsunakawa and Daisuke Okada
Abstract Statistical machine translation (SMT) requires a large parallel corpus, which is available only for restricted language pairs and domains. To expand the language pairs and domains to which SMT is applicable, we created a method for estimating translation pseudo-probabilities from bilingual comparable corpora. The essence of our method is to calculate pairwise correlations between the words associated with a source-language word, presently restricted to a noun, and its translations; word translation pseudo-probabilities are calculated based on the assumption that the more associated words a translation is correlated with, the higher its translation probability. We also describe a method we created for calculating noun-sequence translation pseudo-probabilities based on occurrence frequencies of noun sequences and constituent-word translation pseudo-probabilities. Then, we present a framework for merging the translation pseudo-probabilities estimated from in-domain comparable corpora with a translation model learned from an out-of-domain parallel corpus. Experiments using Japanese and English comparable corpora of scientific paper abstracts and a Japanese-English parallel corpus of patent abstracts showed promising results; the BLEU score was improved to some degree by incorporating the pseudo-probabilities estimated from the in-domain comparable corpora. Future work includes an optimization of the parameters and an extension to estimate translation pseudo-probabilities for verbs.
Language Word Sense Disambiguation
Topics Machine Translation, SpeechToSpeech Translation, Statistical and machine learning methods, Word Sense Disambiguation
Full paper Using Comparable Corpora to Adapt a Translation Model to Domains
Bibtex @InProceedings{KAJI10.443,
  author = {Hiroyuki Kaji and Takashi Tsunakawa and Daisuke Okada},
  title = {Using Comparable Corpora to Adapt a Translation Model to Domains},
  booktitle = {Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)},
  year = {2010},
  month = {may},
  date = {19-21},
  address = {Valletta, Malta},
  editor = {{Nicoletta Calzolari (Conference Chair)} and Khalid Choukri and Bente Maegaard and Joseph Mariani and Jan Odijk and Stelios Piperidis and Mike Rosner and Daniel Tapias},
  publisher = {European Language Resources Association (ELRA)},
  isbn = {2-9517408-6-7},
  language = {english}
 }