LREC 2010 Proceedings

Summary of the paper

Title	Adapting Chinese Word Segmentation for Machine Translation Based on Short Units
Authors	Yiou Wang, Kiyotaka Uchimoto, Jun’ichi Kazama, Canasai Kruengkrai and Kentaro Torisawa
Abstract	In Chinese texts, words composed of single or multiple characters are not separated by spaces, unlike most western languages. Therefore Chinese word segmentation is considered an important first step in machine translation (MT) and its performance impacts MT results. Many factors affect Chinese word segmentations, including the segmentation standards and segmentation strategies. The performance of a corpus-based word segmentation model depends heavily on the quality and the segmentation standard of the training corpora. However, we observed that existing manually annotated Chinese corpora tend to have low segmentation granularity and provide poor morphological information due to the present segmentation standards. In this paper, we introduce a short-unit standard of Chinese word segmentation, which is particularly suitable for machine translation, and propose a semi-automatic method of transforming the existing corpora into the ones that can satisfy our standards. We evaluate the usefulness of our approach on the basis of translation tasks from the technology newswire domain and the scientific paper domain, and demonstrate that it significantly improves the performance of Chinese-Japanese machine translation (over 1.0 BLEU increase).
Language	Standards for LRs
Topics	Machine Translation, SpeechToSpeech Translation, Parsing, Standards for LRs
Full paper	Adapting Chinese Word Segmentation for Machine Translation Based on Short Units
Bibtex	@InProceedings{WANG10.83, author = {Yiou Wang and Kiyotaka Uchimoto and Jun’ichi Kazama and Canasai Kruengkrai and Kentaro Torisawa}, title = {Adapting Chinese Word Segmentation for Machine Translation Based on Short Units}, booktitle = {Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)}, year = {2010}, month = {may}, date = {19-21}, address = {Valletta, Malta}, editor = {{Nicoletta Calzolari (Conference Chair)} and Khalid Choukri and Bente Maegaard and Joseph Mariani and Jan Odijk and Stelios Piperidis and Mike Rosner and Daniel Tapias}, publisher = {European Language Resources Association (ELRA)}, isbn = {2-9517408-6-7}, language = {english} }