Summary of the paper

Title How Large a Corpus Do We Need: Statistical Method Versus Rule-based Method
Authors Hai Zhao, Yan Song and Chunyu Kit
Abstract We investigate the impact of input data scale on corpus-based learning, in the style of a Zipf's law study. Chinese word segmentation is chosen as the study case, and a series of experiments is specially conducted for it, in which two types of segmentation techniques, statistical learning and rule-based methods, are examined. The empirical results show that a linear performance improvement in statistical learning requires at least an exponential increase in training corpus size. As for the rule-based method, an approximate negative inverse relationship between performance and the size of the input lexicon can be observed. (A minimal illustration of these two trends is sketched after the BibTeX entry below.)
Language Chinese
Topics Corpus (creation, annotation, etc.), Statistical and machine learning methods
Full paper How Large a Corpus Do We Need: Statistical Method Versus Rule-based Method
Bibtex @InProceedings{ZHAO10.199,
  author = {Hai Zhao and Yan Song and Chunyu Kit},
  title = {How Large a Corpus Do We Need: Statistical Method Versus Rule-based Method},
  booktitle = {Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)},
  year = {2010},
  month = {may},
  date = {19-21},
  address = {Valletta, Malta},
  editor = {Nicoletta Calzolari (Conference Chair) and Khalid Choukri and Bente Maegaard and Joseph Mariani and Jan Odijk and Stelios Piperidis and Mike Rosner and Daniel Tapias},
  publisher = {European Language Resources Association (ELRA)},
  isbn = {2-9517408-6-7},
  language = {english}
 }
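To make the two reported trends concrete, here is a minimal Python sketch that fits the functional forms implied by the abstract: F(N) ~ a + b * ln(N) for statistical learning (a linear F-score gain per exponential growth of the training corpus) and F(L) ~ a - b / L for the rule-based method (a negative inverse relationship with lexicon size). All sizes and F-scores below are invented placeholders for illustration, not data or results from the paper.

```python
# Hypothetical sketch of the two scaling trends stated in the abstract.
# Every number here is invented for demonstration purposes.
import numpy as np

# Statistical learning: a linear F-score gain per *exponential* growth of
# the training corpus is equivalent to F(N) ~ a + b * ln(N).
train_sizes = np.array([1e4, 1e5, 1e6, 1e7])   # training tokens (hypothetical)
f_stat = np.array([0.80, 0.85, 0.90, 0.94])    # F-scores (hypothetical)
A = np.column_stack([np.ones_like(train_sizes), np.log(train_sizes)])
(a1, b1), *_ = np.linalg.lstsq(A, f_stat, rcond=None)
print(f"statistical: F(N) ~ {a1:.3f} + {b1:.4f} * ln(N)")

# Rule-based method: an approximate negative inverse relationship between
# performance and lexicon size, i.e. F(L) ~ a - b / L. The design column
# is -1/L so the fitted b comes out positive.
lex_sizes = np.array([1e3, 5e3, 2e4, 1e5])     # lexicon entries (hypothetical)
f_rule = np.array([0.70, 0.84, 0.88, 0.90])    # F-scores (hypothetical)
B = np.column_stack([np.ones_like(lex_sizes), -1.0 / lex_sizes])
(a2, b2), *_ = np.linalg.lstsq(B, f_rule, rcond=None)
print(f"rule-based:  F(L) ~ {a2:.3f} - {b2:.1f} / L")
```

Both models are linear in their parameters, so ordinary least squares suffices; under the first form, doubling the F-score gain requires squaring the corpus size, which is the paper's central point about the cost of statistical learning.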