Title
How Large a Corpus Do We Need: Statistical Method Versus Rule-based Method
Authors |
Hai Zhao, Yan Song and Chunyu Kit |
Abstract |
We investigate the impact of input data scale in corpus-based learning using a study style of Zipf's law. Chinese word segmentation is chosen as the case study, and a series of experiments is specially conducted for it, in which two types of segmentation techniques, statistical learning and rule-based methods, are examined. The empirical results show that a linear performance improvement in statistical learning requires at least an exponential increase in training corpus size. As for the rule-based method, an approximate negative inverse relationship between performance and the size of the input lexicon can be observed.
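The abstract's two scaling claims can be made concrete: a linear F-score gain that requires exponentially more data is equivalent to performance growing logarithmically in corpus size, and the rule-based trend is an error term inversely proportional to lexicon size. The sketch below is purely illustrative, with hypothetical coefficients (`a`, `b`, `c`) not taken from the paper:

```python
import math

def f_score_statistical(n_tokens, a=0.70, b=0.02):
    """Hypothetical statistical-learning curve: F ~ a + b * log2(corpus size).
    Doubling the corpus adds only the constant increment b, so each fixed
    linear gain in F costs an exponential increase in data."""
    return a + b * math.log2(n_tokens)

def error_rule_based(lexicon_size, c=5000.0):
    """Hypothetical negative inverse relationship: error ~ c / lexicon size."""
    return c / lexicon_size

# Doubling the corpus from 1M to 2M tokens yields only the constant gain b.
gain = f_score_statistical(2_000_000) - f_score_statistical(1_000_000)
print(round(gain, 4))
```

Under these toy assumptions, every doubling of the training corpus buys the same small F-score increment, matching the "linear improvement needs exponential data" reading of the abstract.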
Language
English
Topics |
Corpus (creation, annotation, etc.), Statistical and machine learning methods
Full paper
How Large a Corpus Do We Need: Statistical Method Versus Rule-based Method
Bibtex |
@InProceedings{ZHAO10.199,
  author = {Hai Zhao and Yan Song and Chunyu Kit},
  title = {How Large a Corpus Do We Need: Statistical Method Versus Rule-based Method},
  booktitle = {Proceedings of the Seventh conference on International Language Resources and Evaluation (LREC'10)},
  year = {2010},
  month = {may},
  date = {19-21},
  address = {Valletta, Malta},
  editor = {Nicoletta Calzolari (Conference Chair) and Khalid Choukri and Bente Maegaard and Joseph Mariani and Jan Odijk and Stelios Piperidis and Mike Rosner and Daniel Tapias},
  publisher = {European Language Resources Association (ELRA)},
  isbn = {2-9517408-6-7},
  language = {english}
}