Title
How Large a Corpus Do We Need: Statistical Method Versus Rule-based Method
Authors |
Hai Zhao, Yan Song and Chunyu Kit |
Abstract |
We investigate the impact of input data scale in corpus-based learning using a study style of Zipf's law. Chinese word segmentation is chosen as the case study, and a series of experiments is specially conducted for it, in which two types of segmentation techniques, statistical learning and rule-based methods, are examined. The empirical results show that a linear performance improvement in statistical learning requires at least an exponential increase in training corpus size. As for the rule-based method, an approximate negative inverse relationship between performance and the size of the input lexicon can be observed.
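The abstract's two scaling claims can be made concrete: a linear F-score gain that requires exponentially more data is equivalent to performance growing logarithmically in corpus size, and the rule-based trend is an error term inversely proportional to lexicon size. The sketch below is purely illustrative, with hypothetical coefficients (`a`, `b`, `c`) not taken from the paper:

```python
import math

def f_score_statistical(n_tokens, a=0.70, b=0.02):
    """Hypothetical statistical-learning curve: F ~ a + b * log2(corpus size).
    Doubling the corpus adds only the constant increment b, so each fixed
    linear gain in F costs an exponential increase in data."""
    return a + b * math.log2(n_tokens)

def error_rule_based(lexicon_size, c=5000.0):
    """Hypothetical negative inverse relationship: error ~ c / lexicon size."""
    return c / lexicon_size

# Doubling the corpus from 1M to 2M tokens yields only the constant gain b.
gain = f_score_statistical(2_000_000) - f_score_statistical(1_000_000)
print(round(gain, 4))
```

Under these toy assumptions, every doubling of the training corpus buys the same small F-score increment, matching the "linear improvement needs exponential data" reading of the abstract.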
Language
English
Topics |
Corpus (creation, annotation, etc.), Statistical and machine learning methods
Full paper
How Large a Corpus Do We Need: Statistical Method Versus Rule-based Method
Bibtex |
@InProceedings{ZHAO10.199,
  author = {Hai Zhao and Yan Song and Chunyu Kit},
  title = {How Large a Corpus Do We Need: Statistical Method Versus Rule-based Method},
  booktitle = {Proceedings of the Seventh conference on International Language Resources and Evaluation (LREC'10)},
  year = {2010},
  month = {may},
  date = {19-21},
  address = {Valletta, Malta},
  editor = {Nicoletta Calzolari (Conference Chair) and Khalid Choukri and Bente Maegaard and Joseph Mariani and Jan Odijk and Stelios Piperidis and Mike Rosner and Daniel Tapias},
  publisher = {European Language Resources Association (ELRA)},
  isbn = {2-9517408-6-7},
  language = {english}
}