Summary of the paper

Title There’s no Data like More Data? Revisiting the Impact of Data Size on a Classification Task
Authors Ines Rehbein and Josef Ruppenhofer
Abstract In the paper we investigate the impact of data size on a Word Sense Disambiguation task (WSD). We question the assumption that the knowledge acquisition bottleneck, which is known as one of the major challenges for WSD, can be solved by simply obtaining more and more training data. Our case study on 1,000 manually annotated instances of the German verb "drohen" (threaten) shows that the best performance is not obtained when training on the full data set, but by carefully selecting new training instances with regard to their informativeness for the learning process (Active Learning). We present a thorough evaluation of the impact of different sampling methods on the data sets and propose an improved method for uncertainty sampling which dynamically adapts the selection of new instances to the learning progress of the classifier, resulting in more robust results during the initial stages of learning. A qualitative error analysis identifies problems for automatic WSD and discusses the reasons for the great gap in performance between human annotators and our automatic WSD system.
Language Statistical and machine learning methods
Topics Word Sense Disambiguation, Tools, systems, applications, Statistical and machine learning methods
Full paper There’s no Data like More Data? Revisiting the Impact of Data Size on a Classification Task
Bibtex @InProceedings{REHBEIN10.806,
  author = {Ines Rehbein and Josef Ruppenhofer},
  title = {There’s no Data like More Data? Revisiting the Impact of Data Size on a Classification Task},
  booktitle = {Proceedings of the Seventh conference on International Language Resources and Evaluation (LREC'10)},
  year = {2010},
  month = {may},
  date = {19-21},
  address = {Valletta, Malta},
  editor = {Nicoletta Calzolari (Conference Chair), Khalid Choukri, Bente Maegaard, Joseph Mariani, Jan Odijk, Stelios Piperidis, Mike Rosner, Daniel Tapias},
  publisher = {European Language Resources Association (ELRA)},
  isbn = {2-9517408-6-7},
  language = {english}
 }