LREC 2010 Proceedings

Summary of the paper

Title	Word-based Partial Annotation for Efficient Corpus Construction
Authors	Graham Neubig and Shinsuke Mori
Abstract	In order to utilize the corpus-based techniques that have proven effective in natural language processing in recent years, costly and time-consuming manual creation of linguistic resources is often necessary. Traditionally these resources are created on the document or sentence-level. In this paper, we examine the benefit of annotating only particular words with high information content, as opposed to the entire sentence or document. Using the task of Japanese pronunciation estimation as an example, we devise a machine learning method that can be trained on data annotated word-by-word. This is done by dividing the estimation process into two steps (word segmentation and word-based pronunciation estimation), and introducing a point-wise estimator that is able to make each decision independent of the other decisions made for a particular sentence. In an evaluation, the proposed strategy is shown to provide greater increases in accuracy using a smaller number of annotated words than traditional sentence-based annotation techniques.
Language	Tools, systems, applications
Topics	Corpus (creation, annotation, etc.), Statistical and machine learning methods, Tools, systems, applications
Full paper	Word-based Partial Annotation for Efficient Corpus Construction
Bibtex	@InProceedings{NEUBIG10.408, author = {Graham Neubig and Shinsuke Mori}, title = {Word-based Partial Annotation for Efficient Corpus Construction}, booktitle = {Proceedings of the Seventh conference on International Language Resources and Evaluation (LREC'10)}, year = {2010}, month = {may}, date = {19-21}, address = {Valletta, Malta}, editor = {Nicoletta Calzolari (Conference Chair), Khalid Choukri, Bente Maegaard, Joseph Mariani, Jan Odijk, Stelios Piperidis, Mike Rosner, Daniel Tapias}, publisher = {European Language Resources Association (ELRA)}, isbn = {2-9517408-6-7}, language = {english} }