Title |
Word-based Partial Annotation for Efficient Corpus Construction |
Authors |
Graham Neubig and Shinsuke Mori |
Abstract |
In order to utilize the corpus-based techniques that have proven effective in natural language processing in recent years, costly and time-consuming manual creation of linguistic resources is often necessary. Traditionally these resources are created on the document or sentence-level. In this paper, we examine the benefit of annotating only particular words with high information content, as opposed to the entire sentence or document. Using the task of Japanese pronunciation estimation as an example, we devise a machine learning method that can be trained on data annotated word-by-word. This is done by dividing the estimation process into two steps (word segmentation and word-based pronunciation estimation), and introducing a point-wise estimator that is able to make each decision independent of the other decisions made for a particular sentence. In an evaluation, the proposed strategy is shown to provide greater increases in accuracy using a smaller number of annotated words than traditional sentence-based annotation techniques. |
Language |
Tools, systems, applications |
Topics |
Corpus (creation, annotation, etc.), Statistical and machine learning methods, Tools, systems, applications |
Full paper  |
Word-based Partial Annotation for Efficient Corpus Construction |
Bibtex |
@InProceedings{NEUBIG10.408,
  author = {Graham Neubig and Shinsuke Mori},
  title = {Word-based Partial Annotation for Efficient Corpus Construction},
  booktitle = {Proceedings of the Seventh conference on International Language Resources and Evaluation (LREC'10)},
  year = {2010},
  month = {may},
  date = {19-21},
  address = {Valletta, Malta},
  editor = {Nicoletta Calzolari (Conference Chair), Khalid Choukri, Bente Maegaard, Joseph Mariani, Jan Odijk, Stelios Piperidis, Mike Rosner, Daniel Tapias},
  publisher = {European Language Resources Association (ELRA)},
  isbn = {2-9517408-6-7},
  language = {english}
} |