Summary of the paper

Title Annotation and Representation of a Diachronic Corpus of Spanish
Authors Cristina Sánchez-Marco, Gemma Boleda, Josep Maria Fontana and Judith Domingo
Abstract In this article we describe two different strategies for the automatic tagging of a Spanish diachronic corpus involving the adaptation of existing NLP tools developed for modern Spanish. In the initial approach we follow a state-of-the-art strategy, which consists in standardizing the spelling and the lexicon. This approach boosts POS-tagging accuracy to 90%, which represents a raw improvement of over 20% with respect to the results obtained without any pre-processing. In order to enable non-expert users in NLP to use this new resource, the corpus has been integrated into IAC (Corpora Interface Access). We discuss the shortcomings of the initial approach and propose a new one, which does not consist in adapting the source texts to the tagger, but rather in modifying the tagger for the direct treatment of the old variants. This second strategy addresses some important shortcomings in the previous approach and is likely to be useful not only in the creation of diachronic linguistic resources but also for the treatment of dialectal or non-standard variants of synchronic languages.
Language Metadata
Topics Corpus (creation, annotation, etc.), LR Infrastructures and Architectures, Metadata
Full paper Annotation and Representation of a Diachronic Corpus of Spanish
Bibtex @InProceedings{SNCHEZMARCO10.535,
  author = {Cristina Sánchez-Marco and Gemma Boleda and Josep Maria Fontana and Judith Domingo},
  title = {Annotation and Representation of a Diachronic Corpus of Spanish},
  booktitle = {Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)},
  year = {2010},
  month = {may},
  date = {19-21},
  address = {Valletta, Malta},
  editor = {Nicoletta Calzolari (Conference Chair) and Khalid Choukri and Bente Maegaard and Joseph Mariani and Jan Odijk and Stelios Piperidis and Mike Rosner and Daniel Tapias},
  publisher = {European Language Resources Association (ELRA)},
  isbn = {2-9517408-6-7},
  language = {english}
 }