Title |
Towards a Balanced Named Entity Corpus for Dutch |
Authors |
Bart Desmet and Véronique Hoste |
Abstract |
This paper introduces a new named entity corpus for Dutch. State-of-the-art named entity recognition systems require a substantial annotated corpus to be trained on. Such corpora exist for English, but not for Dutch. The STEVIN-funded SoNaR project aims to produce a diverse 500-million-word reference corpus of written Dutch, with four semantic annotation layers: named entities, coreference relations, semantic roles and spatiotemporal expressions. A 1-million-word subset will be manually corrected. Named entity annotation guidelines for Dutch were developed, adapted from the MUC and ACE guidelines. Adaptations include the annotation of products and events, the classification into subtypes, and the markup of metonymic usage. Inter-annotator agreement experiments were conducted to corroborate the reliability of the guidelines, which yielded satisfactory results (Kappa scores above 0.90). We are building a NER system, trained on the 1-million-word subcorpus, to automatically classify the remainder of the SoNaR corpus. To this end, experiments with various classification algorithms (MBL, SVM, CRF) and features have been carried out and evaluated. |
Language |
LR national/international projects, organizational/policy issues |
Topics |
Named Entity recognition, Corpus (creation, annotation, etc.), LR national/international projects, organizational/policy issues |
Full paper  |
Towards a Balanced Named Entity Corpus for Dutch |
Bibtex |
@InProceedings{DESMET10.210,
  author    = {Bart Desmet and Véronique Hoste},
  title     = {Towards a Balanced Named Entity Corpus for Dutch},
  booktitle = {Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)},
  year      = {2010},
  month     = {may},
  date      = {19-21},
  address   = {Valletta, Malta},
  editor    = {Nicoletta Calzolari (Conference Chair), Khalid Choukri, Bente Maegaard, Joseph Mariani, Jan Odijk, Stelios Piperidis, Mike Rosner, Daniel Tapias},
  publisher = {European Language Resources Association (ELRA)},
  isbn      = {2-9517408-6-7},
  language  = {english}
} |