LREC 2010 Proceedings

Summary of the paper

Title	Towards a Balanced Named Entity Corpus for Dutch
Authors	Bart Desmet and Véronique Hoste
Abstract	This paper introduces a new named entity corpus for Dutch. State-of-the-art named entity recognition systems require a substantial annotated corpus to be trained on. Such corpora exist for English, but not for Dutch. The STEVIN-funded SoNaR project aims to produce a diverse 500-million-word reference corpus of written Dutch, with four semantic annotation layers: named entities, coreference relations, semantic roles and spatiotemporal expressions. A 1-million-word subset will be manually corrected. Named entity annotation guidelines for Dutch were developed, adapted from the MUC and ACE guidelines. Adaptations include the annotation of products and events, the classification into subtypes, and the markup of metonymic usage. Inter-annotator agreement experiments were conducted to corroborate the reliability of the guidelines, which yielded satisfactory results (Kappa scores above 0.90). We are building a NER system, trained on the 1-million-word subcorpus, to automatically classify the remainder of the SoNaR corpus. To this end, experiments with various classification algorithms (MBL, SVM, CRF) and features have been carried out and evaluated.
Language	LR national/international projects, organizational/policy issues
Topics	Named Entity recognition, Corpus (creation, annotation, etc.), LR national/international projects, organizational/policy issues
Full paper	Towards a Balanced Named Entity Corpus for Dutch
Bibtex	@InProceedings{DESMET10.210, author = {Bart Desmet and Véronique Hoste}, title = {Towards a Balanced Named Entity Corpus for Dutch}, booktitle = {Proceedings of the Seventh conference on International Language Resources and Evaluation (LREC'10)}, year = {2010}, month = {may}, date = {19-21}, address = {Valletta, Malta}, editor = {Nicoletta Calzolari (Conference Chair), Khalid Choukri, Bente Maegaard, Joseph Mariani, Jan Odijk, Stelios Piperidis, Mike Rosner, Daniel Tapias}, publisher = {European Language Resources Association (ELRA)}, isbn = {2-9517408-6-7}, language = {english} }