LREC 2010 Proceedings

Summary of the paper

Title	A Named Entity Labeler for German: Exploiting Wikipedia and Distributional Clusters
Authors	Grzegorz Chrupała and Dietrich Klakow
Abstract	Named Entity Recognition is a relatively well-understood NLP task, with many publicly available training resources and software for processing English data. Other languages tend to be underserved in this area. For German, CoNLL-2003 Shared Task provided training data, but there are no publicly available, ready-to-use tools. We fill this gap and develop a German NER system with state-of-the-art performance. In addition to CoNLL 2003 labeled training data, we use two additional resources: (i) 32 million words of unlabeled news article text and (ii) infobox labels from German Wikipedia articles. From the unlabeled text we derive distributional word clusters. Then we use cluster membership features and Wikipedia infobox label features to train a supervised model on the labeled training data. This approach allows us to deal better with word-types unseen in the training data and achieve good performance on German with little engineering effort.
Language	Tools, systems, applications
Topics	Named Entity recognition, Multilinguality, Tools, systems, applications
Full paper	A Named Entity Labeler for German: Exploiting Wikipedia and Distributional Clusters
Bibtex	@InProceedings{CHRUPAA10.538, author = {Grzegorz Chrupała and Dietrich Klakow}, title = {A Named Entity Labeler for German: Exploiting Wikipedia and Distributional Clusters}, booktitle = {Proceedings of the Seventh conference on International Language Resources and Evaluation (LREC'10)}, year = {2010}, month = {may}, date = {19-21}, address = {Valletta, Malta}, editor = {Nicoletta Calzolari (Conference Chair), Khalid Choukri, Bente Maegaard, Joseph Mariani, Jan Odijk, Stelios Piperidis, Mike Rosner, Daniel Tapias}, publisher = {European Language Resources Association (ELRA)}, isbn = {2-9517408-6-7}, language = {english} }