Summary of the paper

Title The CALBC Silver Standard Corpus for Biomedical Named Entities ― A Study in Harmonizing the Contributions from Four Independent Named Entity Taggers
Authors Dietrich Rebholz-Schuhmann, Antonio José Jimeno-Yepes, Erik M. van Mulligen, Ning Kang, Jan Kors, David Milward, Peter Corbett, Ekaterina Buyko, Katrin Tomanek, Elena Beisswanger and Udo Hahn
Abstract The production of gold standard corpora is time-consuming and costly. We propose an alternative: the "silver standard corpus" (SSC), a corpus that has been generated by the harmonisation of the annotations that have been delivered from a selection of annotation systems. The systems have to share the type system for the annotations, and the harmonisation solution has to use a suitable similarity measure for the pair-wise comparison of the annotations. The annotation systems have been evaluated against the harmonised set (630,324 sentences, 15,956,841 tokens). We can demonstrate that the annotation of proteins and genes shows higher diversity across all used annotation solutions, leading to a lower agreement against the harmonised set in comparison to the annotations of diseases and species. An analysis of the most frequent annotations from all systems shows that a high agreement amongst systems leads to the selection of terms that are suitable to be kept in the harmonised set. This is the first large-scale approach to generate an annotated corpus from automated annotation systems. Further research is required to understand how the annotations from different systems have to be combined to produce the best annotation result for a harmonised corpus.
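Illustration (not from the paper): the abstract names the ingredients of the harmonisation (a shared type system, a pair-wise similarity measure, and evaluation of each system against the harmonised set) but not the exact procedure. The sketch below is one possible reading under stated assumptions: annotations are character spans with a type label, similarity is the Jaccard overlap of the spans, an annotation is kept when at least `min_votes` systems agree on it, and agreement is reported as an F-measure. The names `Annotation`, `overlap_similarity`, `harmonise`, and `f_measure`, as well as the voting rule and the thresholds, are hypothetical and chosen for illustration only.

```python
from typing import NamedTuple


class Annotation(NamedTuple):
    start: int      # character offset where the mention begins
    end: int        # character offset one past the last character
    sem_type: str   # label from the shared type system, e.g. "disease"


def overlap_similarity(a: Annotation, b: Annotation) -> float:
    """Jaccard overlap of two character spans; 0.0 if the types differ.

    Assumption: the paper only asks for "a suitable similarity measure".
    """
    if a.sem_type != b.sem_type:
        return 0.0
    intersection = max(0, min(a.end, b.end) - max(a.start, b.start))
    union = max(a.end, b.end) - min(a.start, b.start)
    return intersection / union if union else 0.0


def harmonise(system_outputs, min_votes=2, threshold=1.0):
    """Keep every annotation that at least `min_votes` systems support.

    A system supports a candidate if one of its annotations reaches
    `threshold` similarity with it (1.0 means exact span and type match).
    """
    harmonised = []
    for i, annotations in enumerate(system_outputs):
        for candidate in annotations:
            votes = 1  # the proposing system votes for its own annotation
            for j, other in enumerate(system_outputs):
                if j != i and any(
                    overlap_similarity(candidate, o) >= threshold for o in other
                ):
                    votes += 1
            if votes >= min_votes and candidate not in harmonised:
                harmonised.append(candidate)
    return sorted(harmonised)


def f_measure(system, harmonised, threshold=1.0):
    """Agreement of one system with the harmonised set as an F-measure."""
    def matched(items, reference):
        return sum(
            any(overlap_similarity(x, r) >= threshold for r in reference)
            for x in items
        )
    precision = matched(system, harmonised) / len(system) if system else 0.0
    recall = matched(harmonised, system) / len(harmonised) if harmonised else 0.0
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0


# Four toy systems annotating the same sentence.
systems = [
    [Annotation(0, 5, "disease"), Annotation(10, 14, "species")],
    [Annotation(0, 5, "disease")],
    [Annotation(0, 5, "disease"), Annotation(10, 14, "species"),
     Annotation(20, 25, "protein")],
    [Annotation(20, 26, "protein")],
]
silver = harmonise(systems, min_votes=2)
print(silver)
print(round(f_measure(systems[1], silver), 3))
```

With exact matching, the two protein annotations disagree on their right boundary and each receives a single vote, so neither enters the harmonised set. Lowering `threshold` to 0.8 would admit both boundary variants, which shows why the choice of similarity measure and vote count shapes the resulting corpus.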
Language Text mining
Topics Corpus (creation, annotation, etc.), Named Entity recognition, Text mining
Full paper The CALBC Silver Standard Corpus for Biomedical Named Entities ― A Study in Harmonizing the Contributions from Four Independent Named Entity Taggers
Bibtex @InProceedings{REBHOLZSCHUHMANN10.888,
  author = {Dietrich Rebholz-Schuhmann and Antonio José Jimeno-Yepes and Erik M. van Mulligen and Ning Kang and Jan Kors and David Milward and Peter Corbett and Ekaterina Buyko and Katrin Tomanek and Elena Beisswanger and Udo Hahn},
  title = {The CALBC Silver Standard Corpus for Biomedical Named Entities ― A Study in Harmonizing the Contributions from Four Independent Named Entity Taggers},
  booktitle = {Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)},
  year = {2010},
  month = {may},
  date = {19-21},
  address = {Valletta, Malta},
  editor = {Nicoletta Calzolari (Conference Chair) and Khalid Choukri and Bente Maegaard and Joseph Mariani and Jan Odijk and Stelios Piperidis and Mike Rosner and Daniel Tapias},
  publisher = {European Language Resources Association (ELRA)},
  isbn = {2-9517408-6-7},
  language = {english}
 }