Summary of the paper

Title Quality Indicators of LSP Texts ― Selection and Measurements Measuring the Terminological Usefulness of Documents for an LSP Corpus
Authors Jakob Halskov, Dorte Haltrup Hansen, Anna Braasch and Sussi Olsen
Abstract This paper describes and evaluates a prototype quality assurance system for LSP corpora. The system will be employed in compiling a corpus of 11 M tokens for various linguistic and terminological purposes. The system utilizes a number of linguistic features as quality indicators. These represent two dimensions of quality, namely readability/formality (e.g. word length and passive constructions) and density of specialized knowledge (e.g. out-of-vocabulary items). Threshold values for each indicator are induced from a reference corpus of general (fiction, magazines and newspapers) and specialized language (the domains of Health/Medicine and Environment/Climate). In order to test the efficiency of the indicators, a number of terminologically relevant, irrelevant and possibly relevant texts are manually selected from target web sites as candidate texts. By applying the indicators to these candidate texts, the system is able to filter out non-LSP and “poor” LSP texts with a precision of 100% and a recall of 55%. Thus, the experiment described in this paper constitutes fundamental work towards a formulation of ‘best practice’ for implementing quality assurance when selecting appropriate texts for an LSP corpus. The domain independence of the quality indicators still remains to be thoroughly tested on more than just two domains.
Language Other
Topics Corpus (creation, annotation, etc.), Information Extraction, Information Retrieval, Other
Full paper Quality Indicators of LSP Texts ― Selection and Measurements Measuring the Terminological Usefulness of Documents for an LSP Corpus
Bibtex @InProceedings{HALSKOV10.505,
  author = {Jakob Halskov and Dorte Haltrup Hansen and Anna Braasch and Sussi Olsen},
  title = {Quality Indicators of LSP Texts ― Selection and Measurements Measuring the Terminological Usefulness of Documents for an LSP Corpus},
  booktitle = {Proceedings of the Seventh conference on International Language Resources and Evaluation (LREC'10)},
  year = {2010},
  month = {may},
  date = {19-21},
  address = {Valletta, Malta},
  editor = {Nicoletta Calzolari (Conference Chair) and Khalid Choukri and Bente Maegaard and Joseph Mariani and Jan Odijk and Stelios Piperidis and Mike Rosner and Daniel Tapias},
  publisher = {European Language Resources Association (ELRA)},
  isbn = {2-9517408-6-7},
  language = {english}
 }