Title |
An Annotated Dataset for Extracting Definitions and Hypernyms from the Web |
Authors |
Roberto Navigli, Paola Velardi and Juana María Ruiz-Martínez |
Abstract |
This paper presents and analyzes an annotated corpus of definitions, created to train an algorithm for the automatic extraction of definitions and hypernyms from web documents. As an additional resource, we also include a corpus of non-definitions with syntactic patterns similar to those of definition sentences, e.g.: "An android is a robot" vs. "Snowcap is unmistakable". Domain and style independence is obtained thanks to the annotation of a large and domain-balanced corpus and to a novel pattern generalization algorithm based on word-class lattices (WCL). A lattice is a directed acyclic graph (DAG), a subclass of nondeterministic finite state automata (NFA). The lattice structure has the purpose of preserving the salient differences among distinct sequences, while eliminating redundant information. The WCL algorithm will be integrated into an improved version of the GlossExtractor Web application (Velardi et al., 2008). This paper is mostly concerned with a description of the corpus, the annotation strategy, and a linguistic analysis of the data. A summary of the WCL algorithm is also provided for the sake of completeness. |
Language |
Information Extraction, Information Retrieval |
Topics |
Corpus (creation, annotation, etc.), Semantics, Information Extraction, Information Retrieval |
Full paper  |
An Annotated Dataset for Extracting Definitions and Hypernyms from the Web |
Bibtex |
@InProceedings{NAVIGLI10.20,
  author = {Roberto Navigli and Paola Velardi and Juana María Ruiz-Martínez},
  title = {An Annotated Dataset for Extracting Definitions and Hypernyms from the Web},
  booktitle = {Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)},
  year = {2010},
  month = {may},
  date = {19-21},
  address = {Valletta, Malta},
  editor = {Nicoletta Calzolari (Conference Chair) and Khalid Choukri and Bente Maegaard and Joseph Mariani and Jan Odijk and Stelios Piperidis and Mike Rosner and Daniel Tapias},
  publisher = {European Language Resources Association (ELRA)},
  isbn = {2-9517408-6-7},
  language = {english}
} |