LREC 2010 Proceedings

Summary of the paper

Title	Efficiently Extract Recurring Tree Fragments from Large Treebanks
Authors	Federico Sangati, Willem Zuidema and Rens Bod
Abstract	In this paper we describe FragmentSeeker, a tool which is capable to identify all those tree constructions which are recurring multiple times in a large Phrase Structure treebank. The tool is based on an efficient kernel-based dynamic algorithm, which compares every pair of trees of a given treebank and computes the list of fragments which they both share. We describe two different notions of fragments we will use, i.e. standard and partial fragments, and provide the implementation details on how to extract them from a syntactically annotated corpus. We have tested our system on the Penn Wall Street Journal treebank for which we present quantitative and qualitative analysis on the obtained recurring structures, as well as provide empirical time performance. Finally we propose possible ways our tool could contribute to different research fields related to corpus analysis and processing, such as parsing, corpus statistics, annotation guidance, and automatic detection of argument structure.
Language	Corpus (creation, annotation, etc.)
Topics	Tools, systems, applications, Grammar and Syntax, Corpus (creation, annotation, etc.)
Full paper	Efficiently Extract Recurring Tree Fragments from Large Treebanks
Bibtex	@InProceedings{SANGATI10.613, author = {Federico Sangati and Willem Zuidema and Rens Bod}, title = {Efficiently Extract Recurring Tree Fragments from Large Treebanks}, booktitle = {Proceedings of the Seventh conference on International Language Resources and Evaluation (LREC'10)}, year = {2010}, month = {may}, date = {19-21}, address = {Valletta, Malta}, editor = {Nicoletta Calzolari (Conference Chair) and Khalid Choukri and Bente Maegaard and Joseph Mariani and Jan Odijk and Stelios Piperidis and Mike Rosner and Daniel Tapias}, publisher = {European Language Resources Association (ELRA)}, isbn = {2-9517408-6-7}, language = {english} }