Title |
Towards a Large Parallel Corpus of Cleft Constructions |
Authors |
Gerlof Bouma, Lilja Øvrelid and Jonas Kuhn |
Abstract |
We present our efforts to create a large-scale, semi-automatically annotated parallel corpus of cleft constructions. The corpus is intended to reduce or make more effective the manual task of finding examples of clefts in a corpus. The corpus is being developed in the context of the Collaborative Research Centre SFB 632, which is a large, interdisciplinary research initiative to study information structure, at the University of Potsdam and the Humboldt University in Berlin. The corpus is based on the Europarl corpus (version 3). We show how state-of-the-art NLP tools, like POS taggers and statistical dependency parsers, may facilitate powerful and precise searches. We argue that identifying clefts using automatically added syntactic structure annotation is ultimately to be preferred over using lower level, though more robust, extraction methods like regular expression matching. An evaluation of the extraction method for one of the languages also offers some support for this method. We end the paper by discussing the resulting corpus itself. We present some examples of interesting clefts and translational counterparts from the corpus and suggest ways of exploiting our newly created resource in the cross-linguistic study of clefts. |
Language |
Grammar and Syntax |
Topics |
Corpus (creation, annotation, etc.), Discourse annotation, representation and processing, Grammar and Syntax |
Full paper  |
Towards a Large Parallel Corpus of Cleft Constructions |
Bibtex |
@InProceedings{BOUMA10.291,
  author    = {Gerlof Bouma and Lilja Øvrelid and Jonas Kuhn},
  title     = {Towards a Large Parallel Corpus of Cleft Constructions},
  booktitle = {Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)},
  year      = {2010},
  month     = {may},
  date      = {19-21},
  address   = {Valletta, Malta},
  editor    = {{Nicoletta Calzolari (Conference Chair)} and Khalid Choukri and Bente Maegaard and Joseph Mariani and Jan Odijk and Stelios Piperidis and Mike Rosner and Daniel Tapias},
  publisher = {European Language Resources Association (ELRA)},
  isbn      = {2-9517408-6-7},
  language  = {english}
} |