Summary of the paper

Title Towards a Large Parallel Corpus of Cleft Constructions
Authors Gerlof Bouma, Lilja Øvrelid and Jonas Kuhn
Abstract We present our efforts to create a large-scale, semi-automatically annotated parallel corpus of cleft constructions. The corpus is intended to reduce or make more effective the manual task of finding examples of clefts in a corpus. The corpus is being developed in the context of the Collaborative Research Centre SFB 632, which is a large, interdisciplinary research initiative to study information structure, at the University of Potsdam and the Humboldt University in Berlin. The corpus is based on the Europarl corpus (version 3). We show how state-of-the-art NLP tools, like POS taggers and statistical dependency parsers, may facilitate powerful and precise searches. We argue that identifying clefts using automatically added syntactic structure annotation is ultimately to be preferred over using lower level, though more robust, extraction methods like regular expression matching. An evaluation of the extraction method for one of the languages also offers some support for this method. We end the paper by discussing the resulting corpus itself. We present some examples of interesting clefts and translational counterparts from the corpus and suggest ways of exploiting our newly created resource in the cross-linguistic study of clefts.
Language Grammar and Syntax
Topics Corpus (creation, annotation, etc.), Discourse annotation, representation and processing, Grammar and Syntax
Full paper Towards a Large Parallel Corpus of Cleft Constructions
Bibtex @InProceedings{BOUMA10.291,
  author = {Gerlof Bouma and Lilja Øvrelid and Jonas Kuhn},
  title = {Towards a Large Parallel Corpus of Cleft Constructions},
  booktitle = {Proceedings of the Seventh conference on International Language Resources and Evaluation (LREC'10)},
  year = {2010},
  month = {may},
  date = {19-21},
  address = {Valletta, Malta},
  editor = {{Nicoletta Calzolari (Conference Chair)} and Khalid Choukri and Bente Maegaard and Joseph Mariani and Jan Odijk and Stelios Piperidis and Mike Rosner and Daniel Tapias},
  publisher = {European Language Resources Association (ELRA)},
  isbn = {2-9517408-6-7},
  language = {english}
 }