Title |
The English-Swedish-Turkish Parallel Treebank |
Authors |
Beáta Megyesi, Bengt Dahlqvist, Éva Á. Csató and Joakim Nivre |
Abstract |
We describe a syntactically annotated parallel corpus containing typologically partly different languages, namely English, Swedish and Turkish. The corpus consists of approximately 300 000 tokens in Swedish, 160 000 in Turkish and 150 000 in English, containing both fiction and technical documents. We build the corpus by using the Uplug toolkit for automatic structural markup, such as tokenization and sentence segmentation, as well as sentence and word alignment. In addition, we use basic language resource kits for the linguistic analysis of the languages involved. The annotation is carried out on various layers, from morphological and part-of-speech analysis to dependency structures. The tools used for linguistic annotation, e.g., HunPos tagger and MaltParser, are freely available data-driven resources, trained on existing corpora and treebanks for each language. The parallel treebank is used in teaching and linguistic research to study the relationship between the structurally different languages. In order to study the treebank, several tools have been developed for the visualization of the annotation and alignment, allowing search for linguistic patterns. |
Language |
Grammar and Syntax |
Topics |
Corpus (creation, annotation, etc.), Tools, systems, applications, Grammar and Syntax |
Full paper  |
The English-Swedish-Turkish Parallel Treebank |
Bibtex |
author = {Beáta Megyesi and Bengt Dahlqvist and Éva Á. Csató and Joakim Nivre}, title = {The English-Swedish-Turkish Parallel Treebank}, booktitle = {Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)}, year = {2010}, month = {may}, date = {19-21}, address = {Valletta, Malta}, editor = {Nicoletta Calzolari (Conference Chair) and Khalid Choukri and Bente Maegaard and Joseph Mariani and Jan Odijk and Stelios Piperidis and Mike Rosner and Daniel Tapias}, publisher = {European Language Resources Association (ELRA)}, isbn = {2-9517408-6-7}, language = {english} } |