Title |
NP Alignment in Bilingual Corpora |
Authors |
Gabor Recski, András Rung, Attila Zséder and András Kornai |
Abstract |
Aligning the NPs of parallel corpora is logically halfway between the sentence- and word-alignment tasks that occupy much of the MT literature, but has received far less attention. NP alignment is a challenging problem, capable of rapidly exposing flaws both in the word-alignment and in the NP chunking algorithms one may bring to bear. It is also a very rewarding problem in that NPs are semantically natural translation units, which means that (i) word alignments will cross NP boundaries only exceptionally, and (ii) within sentences already aligned, the proportion of 1-1 alignments will be higher for NPs than words. We created a simple gold standard for English-Hungarian, Orwell's 1984 (since this already exists in manually verified POS-tagged format in many languages thanks to the Multext and Multext-East projects), by manually verifying the automatically generated NP chunking (we used the yamcha, mallet and hunchunk taggers) and manually aligning the maximal NPs and PPs. The maximal NP chunking problem is much harder than base NP chunking, with F-measure in the .7 range (as opposed to over .94 for base NPs). Since the results are highly impacted by the quality of the NP chunking, we tested our alignment algorithms both with real world (machine obtained) chunkings, where results are in the .35 range for the baseline algorithm which propagates GIZA++ word alignments to the NP level, and on idealized (manually obtained) chunkings, where the baseline reaches .4 and our current system reaches .64. |
Language |
Parsing |
Topics |
Corpus (creation, annotation, etc.), Machine Translation, Speech-to-Speech Translation, Parsing |
Full paper  |
NP Alignment in Bilingual Corpora |
Bibtex |
@InProceedings{RECSKI10.531,
  author = {Gabor Recski and András Rung and Attila Zséder and András Kornai},
  title = {NP Alignment in Bilingual Corpora},
  booktitle = {Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)},
  year = {2010},
  month = {may},
  date = {19-21},
  address = {Valletta, Malta},
  editor = {Nicoletta Calzolari (Conference Chair) and Khalid Choukri and Bente Maegaard and Joseph Mariani and Jan Odijk and Stelios Piperidis and Mike Rosner and Daniel Tapias},
  publisher = {European Language Resources Association (ELRA)},
  isbn = {2-9517408-6-7},
  language = {english}
} |