Title |
Mining Naturally-occurring Corrections and Paraphrases from Wikipedia's Revision History |
Authors |
Aurélien Max and Guillaume Wisniewski |
Abstract |
Naturally-occurring instances of linguistic phenomena are important both for training and for evaluating automatic text processing. When available in large quantities, they also prove interesting material for linguistic studies. In this article, we present WiCoPaCo (Wikipedia Correction and Paraphrase Corpus), a new freely-available resource built by automatically mining Wikipedia's revision history. The WiCoPaCo corpus focuses on local modifications made by human revisors and includes various types of corrections (such as spelling error or typographical corrections) and rewritings, which can be categorized broadly into meaning-preserving and meaning-altering revisions. We present an initial hand-built typology of these revisions, but the resource allows for any possible annotation scheme. We discuss the main motivations for building such a resource and describe the main technical details guiding its construction. We also present applications and data analysis on French and report initial results on spelling error correction and morphosyntactic rewriting. The WiCoPaCo corpus can be freely downloaded from http://wicopaco.limsi.fr. |
Language |
Authoring tools, proofing |
Topics |
Corpus (creation, annotation, etc.), Textual Entailment and Paraphrasing, Authoring tools, proofing |
Full paper  |
Mining Naturally-occurring Corrections and Paraphrases from Wikipedia's Revision History |
Bibtex |
@InProceedings{MAX10.827,
  author    = {Aurélien Max and Guillaume Wisniewski},
  title     = {Mining Naturally-occurring Corrections and Paraphrases from Wikipedia's Revision History},
  booktitle = {Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)},
  year      = {2010},
  month     = {may},
  date      = {19-21},
  address   = {Valletta, Malta},
  editor    = {Nicoletta Calzolari (Conference Chair) and Khalid Choukri and Bente Maegaard and Joseph Mariani and Jan Odijk and Stelios Piperidis and Mike Rosner and Daniel Tapias},
  publisher = {European Language Resources Association (ELRA)},
  isbn      = {2-9517408-6-7},
  language  = {english}
} |