Summary of the paper

Title A Corpus Factory for Many Languages
Authors Adam Kilgarriff, Siva Reddy, Jan Pomikálek and Avinesh PVS
Abstract For many languages there are no large, general-language corpora available. Until the web, all but the institutions could do little but shake their heads in dismay as corpus-building was long, slow and expensive. But with the advent of the Web it can be highly automated and thereby fast and inexpensive. We have developed a ‘corpus factory’ where we build large corpora. In this paper we describe the method we use, and how it has worked, and how various problems were solved, for eight languages: Dutch, Hindi, Indonesian, Norwegian, Swedish, Telugu, Thai and Vietnamese.
We use the BootCaT method: we take a set of 'seed words' for the language from Wikipedia. Then, several hundred times over, we
 * randomly select three or four of the seed words
 * send as a query to Google or Yahoo or Bing, which returns a 'search hits' page
 * gather the pages that Google or Yahoo point to and save the text.
This forms the corpus, which we then
 * 'clean' (to remove navigation bars, advertisements etc)
 * remove duplicates
 * tokenise and (if tools are available) lemmatise and part-of-speech tag
 * load into our corpus query tool, the Sketch Engine
The corpora we have developed are available for use in the Sketch Engine corpus query tool.
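The query-generation and duplicate-removal steps described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' code: the function names `make_queries` and `drop_exact_duplicates`, the default of 300 queries, and the placeholder seed words are all assumptions made for illustration. The search-engine call and page download are left out, and the duplicate check here is only an exact match after whitespace and case normalisation, which may be simpler than what the paper actually uses.

```python
import random


def make_queries(seed_words, n_queries=300, seed=0):
    """Build n_queries queries, each a tuple of 3 or 4 distinct seed words.

    Hypothetical helper: n_queries=300 stands in for "several hundred".
    """
    rng = random.Random(seed)  # fixed seed so runs are repeatable
    queries = []
    for _ in range(n_queries):
        k = rng.choice((3, 4))  # three or four seed words per query
        queries.append(tuple(rng.sample(seed_words, k)))
    return queries


def drop_exact_duplicates(texts):
    """Keep the first copy of each text, comparing after normalisation.

    Assumption: exact matching only; near-duplicates are not detected.
    """
    seen = set()
    unique = []
    for text in texts:
        key = " ".join(text.split()).lower()
        if key not in seen:
            seen.add(key)
            unique.append(text)
    return unique


if __name__ == "__main__":
    # Placeholder seed words; a real run would take them from Wikipedia.
    seeds = ["water", "school", "river", "music", "market", "winter"]
    for query in make_queries(seeds, n_queries=3):
        print(" ".join(query))
```

Each query string would then be sent to a search engine, and the text of the returned pages passed through cleaning and `drop_exact_duplicates` before tokenisation.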
Language LR Infrastructures and Architectures
Topics Corpus (creation, annotation, etc.), Acquisition, LR Infrastructures and Architectures
Full paper A Corpus Factory for Many Languages
Bibtex @InProceedings{KILGARRIFF10.79,
  author = {Adam Kilgarriff and Siva Reddy and Jan Pomikálek and Avinesh PVS},
  title = {A Corpus Factory for Many Languages},
  booktitle = {Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)},
  year = {2010},
  month = {may},
  date = {19-21},
  address = {Valletta, Malta},
  editor = {Nicoletta Calzolari (Conference Chair) and Khalid Choukri and Bente Maegaard and Joseph Mariani and Jan Odijk and Stelios Piperidis and Mike Rosner and Daniel Tapias},
  publisher = {European Language Resources Association (ELRA)},
  isbn = {2-9517408-6-7},
  language = {english}
 }