Summary of the paper

Title The Problems of Language Identification within Hugely Multilingual Data Sets
Authors Fei Xia, Carrie Lewis and William D. Lewis
Abstract As the data for more and more languages is finding its way into digital form, with an increasing amount of this data being posted to the Web, it has become possible to collect language data from the Web and create large multilingual resources, covering hundreds or even thousands of languages. ODIN, the Online Database of INterlinear text (Lewis, 2006), is such a resource. It currently consists of nearly 200,000 data points for over 1,000 languages, the data for which was harvested from linguistic documents on the Web. We identify a number of issues with language identification for such broad-coverage resources, including the lack of training data, ambiguous language names, incomplete language code sets, and incorrect uses of language names and codes. After providing a short overview of existing language code sets maintained by the linguistic community, we discuss what linguists and the linguistic community can do to make the process of language identification easier.
Language Endangered languages
Topics Corpus (creation, annotation, etc.), Multilinguality, Endangered languages
Full paper The Problems of Language Identification within Hugely Multilingual Data Sets
Bibtex @InProceedings{XIA10.921,
  author = {Fei Xia and Carrie Lewis and William D. Lewis},
  title = {The Problems of Language Identification within Hugely Multilingual Data Sets},
  booktitle = {Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)},
  year = {2010},
  month = {may},
  date = {19-21},
  address = {Valletta, Malta},
  editor = {Nicoletta Calzolari (Conference Chair) and Khalid Choukri and Bente Maegaard and Joseph Mariani and Jan Odijk and Stelios Piperidis and Mike Rosner and Daniel Tapias},
  publisher = {European Language Resources Association (ELRA)},
  isbn = {2-9517408-6-7},
  language = {english}
 }