LREC 2010 Proceedings

Summary of the paper

Title	Language Identification of Short Text Segments with N-gram Models
Authors	Tommi Vatanen, Jaakko J. Väyrynen and Sami Virpioja
Abstract	There are many accurate methods for language identification of long text samples, but identification of very short strings still presents a challenge. This paper studies a language identification task, in which the test samples have only 5-21 characters. We compare two distinct methods that are well suited for this task: a naive Bayes classifier based on character n-gram models, and the ranking method by Cavnar and Trenkle (1994). For the n-gram models, we test several standard smoothing techniques, including the current state-of-the-art, the modified Kneser-Ney interpolation. Experiments are conducted with 281 languages using the Universal Declaration of Human Rights. Advanced language model smoothing techniques improve the identification accuracy and the respective classifiers outperform the ranking method. The higher accuracy is obtained at the cost of larger models and slower classification speed. However, there are several methods to reduce the size of an n-gram model, and our experiments with model pruning show that it provides an easy way to balance the size and the identification accuracy. We also compare the results to the language identifier in Google AJAX Language API, using a subset of 50 languages.
Language	Language modelling
Topics	Language Identification, Statistical and machine learning methods, Language modelling
Full paper	Language Identification of Short Text Segments with N-gram Models
Bibtex	@InProceedings{VATANEN10.279,
  author = {Tommi Vatanen and Jaakko J. Väyrynen and Sami Virpioja},
  title = {Language Identification of Short Text Segments with N-gram Models},
  booktitle = {Proceedings of the Seventh conference on International Language Resources and Evaluation (LREC'10)},
  year = {2010},
  month = {may},
  date = {19-21},
  address = {Valletta, Malta},
  editor = {Nicoletta Calzolari (Conference Chair) and Khalid Choukri and Bente Maegaard and Joseph Mariani and Jan Odijk and Stelios Piperidis and Mike Rosner and Daniel Tapias},
  publisher = {European Language Resources Association (ELRA)},
  isbn = {2-9517408-6-7},
  language = {english}
}
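
Illustration	The abstract describes a naive Bayes classifier built on character n-gram models. The Python sketch below shows the general idea only and is not the authors' implementation: it uses simple additive smoothing rather than the modified Kneser-Ney interpolation studied in the paper, assumes a uniform prior over languages, and trains on a single sentence per language (Article 1 of the Universal Declaration of Human Rights). All names (`char_ngrams`, `CharNgramModel`, `train`, `identify`) and the values `n=3` and `alpha=0.1` are hypothetical choices made for this example.

```python
import math
from collections import Counter


def char_ngrams(text, n):
    """Return one character n-gram per character, padding the start with spaces."""
    padded = " " * (n - 1) + text
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


class CharNgramModel:
    """Character n-gram model with additive smoothing (an assumption of this
    sketch; the paper evaluates more advanced smoothing techniques)."""

    def __init__(self, text, n, alphabet_size, alpha):
        self.n = n
        self.alpha = alpha
        self.alphabet_size = alphabet_size
        grams = char_ngrams(text, n)
        self.ngram_counts = Counter(grams)
        # The context is the first n-1 characters of each n-gram.
        self.context_counts = Counter(g[:-1] for g in grams)

    def log_prob(self, text):
        """Sum of log P(char | previous n-1 chars) over the whole string."""
        total = 0.0
        for g in char_ngrams(text, self.n):
            num = self.ngram_counts[g] + self.alpha
            den = self.context_counts[g[:-1]] + self.alpha * self.alphabet_size
            total += math.log(num / den)
        return total


def train(samples, n=3, alpha=0.1):
    """Build one model per language from a {language: training text} mapping."""
    alphabet = set(" ")
    for text in samples.values():
        alphabet.update(text)
    alphabet_size = len(alphabet) + 1  # one extra slot for unseen characters
    return {lang: CharNgramModel(text, n, alphabet_size, alpha)
            for lang, text in samples.items()}


def identify(models, text):
    """Naive Bayes decision with a uniform prior: pick the most likely language."""
    return max(models, key=lambda lang: models[lang].log_prob(text))


# Toy training data: far too small for real use, for illustration only.
samples = {
    "en": "all human beings are born free and equal in dignity and rights",
    "fi": "kaikki ihmiset syntyvät vapaina ja tasavertaisina arvoltaan ja oikeuksiltaan",
}
models = train(samples)

print(identify(models, "free and equal"))
print(identify(models, "vapaina ja"))
```

The two test strings fall within the 5-21 character range studied in the paper. With realistic amounts of training text, the choice of smoothing method and n-gram order would matter far more than in this toy setting.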