Summary of the paper

Title A Python Toolkit for Universal Transliteration
Authors Ting Qian, Kristy Hollingshead, Su-youn Yoon, Kyoung-young Kim and Richard Sproat
Abstract We describe ScriptTranscriber, an open source toolkit for extracting transliterations in comparable corpora from languages written in different scripts. The system includes various methods for extracting potential terms of interest from raw text, for providing guesses on the pronunciations of terms, and for comparing two strings as possible transliterations using both phonetic and temporal measures. The system works with any script in the Unicode Basic Multilingual Plane and is easily extended to include new modules. Given comparable corpora, such as newswire text, in a pair of languages that use different scripts, ScriptTranscriber provides an easy way to mine transliterations from the comparable texts. This is particularly useful for underresourced languages, where training data for transliteration may be lacking, and where it is thus hard to train good transliterators. ScriptTranscriber provides an open source package that allows for ready incorporation of more sophisticated modules, e.g. a trained transliteration model for a particular language pair. ScriptTranscriber is available as part of the nltk contrib source tree at http://code.google.com/p/nltk/.
Language Named Entity recognition
Topics Tools, systems, applications, Multilinguality, Named Entity recognition
Full paper A Python Toolkit for Universal Transliteration
Bibtex @InProceedings{QIAN10.30,
  author = {Ting Qian and Kristy Hollingshead and Su-youn Yoon and Kyoung-young Kim and Richard Sproat},
  title = {A Python Toolkit for Universal Transliteration},
  booktitle = {Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)},
  year = {2010},
  month = {may},
  date = {19-21},
  address = {Valletta, Malta},
  editor = {{Nicoletta Calzolari (Conference Chair)} and Khalid Choukri and Bente Maegaard and Joseph Mariani and Jan Odijk and Stelios Piperidis and Mike Rosner and Daniel Tapias},
  publisher = {European Language Resources Association (ELRA)},
  isbn = {2-9517408-6-7},
  language = {english}
 }
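Illustrative sketch The abstract mentions comparing two strings as possible transliterations using both phonetic and temporal measures. The Python sketch below shows one way such a comparison could look. It is not ScriptTranscriber's API, and the measures are generic stand-ins chosen for illustration: a normalized edit distance over guessed pronunciations, and a Pearson correlation over per-period term counts. All function names, the example pronunciations and the equal default weighting are assumptions.

```python
from math import sqrt


def edit_distance(a, b):
    """Levenshtein distance between two sequences (e.g. phoneme strings)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        curr = [i]
        for j, y in enumerate(b, 1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (x != y)))  # substitution or match
        prev = curr
    return prev[-1]


def phonetic_similarity(pron_a, pron_b):
    """Return 1 minus the length-normalized edit distance (1.0 = identical)."""
    longest = max(len(pron_a), len(pron_b))
    if longest == 0:
        return 1.0
    return 1.0 - edit_distance(pron_a, pron_b) / longest


def temporal_similarity(counts_a, counts_b):
    """Pearson correlation of two equal-length series of term counts."""
    n = len(counts_a)
    if n != len(counts_b):
        raise ValueError("count series must cover the same time periods")
    if n == 0:
        return 0.0
    mean_a = sum(counts_a) / n
    mean_b = sum(counts_b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(counts_a, counts_b))
    var_a = sum((x - mean_a) ** 2 for x in counts_a)
    var_b = sum((y - mean_b) ** 2 for y in counts_b)
    if var_a == 0 or var_b == 0:
        return 0.0  # a constant series carries no temporal signal
    return cov / sqrt(var_a * var_b)


def combined_score(pron_a, pron_b, counts_a, counts_b, weight=0.5):
    """Weighted mix of the two measures; the weight is an assumption."""
    return (weight * phonetic_similarity(pron_a, pron_b)
            + (1.0 - weight) * temporal_similarity(counts_a, counts_b))


if __name__ == "__main__":
    # Hypothetical guessed pronunciations and per-day counts for one term pair.
    print(combined_score("putin", "puchin", [0, 3, 5, 1], [1, 4, 6, 0]))
```

A candidate pair that sounds alike and whose mentions rise and fall together across the two corpora scores high on both components; how the components should be weighted depends on the language pair and corpus.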