Summary of the paper

Title Information Retrieval of Word Form Variants in Spoken Language Corpora Using Generalized Edit Distance
Authors Siim Orasmaa, Reina Käärik, Jaak Vilo and Tiit Hennoste
Abstract An important feature of spoken language corpora is existence of different spelling variants of words in transcription. So there is an important problem for linguist who works with large spoken corpora: how to find all variants of the word without annotating them manually? Our work describes a search engine that enables finding different spelling variants (true positives) from corpus of spoken language, and reduces efficiently the amount of false positives returned during the search. Our search engine uses a generalized variant of the edit distance algorithm that allows defining text-specific string to string transformations in addition to the default edit operations defined in edit distance. We have extended our algorithm with capability to block transformations in specific substrings of search words. User can mark certain regions (blocked regions) of the search word where edit operations are not allowed. Our material comes from the Corpus of Spoken Estonian of the University of Tartu which consists of about 2000 dialogues and texts, about 1.4 million running text units in total.
Language Lexicon, lexical database
Topics Corpus (creation, annotation, etc.), Tools, systems, applications, Lexicon, lexical database
Full paper Information Retrieval of Word Form Variants in Spoken Language Corpora Using Generalized Edit Distance
Bibtex @InProceedings{ORASMAA10.600,
  author = {Siim Orasmaa and Reina Käärik and Jaak Vilo and Tiit Hennoste},
  title = {Information Retrieval of Word Form Variants in Spoken Language Corpora Using Generalized Edit Distance},
  booktitle = {Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)},
  year = {2010},
  month = {may},
  date = {19-21},
  address = {Valletta, Malta},
  editor = {Nicoletta Calzolari (Conference Chair), Khalid Choukri, Bente Maegaard, Joseph Mariani, Jan Odijk, Stelios Piperidis, Mike Rosner, Daniel Tapias},
  publisher = {European Language Resources Association (ELRA)},
  isbn = {2-9517408-6-7},
  language = {english}
 }
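The abstract describes an edit distance generalized with user-defined string-to-string transformations and with blocked regions of the search word. The following is a minimal sketch of that idea, not the authors' implementation. The function name `generalized_edit_distance`, its parameters, the unit cost for default operations, the rule that an insertion is forbidden only between two blocked positions, and the `ph` to `f` rule in the usage lines are all assumptions made for illustration.

```python
from math import inf


def generalized_edit_distance(query, candidate, transformations=None, blocked=frozenset()):
    """Cost of turning `query` into `candidate`.

    transformations: dict mapping (src, dst) -> cost, where a non-empty
        substring `src` of the query may be rewritten as `dst` (hypothetical format).
    blocked: set of 0-based query indices where no edit operation is allowed.
    Returns math.inf if the blocked regions make the rewrite impossible.
    """
    transformations = transformations or {}
    n, m = len(query), len(candidate)
    # d[i][j] = cheapest way to turn query[:i] into candidate[:j]
    d = [[inf] * (m + 1) for _ in range(n + 1)]
    d[0][0] = 0.0

    for i in range(n + 1):
        for j in range(m + 1):
            if i == 0 and j == 0:
                continue
            best = inf
            if i > 0 and j > 0:
                if query[i - 1] == candidate[j - 1]:
                    # Characters agree: no edit needed, allowed even when blocked.
                    best = min(best, d[i - 1][j - 1])
                elif (i - 1) not in blocked:
                    # Default substitution, unit cost.
                    best = min(best, d[i - 1][j - 1] + 1)
            if i > 0 and (i - 1) not in blocked:
                # Default deletion of query[i-1], unit cost.
                best = min(best, d[i - 1][j] + 1)
            if j > 0 and not ((i - 1) in blocked and i in blocked):
                # Default insertion of candidate[j-1]; assumed forbidden only
                # when it would fall strictly inside a blocked region.
                best = min(best, d[i][j - 1] + 1)
            for (src, dst), cost in transformations.items():
                a, b = i - len(src), j - len(dst)
                if a < 0 or b < 0:
                    continue
                if query[a:i] != src or candidate[b:j] != dst:
                    continue
                if any(k in blocked for k in range(a, i)):
                    continue  # the rule would touch a blocked position
                best = min(best, d[a][b] + cost)
            d[i][j] = best
    return d[n][m]


# Illustrative rule, not taken from the paper: rewriting "ph" as "f" is cheap.
rules = {("ph", "f"): 0.1}
plain = generalized_edit_distance("photo", "foto")
with_rule = generalized_edit_distance("photo", "foto", rules)
blocked_prefix = generalized_edit_distance("photo", "foto", rules, blocked={0, 1})
```

With the cheap rule, a variant such as `foto` ranks much closer to `photo` than under plain edit distance, while blocking the first two characters rules the variant out entirely. A search engine built on such a function would presumably return corpus words whose distance to the search word stays below a chosen threshold.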