Summary of the paper

Title STeP-1: A Set of Fundamental Tools for Persian Text Processing
Authors Mehrnoush Shamsfard, Hoda Sadat Jafari and Mahdi Ilbeygi
Abstract Many NLP applications need fundamental tools to convert the input text into appropriate form or format and extract the primary linguistic knowledge of words and sentences. These tools perform segmentation of text into sentences, words and phrases, checking and correcting the spellings, doing lexical and morphological analysis, POS tagging and so on. Persian is among languages with complex preprocessing tasks. Having different writing prescriptions, spacings between or within words, character codings and spellings are some of the difficulties and challenges in converting various texts into a standard one. The lack of fundamental text processing tools such as morphological analyser (especially for derivational morphology) and POS tagger is another problem in Persian text processing. This paper introduces a set of fundamental tools for Persian text processing in STeP-1 package. STeP-1 (Standard Text Preparation for Persian language) performs a combination of tokenization, spell checking, morphological analysis and POS tagging. It also turns all Persian texts with different prescribed forms of writing to a series of tokens in the standard style introduced by Academy of Persian Language and Literature (APLL). Experimental results show high performance.
Language Part of speech tagging
Topics Tools, systems, applications, Morphology, Part of speech tagging
Full paper STeP-1: A Set of Fundamental Tools for Persian Text Processing
Bibtex @InProceedings{SHAMSFARD10.809,
  author = {Mehrnoush Shamsfard and Hoda Sadat Jafari and Mahdi Ilbeygi},
  title = {STeP-1: A Set of Fundamental Tools for Persian Text Processing},
  booktitle = {Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)},
  year = {2010},
  month = {may},
  date = {19-21},
  address = {Valletta, Malta},
  editor = {Nicoletta Calzolari (Conference Chair) and Khalid Choukri and Bente Maegaard and Joseph Mariani and Jan Odijk and Stelios Piperidis and Mike Rosner and Daniel Tapias},
  publisher = {European Language Resources Association (ELRA)},
  isbn = {2-9517408-6-7},
  language = {english}
 }