Title |
Fine-Grain Morphological Analyzer and Part-of-Speech Tagger for Arabic Text |
Authors |
Majdi Sawalha and Eric Atwell |
Abstract |
Morphological analyzers and part-of-speech taggers are key technologies for most text analysis applications. Our aim is to develop a part-of-speech tagger for annotating a wide range of Arabic text formats, domains and genres including both vowelized and non-vowelized text. Enriching the text with linguistic analysis will maximize the potential for corpus re-use in a wide range of applications. We foresee the advantage of enriching the text with part-of-speech tags of very fine-grained grammatical distinctions, which reflect expert interest in syntax and morphology, but not specific needs of end-users, because end-user applications are not known in advance. In this paper we review existing Arabic Part-of-Speech Taggers and tag-sets, and illustrate four different Arabic PoS tag-sets for a sample of Arabic text from the Quran. We describe the detailed fine-grained morphological feature tag set of Arabic, and the fine-grained Arabic morphological analyzer algorithm. We faced practical challenges in applying the morphological analyzer to the 100-million-word Web Arabic Corpus: we had to port the software to the National Grid Service, adapt the analyser to cope with spelling variations and errors, and utilise a Broad-Coverage Lexical Resource combining 23 traditional Arabic lexicons. Finally we outline the construction of a Gold Standard for comparative evaluation. |
Language |
Tools, systems, applications |
Topics |
Part of speech tagging, Morphology, Tools, systems, applications |
Full paper  |
Fine-Grain Morphological Analyzer and Part-of-Speech Tagger for Arabic Text |
Bibtex |
@InProceedings{SAWALHA10.282,
  author    = {Majdi Sawalha and Eric Atwell},
  title     = {Fine-Grain Morphological Analyzer and Part-of-Speech Tagger for Arabic Text},
  booktitle = {Proceedings of the Seventh conference on International Language Resources and Evaluation (LREC'10)},
  year      = {2010},
  month     = {may},
  date      = {19-21},
  address   = {Valletta, Malta},
  editor    = {Nicoletta Calzolari (Conference Chair), Khalid Choukri, Bente Maegaard, Joseph Mariani, Jan Odijk, Stelios Piperidis, Mike Rosner, Daniel Tapias},
  publisher = {European Language Resources Association (ELRA)},
  isbn      = {2-9517408-6-7},
  language  = {english}
} |