Title |
Constructing and Using Broad-coverage Lexical Resource for Enhancing Morphological Analysis of Arabic |
Authors |
Majdi Sawalha and Eric Atwell |
Abstract |
Broad-coverage language resources which provide prior linguistic knowledge must improve the accuracy and the performance of NLP applications. We are constructing a broad-coverage lexical resource to improve the accuracy of morphological analyzers and part-of-speech taggers of Arabic text. Over the past 1200 years, many different kinds of Arabic language lexicons were constructed; these lexicons are different in ordering, size and aim or goal of construction. We collected 23 machine-readable lexicons, which are freely available on the web. We combined lexical resources into one large broad-coverage lexical resource by extracting information from disparate formats and merging traditional Arabic lexicons. To evaluate the broad-coverage lexical resource we computed coverage over the Quran, the Corpus of Contemporary Arabic, and a sample from the Arabic Web Corpus, using two methods. Counting exact word matches between test corpora and lexicon scored about 65-68%; Arabic has a rich morphology with many combinations of roots, affixes and clitics, so about a third of words in the corpora did not have an exact match in the lexicon. The second approach is to compute coverage in terms of use in a lemmatizer program, which strips clitics to look for a match for the underlying lexeme; this scored about 82-85%. |
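The two coverage measures described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' lemmatizer: the clitic lists, the toy lexicon, the sample tokens, and the function names (`exact_coverage`, `candidate_stems`, `stripped_coverage`) are all hypothetical and chosen only to show why stripping clitics raises coverage over exact matching.

```python
# Hypothetical, deliberately tiny clitic lists (assumption, not from the paper).
PROCLITICS = ("وال", "بال", "ال", "و", "ف", "ب", "ل")
ENCLITICS = ("هم", "ها", "ه", "ك")


def exact_coverage(tokens, lexicon):
    """Fraction of tokens that appear verbatim in the lexicon."""
    if not tokens:
        return 0.0
    return sum(t in lexicon for t in tokens) / len(tokens)


def candidate_stems(token):
    """Yield the token itself, then forms with one proclitic and/or one enclitic removed."""
    yield token
    prefixes = [""] + [p for p in PROCLITICS if token.startswith(p)]
    suffixes = [""] + [s for s in ENCLITICS if token.endswith(s)]
    for p in prefixes:
        for s in suffixes:
            stem = token[len(p):len(token) - len(s)]
            # Skip the unchanged token and stems too short to be a lexeme.
            if stem != token and len(stem) >= 2:
                yield stem


def stripped_coverage(tokens, lexicon):
    """Fraction of tokens for which some clitic-stripped form is in the lexicon."""
    if not tokens:
        return 0.0
    hits = sum(any(s in lexicon for s in candidate_stems(t)) for t in tokens)
    return hits / len(tokens)


# Toy data (assumption): two lexicon entries and five corpus tokens.
toy_lexicon = {"كتاب", "بيت"}
toy_tokens = ["كتاب", "الكتاب", "وكتابه", "بيتهم", "مدرسة"]

exact = exact_coverage(toy_tokens, toy_lexicon)
stripped = stripped_coverage(toy_tokens, toy_lexicon)
print(f"exact match: {exact:.0%}, after clitic stripping: {stripped:.0%}")
```

On this toy data only one token matches the lexicon exactly, while three more match once a proclitic or enclitic is removed. A real lemmatizer would need fuller clitic inventories and checks that a stripped form is a valid analysis, which this sketch omits.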
Language |
Evaluation methodologies |
Topics |
Lexicon, lexical database, Morphology, Evaluation methodologies |
Full paper  |
Constructing and Using Broad-coverage Lexical Resource for Enhancing Morphological Analysis of Arabic |
Bibtex |
@InProceedings{SAWALHA10.287,
  author    = {Majdi Sawalha and Eric Atwell},
  title     = {Constructing and Using Broad-coverage Lexical Resource for Enhancing Morphological Analysis of Arabic},
  booktitle = {Proceedings of the Seventh conference on International Language Resources and Evaluation (LREC'10)},
  year      = {2010},
  month     = {may},
  date      = {19-21},
  address   = {Valletta, Malta},
  editor    = {Nicoletta Calzolari (Conference Chair), Khalid Choukri, Bente Maegaard, Joseph Mariani, Jan Odijk, Stelios Piperidis, Mike Rosner, Daniel Tapias},
  publisher = {European Language Resources Association (ELRA)},
  isbn      = {2-9517408-6-7},
  language  = {english}
} |