LREC 2010 Proceedings

Summary of the paper

Title	Arabic Word Segmentation for Better Unit of Analysis
Authors	Yassine Benajiba and Imed Zitouni
Abstract	The Arabic language has a very rich morphology in which a word is composed of zero or more prefixes, a stem, and zero or more suffixes. This makes Arabic data sparse compared to other languages, such as English, and consequently word segmentation becomes very important for many Natural Language Processing tasks that deal with the Arabic language. In this paper we present two segmentation schemes, morphological segmentation and Arabic TreeBank segmentation, and we show their impact on an important natural language processing task, mention detection. Experiments on the Arabic TreeBank corpus show 98.1% accuracy on morphological segmentation and 99.4% on Arabic TreeBank segmentation. We also discuss the importance of segmenting the text; experiments show up to 6 F-measure points of improvement in mention detection performance when morphological segmentation is used instead of not segmenting the text. The results also show that up to 3 F-measure points of improvement are achieved when the appropriate segmentation style is used.
Language	Named Entity recognition
Topics	Morphology, Statistical and machine learning methods, Named Entity recognition
Full paper	Arabic Word Segmentation for Better Unit of Analysis
Bibtex	@InProceedings{BENAJIBA10.54, author = {Yassine Benajiba and Imed Zitouni}, title = {Arabic Word Segmentation for Better Unit of Analysis}, booktitle = {Proceedings of the Seventh conference on International Language Resources and Evaluation (LREC'10)}, year = {2010}, month = {may}, date = {19-21}, address = {Valletta, Malta}, editor = {Nicoletta Calzolari (Conference Chair), Khalid Choukri, Bente Maegaard, Joseph Mariani, Jan Odijk, Stelios Piperidis, Mike Rosner, Daniel Tapias}, publisher = {European Language Resources Association (ELRA)}, isbn = {2-9517408-6-7}, language = {english} }