A novel Arabic lemmatization algorithm
Source: AND; Vol. 303
Proceedings of the Second Workshop on Analytics for Noisy Unstructured Text Data
Singapore
Pages: 113-118  
Year of Publication: 2008
ISBN:978-1-60558-196-5
Authors
Eiman Al-Shammari  Kuwait University, Fairfax, VA
Jessica Lin  George Mason University, Fairfax, VA
Publisher
ACM  New York, NY, USA
DOI: http://doi.acm.org/10.1145/1390749.1390767

ABSTRACT

Tokenization is a fundamental step in processing textual data, preceding the tasks of information retrieval, text mining, and natural language processing. It is language-dependent and includes normalization, stop-word removal, lemmatization, and stemming.
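
As a rough illustration of the stages named above, the following Python sketch chains normalization, stop-word removal, and a pluggable word-reduction step. It is not the system described in this paper: the three-word stop list, the normalization rules (stripping diacritics and tatweel, unifying alef variants), and the names `normalize` and `tokenize` are assumptions chosen to reflect common practice in Arabic text processing.

```python
import re

# Hypothetical, tiny stop-word list for illustration only; the system
# described in the abstract uses a list of more than 2200 words.
STOP_WORDS = {"في", "من", "على"}

DIACRITICS = re.compile(r"[\u064B-\u0652]")          # short-vowel marks, shadda, sukun
TATWEEL = "\u0640"                                    # elongation character
ALEF_VARIANTS = re.compile(r"[\u0622\u0623\u0625]")  # alef with madda / hamza

def normalize(text):
    """Apply assumed normalization rules: drop diacritics and tatweel,
    then map alef variants to bare alef."""
    text = DIACRITICS.sub("", text)
    text = text.replace(TATWEEL, "")
    return ALEF_VARIANTS.sub("\u0627", text)

def tokenize(text, reduce_word=lambda w: w):
    """Normalize, split on whitespace, drop stop words, and pass each
    remaining word through reduce_word (a stemmer or lemmatizer)."""
    words = normalize(text).split()
    return [reduce_word(w) for w in words if w not in STOP_WORDS]
```

The `reduce_word` parameter is left as an identity function by default so that either a stemmer or a lemmatizer can be plugged in as the final stage.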

Stemming and lemmatization share the goal of reducing a word to its base form. However, lemmatization is more robust than stemming because it often draws on a vocabulary and morphological analysis rather than simply removing the word's suffix. In this work, we introduce a novel lemmatization algorithm for the Arabic language.
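
The contrast between the two approaches can be sketched with a toy example. The code below is not the algorithm introduced in this paper; the affix lists, the two-entry lexicon, and the function names `light_stem` and `lemmatize` are hypothetical and serve only to show why a vocabulary lookup can succeed where affix stripping alone does not, for instance on an Arabic broken plural.

```python
# Toy affix lists, assumed for illustration only.
PREFIXES = ("ال",)              # definite article
SUFFIXES = ("ات", "ون", "ين")   # common regular plural endings

# Hypothetical lexicon mapping surface forms to lemmas; a real lemmatizer
# would rely on a much larger vocabulary plus morphological analysis.
LEXICON = {"كتب": "كتاب", "اولاد": "ولد"}

def light_stem(word):
    """Strip at most one known prefix and one known suffix, keeping at
    least three characters of the word."""
    for p in PREFIXES:
        if word.startswith(p) and len(word) - len(p) >= 3:
            word = word[len(p):]
            break
    for s in SUFFIXES:
        if word.endswith(s) and len(word) - len(s) >= 3:
            word = word[:-len(s)]
            break
    return word

def lemmatize(word):
    """Strip affixes, then look the result up in the lexicon; fall back
    to the stripped form when the word is unknown."""
    stem = light_stem(word)
    return LEXICON.get(stem, stem)
```

In this sketch, the definite plural "the books" loses only its article under `light_stem` and remains a plural form, whereas `lemmatize` maps it to the singular "book" because the broken plural is listed in the lexicon. A regular plural such as "teachers" is handled identically by both functions, since suffix removal is enough there.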

The lemmatizer proposed here is part of a comprehensive Arabic tokenization system with a stop-word list exceeding 2200 Arabic words. Currently, there are two leading Arabic stemmers: the root-based stemmer and the light stemmer. We hypothesize that lemmatization is more effective than stemming for mining Arabic text. We investigate the impact of our new lemmatizer on unsupervised data mining techniques in comparison with the leading Arabic stemmers. We conclude that lemmatization is a better word normalization method than stemming for Arabic text.


