Concept search in Urdu
Source
Conference on Information and Knowledge Management
Proceeding of the 2nd PhD Workshop on Information and Knowledge Management
Napa Valley, California, USA
SESSION: Session 2
Pages 33-40
Year of Publication: 2008
ISBN: 978-1-60558-257-3
Author
Kashif Riaz, University of Minnesota, Minneapolis, MN, USA
Sponsors
SIGWEB: ACM Special Interest Group on Hypertext, Hypermedia, and Web
SIGIR: ACM Special Interest Group on Information Retrieval
ACM: Association for Computing Machinery
Publisher
ACM, New York, NY, USA
DOI: http://doi.acm.org/10.1145/1458550.1458557

ABSTRACT

This paper describes a thesis proposal for concept search in non-English, non-European languages. Urdu is chosen as an example language because of its unique nature, its morphology, and its large number of speakers. Despite its importance, Urdu lacks adequate language resources for research in Information Retrieval (IR). The paper shows that concept-search methods used for English are inadequate for Urdu and presents some novel approaches to concept searching. Pre-processing IR tasks such as stop word identification and stemming require complex research for a morphologically rich language like Urdu. Named-entity identification is hypothesized to be useful in determining the concept the user is seeking, and the research plan includes an implementation of named-entity identification for Urdu. An Urdu language toolkit will be made available to the IR community for Urdu language processing. Finally, TREC-like evaluation criteria are presented, with relevance judgments, a test collection, and queries for Urdu IR.
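
As an illustration of what a TREC-like evaluation involves, the sketch below scores ranked result lists against relevance judgments using average precision and its mean over queries (MAP). It is a minimal example under stated assumptions, not the evaluation procedure from the paper: the function names, the document and query identifiers, and the choice of MAP as the measure are all hypothetical.

```python
from typing import Dict, List, Set


def average_precision(ranked: List[str], relevant: Set[str]) -> float:
    """Average precision of one ranked list against binary relevance judgments."""
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, doc_id in enumerate(ranked, start=1):
        if doc_id in relevant:
            hits += 1
            # Precision at the rank where a relevant document is found.
            precision_sum += hits / rank
    # Relevant documents that were never retrieved contribute zero.
    return precision_sum / len(relevant)


def mean_average_precision(
    runs: Dict[str, List[str]], qrels: Dict[str, Set[str]]
) -> float:
    """Mean of average precision over all queries in the run."""
    if not runs:
        return 0.0
    scores = [
        average_precision(ranked, qrels.get(query_id, set()))
        for query_id, ranked in runs.items()
    ]
    return sum(scores) / len(scores)


# Hypothetical queries, documents, and judgments, for illustration only.
example_runs = {
    "q1": ["d1", "d2", "d3", "d4"],
    "q2": ["d5", "d6"],
}
example_qrels = {
    "q1": {"d1", "d3"},
    "q2": {"d6"},
}
example_map = mean_average_precision(example_runs, example_qrels)
```

Here `example_runs` plays the role of a system's ranked output per query and `example_qrels` the role of the relevance judgments; a real test collection would supply both from files rather than inline dictionaries.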

