ABSTRACT
This paper describes a thesis proposal for concept search in non-English, non-European languages. Urdu is chosen as the example language because of its unique nature, its morphology, and its large number of speakers. Despite its importance, Urdu lacks the language resources needed to support research in Information Retrieval (IR). The paper shows that the methods used for concept searching in English are inadequate for Urdu and presents some novel approaches to concept searching. Pre-processing IR tasks such as stop word identification and stemming require complex research for a morphologically rich language like Urdu. Named-entity identification is hypothesized to be useful in determining the concept the user is seeking, and the research plan includes an implementation of named-entity identification for Urdu. An Urdu language toolkit will be made available to the IR community for Urdu language processing. Finally, TREC-like evaluation criteria are presented, with relevance judgments, a test collection, and queries for Urdu IR.