Automatic classification of Web queries using very large unlabeled query logs
Source
ACM Transactions on Information Systems (TOIS)
Volume 25, Issue 2 (April 2007)
Article No. 9
Year of Publication: 2007
ISSN: 1046-8188
Authors
Steven M. Beitzel  Illinois Institute of Technology, Chicago, IL
Eric C. Jensen  Illinois Institute of Technology, Chicago, IL
David D. Lewis  David D. Lewis Consulting, Chicago, IL
Abdur Chowdhury  Illinois Institute of Technology, Chicago, IL
Ophir Frieder  Illinois Institute of Technology, Chicago, IL
Publisher
ACM  New York, NY, USA
DOI: http://doi.acm.org/10.1145/1229179.1229183

ABSTRACT

Accurate topical classification of user queries allows for increased effectiveness and efficiency in general-purpose Web search systems. Such classification becomes critical if the system must route queries to a subset of topic-specific and resource-constrained back-end databases. Successful query classification poses a challenging problem, as Web queries are short, thus providing few features. This feature sparseness, coupled with the constantly changing distribution and vocabulary of queries, hinders traditional text classification. We attack this problem by combining multiple classifiers, including exact lookup and partial matching in databases of manually classified frequent queries, linear models trained by supervised learning, and a novel approach based on mining selectional preferences from a large unlabeled query log. Our approach classifies queries without using external sources of information, such as online Web directories or the contents of retrieved pages, making it viable for use in demanding operational environments, such as large-scale Web search services. We evaluate our approach using a large sample of queries from an operational Web search engine and show that our combined method increases recall by nearly 40% over the best single method while maintaining adequate precision. Additionally, we compare our results to those from the 2005 KDD Cup and find that we perform competitively despite our operational restrictions. This suggests it is possible to topically classify a significant portion of the query stream without requiring external sources of information, allowing for deployment in operationally restricted environments.
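As a rough illustration of the combination idea described in the abstract, the sketch below pairs an exact lookup in a small table of manually classified queries with a much-simplified form of selectional preference mining over an unlabeled log. Everything in it is an assumption made for illustration: the toy data, the function names, the restriction to two-word queries, the count and share thresholds, and the use of a plain union to combine classifiers. The abstract does not specify these details, and the authors' actual method may differ substantially.

```python
from collections import Counter, defaultdict

# Hypothetical toy data (not from the paper): manually classified frequent
# queries, reused here as a term-to-category lexicon, and an unlabeled log.
LABELED = {"jazz": "music", "blues": "music", "laptop": "computing"}
LOG = ["lyrics jazz", "lyrics blues", "lyrics reggae", "cheap laptop"]


def exact_lookup(query, labeled):
    """Return the category of a query found verbatim in the labeled table."""
    return {labeled[query]} if query in labeled else set()


def mine_preferences(log, lexicon, min_count=2, min_share=0.8):
    """Learn rules 'first word -> category' from two-word unlabeled queries.

    A first word earns a rule when the second words it co-occurs with, among
    those the lexicon can label, fall mostly into one category. The thresholds
    are illustrative assumptions.
    """
    counts = defaultdict(Counter)
    for query in log:
        words = query.split()
        if len(words) != 2:
            continue
        head, tail = words
        if tail in lexicon:
            counts[head][lexicon[tail]] += 1
    rules = {}
    for head, cats in counts.items():
        category, n = cats.most_common(1)[0]
        if n >= min_count and n / sum(cats.values()) >= min_share:
            rules[head] = category
    return rules


def preference_classify(query, rules):
    """Apply a mined rule when the query's first word has one."""
    words = query.split()
    if len(words) == 2 and words[0] in rules:
        return {rules[words[0]]}
    return set()


def combined_classify(query, labeled, rules):
    """Union the outputs, so either classifier can contribute a category."""
    return exact_lookup(query, labeled) | preference_classify(query, rules)


rules = mine_preferences(LOG, LABELED)
print(combined_classify("lyrics reggae", LABELED, rules))
```

In this toy setup, "lyrics reggae" is absent from the labeled table, so exact lookup alone returns nothing; the rule mined from the log supplies a category instead. That is the sense in which adding a classifier trained on unlabeled data can raise recall, although a union of this kind may also lower precision.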



