ABSTRACT
Accurate topical classification of user queries allows for increased effectiveness and efficiency in general-purpose Web search systems. Such classification becomes critical if the system must route queries to a subset of topic-specific and resource-constrained back-end databases. Successful query classification poses a challenging problem, as Web queries are short, thus providing few features. This feature sparseness, coupled with the constantly changing distribution and vocabulary of queries, hinders traditional text classification. We attack this problem by combining multiple classifiers, including exact lookup and partial matching in databases of manually classified frequent queries, linear models trained by supervised learning, and a novel approach based on mining selectional preferences from a large unlabeled query log. Our approach classifies queries without using external sources of information, such as online Web directories or the contents of retrieved pages, making it viable for use in demanding operational environments, such as large-scale Web search services. We evaluate our approach using a large sample of queries from an operational Web search engine and show that our combined method increases recall by nearly 40% over the best single method while maintaining adequate precision. Additionally, we compare our results to those from the 2005 KDD Cup and find that we perform competitively despite our operational restrictions. This suggests it is possible to topically classify a significant portion of the query stream without requiring external sources of information, allowing for deployment in operationally restricted environments.
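As a rough illustration of the combination idea described above, the following minimal sketch combines an exact-lookup classifier and a partial-match classifier over a table of manually labeled queries. The abstract does not state how the individual classifiers' outputs are merged, so taking the union of their predicted topics is an assumption made here for illustration only; the function names and the toy labeled queries are likewise hypothetical. The supervised linear models and the selectional-preference miner are omitted, though any function with the same signature could be added to the list.

```python
from typing import Callable, Dict, List, Set

# A classifier maps a query string to a (possibly empty) set of topic labels.
Classifier = Callable[[str], Set[str]]


def make_exact_lookup(labeled: Dict[str, Set[str]]) -> Classifier:
    """Return topics only when the whole normalized query is in the table."""
    def classify(query: str) -> Set[str]:
        return set(labeled.get(query.strip().lower(), set()))
    return classify


def make_partial_match(labeled: Dict[str, Set[str]]) -> Classifier:
    """Return topics of every labeled entry that appears as a contiguous
    word sequence inside the query (a simple stand-in for partial matching)."""
    def classify(query: str) -> Set[str]:
        tokens = query.strip().lower().split()
        topics: Set[str] = set()
        for start in range(len(tokens)):
            for end in range(start + 1, len(tokens) + 1):
                phrase = " ".join(tokens[start:end])
                topics |= labeled.get(phrase, set())
        return topics
    return classify


def combine(classifiers: List[Classifier]) -> Classifier:
    """ASSUMPTION: merge by set union, which can only add labels and so
    tends to raise recall; the paper's actual rule is not given here."""
    def classify(query: str) -> Set[str]:
        topics: Set[str] = set()
        for classifier in classifiers:
            topics |= classifier(query)
        return topics
    return classify


# Hypothetical toy table of manually classified frequent queries.
LABELED = {
    "yankees": {"sports"},
    "cheap flights": {"travel"},
}

exact = make_exact_lookup(LABELED)
partial = make_partial_match(LABELED)
combined = combine([exact, partial])

if __name__ == "__main__":
    for q in ["yankees", "yankees tickets", "cheap flights to boston", "weather"]:
        print(q, "->", sorted(combined(q)))
```

In this toy setup, exact lookup alone misses "yankees tickets", while the combined classifier still assigns it a topic through the partial match. This mirrors, in miniature, why adding further classifiers can raise recall, at the cost of any precision errors each one introduces.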