ABSTRACT
Web clustering engines organize search results by topic, thus offering a complementary view to the flat-ranked list returned by conventional search engines. In this survey, we discuss the issues that must be addressed in the development of a Web clustering engine, including the acquisition and preprocessing of search results, their clustering, and their visualization. Search results clustering, the core of the system, has specific requirements that cannot be addressed by classical clustering algorithms. We emphasize the role played by the quality of the cluster labels, as opposed to optimizing only the clustering structure. We highlight the main characteristics of a number of existing Web clustering engines and discuss how to evaluate their retrieval performance. Finally, we present some directions for future research.