A survey of Web clustering engines
Source
ACM Computing Surveys (CSUR)
Volume 41, Issue 3 (July 2009)
Article No. 17
Year of Publication: 2009
ISSN: 0360-0300
Authors
Claudio Carpineto, Fondazione Ugo Bordoni, Roma, Italy
Stanislaw Osiński, Carrot Search
Giovanni Romano, Fondazione Ugo Bordoni, Roma, Italy
Dawid Weiss, Poznan University of Technology, Poznan, Poland
Publisher
ACM, New York, NY, USA
DOI: http://doi.acm.org/10.1145/1541880.1541884

ABSTRACT

Web clustering engines organize search results by topic, thus offering a complementary view to the flat-ranked list returned by conventional search engines. In this survey, we discuss the issues that must be addressed in the development of a Web clustering engine, including acquisition and preprocessing of search results, their clustering and visualization. Search results clustering, the core of the system, has specific requirements that cannot be addressed by classical clustering algorithms. We emphasize the role played by the quality of the cluster labels as opposed to optimizing only the clustering structure. We highlight the main characteristics of a number of existing Web clustering engines and also discuss how to evaluate their retrieval performance. Some directions for future research are finally presented.



