ABSTRACT
Automatically clustering web pages into semantic groups promises improved search and browsing on the web. In this paper, we demonstrate how user-generated tags from large-scale social bookmarking websites such as del.icio.us can be used as a complementary data source to page text and anchor text for improving automatic clustering of web pages. This paper explores the use of tags in 1) K-means clustering in an extended vector space model that includes tags as well as page text and 2) a novel generative clustering algorithm based on latent Dirichlet allocation that jointly models text and tags. We evaluate the models by comparing their output to an established web directory. We find that the naive inclusion of tagging data improves cluster quality versus page text alone, but a more principled inclusion can substantially improve the quality of all models with a statistically significant absolute F-score increase of 4%. The generative model outperforms K-means with another 8% F-score increase.
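To make the first approach concrete, the following is a minimal sketch of K-means clustering in a vector space that holds both page words and tags. The abstract does not say how the two sources are combined, so everything specific here is an assumption made for illustration: the separate L2 normalization of words and tags, the `tag_weight` parameter, the cosine similarity, the first-k initialization, and the toy pages. It should not be read as the paper's actual weighting scheme, and it does not cover the LDA-based generative model.

```python
import math
from collections import Counter


def l2_normalize(vec):
    """Scale a sparse vector (dict) to unit length; empty or zero vectors become {}."""
    norm = math.sqrt(sum(v * v for v in vec.values()))
    return {k: v / norm for k, v in vec.items()} if norm else {}


def combined_vector(words, tags, tag_weight=0.5):
    """Normalize words and tags separately, then join them in one feature space.

    Assumption: separate normalization keeps a long page text from drowning
    out a handful of tags. tag_weight is a hypothetical mixing parameter.
    """
    word_vec = l2_normalize(Counter(words))
    tag_vec = l2_normalize(Counter(tags))
    vec = {("word", k): (1 - tag_weight) * v for k, v in word_vec.items()}
    vec.update({("tag", k): tag_weight * v for k, v in tag_vec.items()})
    return l2_normalize(vec)


def cosine(a, b):
    """Cosine similarity of two unit-length sparse vectors (a plain dot product)."""
    return sum(v * b.get(k, 0.0) for k, v in a.items())


def centroid(vectors):
    """Sum the member vectors and renormalize to unit length."""
    total = {}
    for vec in vectors:
        for k, v in vec.items():
            total[k] = total.get(k, 0.0) + v
    return l2_normalize(total)


def kmeans(vectors, k, iterations=10):
    """Plain K-means with cosine similarity; the first k vectors seed the centroids."""
    centroids = [vectors[i] for i in range(k)]
    assignment = []
    for _ in range(iterations):
        # Assign each page to its most similar centroid.
        assignment = [
            max(range(k), key=lambda c: cosine(v, centroids[c])) for v in vectors
        ]
        # Recompute each centroid from its current members.
        for c in range(k):
            members = [v for v, a in zip(vectors, assignment) if a == c]
            if members:
                centroids[c] = centroid(members)
    return assignment


# Toy pages as (page words, user tags); invented for this example.
pages = [
    (["python", "tutorial", "code"], ["programming", "python"]),
    (["recipe", "pasta", "sauce"], ["cooking", "food"]),
    (["java", "compiler", "code"], ["programming"]),
    (["bread", "oven", "recipe"], ["food", "baking"]),
]
vectors = [combined_vector(words, tags) for words, tags in pages]
labels = kmeans(vectors, k=2)
print(labels)
```

Setting `tag_weight=0` reduces this to clustering on page text alone, which gives a simple baseline for comparing cluster quality with and without tags.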