Clustering the tagged web
Source: Web Search and Web Data Mining
Proceedings of the Second ACM International Conference on Web Search and Data Mining
Barcelona, Spain
SESSION: Classification and clustering
Pages 54-63  
Year of Publication: 2009
ISBN: 978-1-60558-390-7
Authors
Daniel Ramage  Serra Mall, Stanford, CA
Paul Heymann  Serra Mall, Stanford, CA
Christopher D. Manning  Serra Mall, Stanford, CA
Hector Garcia-Molina  Serra Mall, Stanford, CA
Sponsors
SIGMOD: ACM Special Interest Group on Management of Data
Google
SIGIR: ACM Special Interest Group on Information Retrieval
SIGWEB: ACM Special Interest Group on Hypertext, Hypermedia, and Web
Yahoo! Research
Microsoft
Nokia
SIGKDD: ACM Special Interest Group on Knowledge Discovery in Data
Publisher
ACM  New York, NY, USA
DOI: http://doi.acm.org/10.1145/1498759.1498809

ABSTRACT

Automatically clustering web pages into semantic groups promises improved search and browsing on the web. In this paper, we demonstrate how user-generated tags from large-scale social bookmarking websites such as del.icio.us can be used as a complementary data source to page text and anchor text for improving automatic clustering of web pages. This paper explores the use of tags in 1) K-means clustering in an extended vector space model that includes tags as well as page text and 2) a novel generative clustering algorithm based on latent Dirichlet allocation that jointly models text and tags. We evaluate the models by comparing their output to an established web directory. We find that the naive inclusion of tagging data improves cluster quality versus page text alone, but a more principled inclusion can substantially improve the quality of all models with a statistically significant absolute F-score increase of 4%. The generative model outperforms K-means with another 8% F-score increase.
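The first approach in the abstract, K-means over a vector space that holds both page text and tags, can be illustrated with a small sketch. Everything below is an assumption made for illustration and is not taken from the paper: the toy pages, the helper names (`combined_vector`, `kmeans`, `pairwise_f1`), the choice to L2-normalize the word block and the tag block separately before concatenating them with equal weight, the cosine similarity, the deterministic seeding, and the pairwise F1 used as the cluster F-score. The abstract does not say which weighting or F-score variant the authors used, and the generative LDA-based model is not attempted here.

```python
import math
from collections import Counter
from itertools import combinations


def l2_normalize(counts):
    """Scale a sparse term -> count mapping to unit Euclidean length."""
    norm = math.sqrt(sum(v * v for v in counts.values()))
    return {term: v / norm for term, v in counts.items()} if norm else {}


def combined_vector(words, tags):
    """Normalize words and tags separately, then concatenate the two blocks.

    The "w"/"t" key prefixes keep the vocabularies disjoint. Equal weighting
    of the two blocks is an assumption of this sketch.
    """
    vec = {("w", t): v for t, v in l2_normalize(Counter(words)).items()}
    vec.update({("t", t): v for t, v in l2_normalize(Counter(tags)).items()})
    return vec


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(v * b.get(k, 0.0) for k, v in a.items())
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def centroid(vectors):
    """Component-wise mean of a non-empty list of sparse vectors."""
    total = {}
    for vec in vectors:
        for k, v in vec.items():
            total[k] = total.get(k, 0.0) + v
    return {k: v / len(vectors) for k, v in total.items()}


def kmeans(vectors, k, iterations=10):
    """Plain K-means with cosine similarity, seeded with the first k vectors."""
    centroids = [dict(v) for v in vectors[:k]]
    labels = [0] * len(vectors)
    for _ in range(iterations):
        # Assignment step: each page goes to its most similar centroid.
        labels = [max(range(k), key=lambda c: cosine(v, centroids[c]))
                  for v in vectors]
        # Update step: recompute each non-empty cluster's centroid.
        for c in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == c]
            if members:
                centroids[c] = centroid(members)
    return labels


def pairwise_f1(labels, gold):
    """F1 over page pairs: a pair is positive when both pages share a group."""
    tp = fp = fn = 0
    for i, j in combinations(range(len(labels)), 2):
        same_pred = labels[i] == labels[j]
        same_gold = gold[i] == gold[j]
        tp += same_pred and same_gold
        fp += same_pred and not same_gold
        fn += same_gold and not same_pred
    if tp == 0:
        return 0.0
    precision, recall = tp / (tp + fp), tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical pages: (words from the page text, tags from bookmarking users).
pages = [
    (["python", "code", "tutorial"], ["programming", "python"]),
    (["recipe", "pasta", "sauce"], ["cooking", "food"]),
    (["java", "code", "compiler"], ["programming", "java"]),
    (["bread", "oven", "recipe"], ["food", "baking"]),
]
gold = ["tech", "food", "tech", "food"]  # stand-in for web directory categories

vectors = [combined_vector(words, tags) for words, tags in pages]
labels = kmeans(vectors, k=2)
print(labels, pairwise_f1(labels, gold))
```

Normalizing the two blocks separately keeps a long page text from drowning out a handful of tags, which is one plausible reading of a "more principled inclusion" as opposed to simply merging tags into the word vocabulary.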



