Coarse-grained classification of web sites by their structural properties
Source: Workshop on Web Information and Data Management
Proceedings of the 8th annual ACM international workshop on Web information and data management
Arlington, Virginia, USA
SESSION: Web ranking and classification
Pages: 35 - 42  
Year of Publication: 2006
ISBN:1-59593-525-8
Authors
Christoph Lindemann  University of Leipzig, Leipzig, Germany
Lars Littig  University of Leipzig, Leipzig, Germany
Sponsors
SIGIR: ACM Special Interest Group on Information Retrieval
ACM: Association for Computing Machinery
Publisher
ACM  New York, NY, USA
DOI: http://doi.acm.org/10.1145/1183550.1183559

ABSTRACT

In this paper, we identify and analyze structural properties that reflect the functionality of a Web site. These structural properties consider the size, the organization, the composition of URLs, and the link structure of Web sites. In contrast to previous work, we perform a comprehensive measurement study to delve into the relation between the structure and the functionality of Web sites. Our study focuses on five of the most relevant functional classes, namely Academic, Blog, Corporate, Personal, and Shop. It is based upon more than 1,400 Web sites composed of 7 million crawled and 47 million known Web pages. We present a detailed statistical analysis that provides insight into how structural properties can be used to distinguish between Web sites from different functional classes. Building on these results, we introduce a content-independent approach for the automated coarse-grained classification of Web sites. A naïve Bayesian classifier with advanced density estimation yields a precision of 82% and a recall of 80% for the classification of Web sites into the considered classes.
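As an illustration of the general idea only, the following sketch shows a naïve Bayesian classifier that models each numeric structural feature per class with a kernel density estimate. The abstract does not say which density estimator or which concrete features the authors use, so the Gaussian kernel, the fixed bandwidth, the two features (log10 of the page count and mean URL depth), and all toy values below are assumptions made for this example, not the paper's method or data.

```python
import math
from collections import defaultdict


def gaussian_kde(samples, bandwidth):
    """Return a 1-D density function built from samples with a Gaussian kernel."""
    norm = 1.0 / (len(samples) * bandwidth * math.sqrt(2.0 * math.pi))

    def density(x):
        return norm * sum(
            math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in samples
        )

    return density


class KdeNaiveBayes:
    """Naive Bayes over numeric features, one kernel density per class and feature."""

    def __init__(self, bandwidth=0.5):
        self.bandwidth = bandwidth  # assumed fixed bandwidth, not tuned
        self.log_priors = {}
        self.densities = {}

    def fit(self, rows, labels):
        by_class = defaultdict(list)
        for row, label in zip(rows, labels):
            by_class[label].append(row)
        for label, members in by_class.items():
            self.log_priors[label] = math.log(len(members) / len(rows))
            # Transpose so that each column holds one feature's values.
            columns = list(zip(*members))
            self.densities[label] = [
                gaussian_kde(col, self.bandwidth) for col in columns
            ]
        return self

    def predict(self, row):
        def log_posterior(label):
            score = self.log_priors[label]
            for density, value in zip(self.densities[label], row):
                # Small constant avoids log(0) when a density underflows.
                score += math.log(density(value) + 1e-300)
            return score

        return max(self.log_priors, key=log_posterior)


# Hypothetical toy data: (log10 of page count, mean URL depth) per Web site.
train_rows = [
    (4.1, 3.2), (4.4, 3.6), (3.9, 3.0),   # labelled Academic
    (2.0, 1.2), (1.7, 1.0), (2.2, 1.5),   # labelled Personal
]
train_labels = ["Academic"] * 3 + ["Personal"] * 3

model = KdeNaiveBayes(bandwidth=0.5).fit(train_rows, train_labels)
print(model.predict((4.0, 3.1)))
print(model.predict((1.9, 1.1)))
```

The classifier never inspects page text: it scores a site only from numeric structural features, which is what makes such an approach content-independent. A real system would need many more features, a principled bandwidth choice, and held-out data to measure precision and recall.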


