Coarse-grained classification of web sites by their structural properties
Source
Workshop On Web Information And Data Management
Proceedings of the 8th annual ACM international workshop on Web information and data management
Arlington, Virginia, USA
SESSION: Web ranking and classification
Pages: 35 - 42
Year of Publication: 2006
ISBN: 1-59593-525-8
ABSTRACT
In this paper, we identify and analyze structural properties that reflect the functionality of a Web site. These properties cover the size, the organization, the composition of URLs, and the link structure of Web sites. In contrast to previous work, we perform a comprehensive measurement study of the relation between the structure and the functionality of Web sites. Our study focuses on five of the most relevant functional classes: Academic, Blog, Corporate, Personal, and Shop. It is based on more than 1,400 Web sites comprising 7 million crawled and 47 million known Web pages. We present a detailed statistical analysis that provides insight into how structural properties can be used to distinguish between Web sites of different functional classes. Building on these results, we introduce a content-independent approach for the automated coarse-grained classification of Web sites. A naïve Bayesian classifier with advanced density estimation yields a precision of 82% and a recall of 80% for the classification of Web sites into the considered classes.
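The classification idea in the abstract can be illustrated with a small sketch. The abstract does not say which density estimator the authors use, so the Gaussian kernel density estimate below is an assumption, as are the two features (log page count and mean URL depth), the bandwidth, and every number in the toy data. It shows only the general shape of a naive Bayes classifier over structural features, together with per-class precision and recall; it is not the paper's implementation.

```python
import math
from collections import defaultdict


def gaussian_kde(samples, bandwidth):
    """Return a 1-D density function built from samples with a Gaussian kernel."""
    norm = 1.0 / (len(samples) * bandwidth * math.sqrt(2.0 * math.pi))

    def density(x):
        return norm * sum(
            math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in samples
        )

    return density


class KdeNaiveBayes:
    """Naive Bayes with one kernel density estimate per (class, feature) pair."""

    def __init__(self, bandwidth=0.5):
        self.bandwidth = bandwidth  # assumed value, not taken from the paper
        self.log_priors = {}
        self.densities = {}  # class label -> list of per-feature densities

    def fit(self, rows, labels):
        by_class = defaultdict(list)
        for row, label in zip(rows, labels):
            by_class[label].append(row)
        for label, members in by_class.items():
            self.log_priors[label] = math.log(len(members) / len(rows))
            columns = list(zip(*members))  # transpose: one tuple per feature
            self.densities[label] = [
                gaussian_kde(column, self.bandwidth) for column in columns
            ]
        return self

    def predict(self, row):
        def log_score(label):
            score = self.log_priors[label]
            for value, density in zip(row, self.densities[label]):
                # Tiny constant avoids log(0) when a density underflows.
                score += math.log(density(value) + 1e-300)
            return score

        return max(self.log_priors, key=log_score)


def precision_recall(true_labels, predicted, target):
    """Precision and recall for one class, treated as the positive class."""
    pairs = list(zip(true_labels, predicted))
    tp = sum(1 for t, p in pairs if t == target and p == target)
    fp = sum(1 for t, p in pairs if t != target and p == target)
    fn = sum(1 for t, p in pairs if t == target and p != target)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Made-up toy data: [log10(page count), mean URL depth] per Web site.
train_rows = [
    [4.2, 3.5], [4.5, 3.8], [4.0, 3.2],  # hypothetical "Academic" sites
    [2.1, 1.2], [1.8, 1.0], [2.3, 1.5],  # hypothetical "Personal" sites
]
train_labels = ["Academic"] * 3 + ["Personal"] * 3

model = KdeNaiveBayes(bandwidth=0.5).fit(train_rows, train_labels)

test_rows = [[4.3, 3.6], [2.0, 1.1]]
test_labels = ["Academic", "Personal"]
predicted = [model.predict(row) for row in test_rows]
print(predicted)
print(precision_recall(test_labels, predicted, "Academic"))
```

The classifier never looks at page text, which mirrors the content-independent approach described above. On real data the bandwidth would need tuning, and the feature set would follow the structural properties the paper actually measures.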