Knowing a web page by the company it keeps
Source: Conference on Information and Knowledge Management
Proceedings of the 15th ACM international conference on Information and knowledge management
Arlington, Virginia, USA
SESSION: Classification - 1
Pages: 228 - 237  
Year of Publication: 2006
ISBN:1-59593-433-2
Authors
Xiaoguang Qi  Lehigh University, Bethlehem, PA
Brian D. Davison  Lehigh University, Bethlehem, PA
Sponsors
ACM: Association for Computing Machinery
SIGIR: ACM Special Interest Group on Information Retrieval
SIGWEB: ACM Special Interest Group on Hypertext, Hypermedia, and Web
Publisher
ACM  New York, NY, USA
DOI: http://doi.acm.org/10.1145/1183614.1183650

ABSTRACT

Web page classification is important to many tasks in information retrieval and web mining. However, applying traditional textual classifiers to web data often produces unsatisfactory results. Fortunately, hyperlink information provides important clues to the categorization of a web page. In this paper, an improved method is proposed to enhance web page classification by utilizing the class information from neighboring pages in the link graph. The categories represented by four kinds of neighbors (parents, children, siblings and spouses) are combined to help classify the page in question. In experiments to study the effect of these factors on our algorithm, we find that the proposed method is able to boost the classification accuracy of common textual classifiers from around 70% to more than 90% on a large dataset of pages from the Open Directory Project, and that it outperforms existing algorithms. Unlike prior techniques, our approach utilizes same-host links and can improve classification accuracy even when neighboring pages are unlabeled. Finally, while all neighbor types can contribute, sibling pages are found to be the most important.
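
The neighbor-based idea in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names (`neighbors`, `average`, `classify`), the linear weighting scheme, and every weight value are assumptions made for illustration. The sketch takes parents to be pages that link to the target, children to be pages it links to, siblings to be pages sharing a parent with it, and spouses to be pages sharing a child with it. The per-page class distributions in `page_dists` are assumed to come from labels or from a textual classifier's output.

```python
from collections import defaultdict


def neighbors(links, page):
    """Return the four neighbor sets of `page` in a directed link graph.

    `links` is an iterable of (source, target) pairs.
    """
    out_links = defaultdict(set)
    in_links = defaultdict(set)
    for src, dst in links:
        out_links[src].add(dst)
        in_links[dst].add(src)

    parents = set(in_links[page])
    children = set(out_links[page])
    # Siblings share at least one parent with the page.
    siblings = {s for p in parents for s in out_links[p]}
    # Spouses share at least one child with the page.
    spouses = {s for c in children for s in in_links[c]}
    for group in (parents, children, siblings, spouses):
        group.discard(page)
    return {"parents": parents, "children": children,
            "siblings": siblings, "spouses": spouses}


def average(dists):
    """Average a non-empty list of {label: probability} dictionaries."""
    total = defaultdict(float)
    for dist in dists:
        for label, prob in dist.items():
            total[label] += prob / len(dists)
    return dict(total)


def classify(page, links, page_dists, type_weights, own_weight):
    """Mix the page's own class distribution with its neighbors'.

    `type_weights` and `own_weight` are hypothetical tuning parameters,
    not values taken from the paper.
    """
    scores = defaultdict(float)
    for label, prob in page_dists[page].items():
        scores[label] += own_weight * prob
    for kind, members in neighbors(links, page).items():
        dists = [page_dists[n] for n in members if n in page_dists]
        if not dists:
            continue  # no information available from this neighbor type
        for label, prob in average(dists).items():
            scores[label] += (1 - own_weight) * type_weights[kind] * prob
    return max(scores, key=scores.get), dict(scores)
```

In this sketch a page whose own text weakly suggests one category can be pulled toward another when its siblings agree strongly, which mirrors the abstract's finding that sibling pages matter most. Giving siblings the largest entry in `type_weights` is only one way to express that finding.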


