Purely URL-based topic classification
Source
International World Wide Web Conference archive
Proceedings of the 18th International Conference on World Wide Web
Madrid, Spain
POSTER SESSION: Wednesday, April 22, 2009
Pages 1109-1110  
Year of Publication: 2009
ISBN: 978-1-60558-487-4
Authors
Eda Baykan  Ecole Polytechnique Fédérale de Lausanne, Lausanne, Switzerland
Monika Henzinger  Ecole Polytechnique Fédérale de Lausanne & Google Zürich, Lausanne, Switzerland
Ludmila Marian  Ecole Polytechnique Fédérale de Lausanne, Lausanne, Switzerland
Ingmar Weber  Ecole Polytechnique Fédérale de Lausanne, Lausanne, Switzerland
Sponsor
ACM: Association for Computing Machinery
Publisher
ACM  New York, NY, USA
DOI: http://doi.acm.org/10.1145/1526709.1526880

ABSTRACT

Given only the URL of a web page, can we identify its topic? This is the question that we examine in this paper. Usually, web pages are classified using their content, but a URL-only classifier is preferable (i) when speed is crucial, (ii) to enable content filtering before an (objectionable) web page is downloaded, (iii) when a page's content is hidden in images, (iv) to annotate hyperlinks in a personalized web browser without fetching the target page, and (v) when a focused crawler wants to infer the topic of a target page before devoting bandwidth to downloading it. We apply a machine learning approach to the topic identification task and evaluate its performance in extensive experiments on categorized web pages from the Open Directory Project (ODP). When training separate binary classifiers for each topic, we achieve typical F-measure values between 80 and 85, and a typical precision of around 85. We also ran experiments on a small data set of university web pages. For the task of classifying these pages into faculty, student, course and project pages, our methods improve over previous approaches by 13.8 points of F-measure.
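
The abstract does not say which features the classifiers use, so the following is only a minimal sketch of one plausible way to turn a URL into features for a per-topic binary classifier. It assumes two feature families, whole tokens and character n-grams; the function names, the stop-token list, and the choice of n = 4 are hypothetical and are not taken from the paper.

```python
import re
from collections import Counter
from urllib.parse import urlsplit

# Hypothetical list of URL tokens assumed to carry little topical signal.
STOP_TOKENS = {"http", "https", "www", "com", "org", "html", "htm", "index"}


def url_tokens(url):
    """Split a URL into lowercase alphabetic tokens, dropping stop tokens."""
    parts = urlsplit(url.lower())
    text = parts.netloc + parts.path + "?" + parts.query
    # Any run of non-letters (dots, slashes, digits, ...) acts as a delimiter.
    pieces = re.split(r"[^a-z]+", text)
    return [t for t in pieces if len(t) > 1 and t not in STOP_TOKENS]


def char_ngrams(token, n=4):
    """Return all character n-grams of a token (empty if the token is shorter than n)."""
    return [token[i:i + n] for i in range(len(token) - n + 1)]


def url_features(url, n=4):
    """Count token and n-gram features for one URL."""
    feats = Counter()
    for tok in url_tokens(url):
        feats["tok=" + tok] += 1
        for gram in char_ngrams(tok, n):
            feats["ng=" + gram] += 1
    return feats


if __name__ == "__main__":
    print(url_features("http://www.example.edu/courses/cs101/index.html"))
```

Feature counts of this kind could be fed to any standard binary classifier, one per topic, which matches the "separate binary classifiers for each topic" setup described above. No page content is fetched at any point.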


