Purely URL-based topic classification
Source
International World Wide Web Conference
Proceedings of the 18th international conference on World wide web
Madrid, Spain
POSTER SESSION: Wednesday, April 22, 2009
Pages 1109-1110
Year of Publication: 2009
ISBN: 978-1-60558-487-4
Authors
Eda Baykan, Ecole Polytechnique Fédérale de Lausanne, Lausanne, Switzerland
Monika Henzinger, Ecole Polytechnique Fédérale de Lausanne & Google Zürich, Lausanne, Switzerland
Ludmila Marian, Ecole Polytechnique Fédérale de Lausanne, Lausanne, Switzerland
Ingmar Weber, Ecole Polytechnique Fédérale de Lausanne, Lausanne, Switzerland
ABSTRACT
Given only the URL of a web page, can we identify its topic? This is the question that we examine in this paper. Web pages are usually classified using their content, but a URL-only classifier is preferable (i) when speed is crucial, (ii) to enable content filtering before an (objectionable) web page is downloaded, (iii) when a page's content is hidden in images, (iv) to annotate hyperlinks in a personalized web browser without fetching the target page, and (v) when a focused crawler wants to infer the topic of a target page before devoting bandwidth to downloading it. We apply a machine learning approach to the topic identification task and evaluate its performance in extensive experiments on categorized web pages from the Open Directory Project (ODP). When training a separate binary classifier for each topic, we achieve typical F-measure values between 80 and 85, and a typical precision of around 85. We also ran experiments on a small data set of university web pages. For the task of classifying these pages into faculty, student, course and project pages, our methods improve over previous approaches by 13.8 points of F-measure.
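As a rough illustration of the setting described in the abstract, the following Python sketch shows how a URL might be split into tokens and how precision, recall and F-measure are computed for one binary (per-topic) classifier. The tokenization rule, the ignored-token list and all function names are assumptions made for this example; the abstract does not specify the features or the learning algorithm the authors used.

```python
import re

# Assumption: protocol and host prefixes carry no topical signal.
IGNORED_TOKENS = {"http", "https", "www"}


def url_tokens(url):
    """Split a URL into lowercase alphabetic tokens (hypothetical feature extraction)."""
    parts = re.split(r"[^a-z]+", url.lower())
    return [p for p in parts if len(p) > 1 and p not in IGNORED_TOKENS]


def f_measure(precision, recall):
    """Harmonic mean of precision and recall (the balanced F-measure)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def evaluate(predicted, actual):
    """Return (precision, recall, F-measure) for one binary topic classifier.

    `predicted` and `actual` are equal-length lists of booleans, one entry
    per URL: True means "belongs to the topic".
    """
    tp = sum(1 for p, a in zip(predicted, actual) if p and a)
    fp = sum(1 for p, a in zip(predicted, actual) if p and not a)
    fn = sum(1 for p, a in zip(predicted, actual) if not p and a)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall, f_measure(precision, recall)


if __name__ == "__main__":
    # Hypothetical URL, used only to show the tokenizer.
    print(url_tokens("http://www.example.edu/faculty/index.html"))
    # Toy predictions and labels, not data from the paper.
    print(evaluate([True, True, False, False], [True, False, True, False]))
```

The abstract reports scores on a 0 to 100 scale, whereas the functions above work with fractions between 0 and 1. Since the F-measure is a harmonic mean, a precision of 0.85 together with an F-measure of 0.85 implies a recall of 0.85 as well.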