Fast webpage classification using URL features
Source
Conference on Information and Knowledge Management
Proceedings of the 14th ACM international conference on Information and knowledge management
Bremen, Germany
Poster Session
Pages: 325-326
Year of Publication: 2005
ISBN: 1-59593-140-6
Bibliometrics
Downloads (6 Weeks): 17, Downloads (12 Months): 138, Citation Count: 7
ABSTRACT
We demonstrate the usefulness of the uniform resource locator (URL) alone in performing web page classification. This approach is faster than typical web page classification, as the pages do not have to be fetched and analyzed. Our approach segments the URL into meaningful chunks and adds component, sequential and orthographic features to model salient patterns. The resulting features are used in supervised maximum entropy modeling. We analyze our approach's effectiveness on two standardized domains. Our results show that in certain scenarios, URL-based methods approach the performance of current state-of-the-art full-text and link-based methods.
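The feature extraction outlined in the abstract might be sketched as follows. This is a minimal illustration, not the authors' implementation: the tokenization rule, the feature names, and the example URL are all assumptions, since the abstract names only the three feature families (component, sequential, orthographic). The resulting feature strings could then be passed to any supervised maximum entropy (multinomial logistic regression) learner.

```python
import re
from urllib.parse import urlsplit

# Assumed segmentation rule: split into runs of letters or runs of digits.
TOKEN_RE = re.compile(r"[A-Za-z]+|[0-9]+")


def tokenize(text):
    """Break a URL component into lowercase alphabetic or numeric chunks."""
    return [t.lower() for t in TOKEN_RE.findall(text)]


def url_features(url):
    """Return a list of string features derived from the URL alone.

    All feature names here are hypothetical and chosen for readability.
    """
    parts = urlsplit(url)
    components = {
        "host": parts.hostname or "",
        "path": parts.path,
        "query": parts.query,
    }
    features = []
    for name, text in components.items():
        tokens = tokenize(text)
        for tok in tokens:
            # Plain token, plus a component feature recording where it occurred.
            features.append(f"tok={tok}")
            features.append(f"{name}:tok={tok}")
            # Orthographic feature: flag numeric chunks and their length.
            if tok.isdigit():
                features.append(f"{name}:numeric_len={len(tok)}")
        # Sequential features: adjacent token pairs within one component.
        for left, right in zip(tokens, tokens[1:]):
            features.append(f"{name}:bigram={left}_{right}")
    return features


if __name__ == "__main__":
    # Hypothetical URL used only for illustration.
    for feature in url_features("http://www.cs.example.edu/~smith/courses/cs101/index.html"):
        print(feature)
```

Because nothing is fetched, the cost per page is a handful of string operations, which is the source of the speed advantage the abstract claims over full-text methods.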
CITED BY 7
Max Hinne, Wessel Kraaij, Stephan Raaijmakers, Suzan Verberne, Theo van der Weide, Maarten van der Heijden. Annotation of URLs: more than the sum of parts. Proceedings of the 32nd international ACM SIGIR conference on Research and development in information retrieval, July 19-23, 2009, Boston, MA, USA.