Classifiers without borders: incorporating fielded text from neighboring web pages
Annual ACM Conference on Research and Development in Information Retrieval
Proceedings of the 31st annual international ACM SIGIR conference on Research and development in information retrieval
Singapore, Singapore
SESSION: Text classification
Pages 643-650
Year of Publication: 2008
ISBN: 978-1-60558-164-4
ABSTRACT
Accurate web page classification often depends crucially on information gained from neighboring pages in the local web graph. Prior work has exploited the class labels of nearby pages to improve performance. In contrast, in this work we utilize a weighted combination of the contents of neighbors to generate a better virtual document for classification. In addition, we break pages into fields, finding that a weighted combination of text from the target and fields of neighboring pages is able to reduce classification error by more than a third. We demonstrate performance on a large dataset of pages from the Open Directory Project and validate the approach using pages from a crawl from the Stanford WebBase. Interestingly, we find no value in anchor text and unexpected value in page titles (and especially titles of parent pages) in the virtual document.
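The virtual-document idea in the abstract can be illustrated with a minimal sketch. It assumes a bag-of-words representation and a simple weighted sum of term counts per (source, field) pair. The function names, the field names (`title`, `body`, `anchor`), the neighbor relation `parent`, and the weight values are all hypothetical choices for illustration; they are not the paper's actual configuration or tuned weights.

```python
from collections import Counter, defaultdict


def term_counts(text):
    """Lowercase whitespace tokenization; a stand-in for real preprocessing."""
    return Counter(text.lower().split())


def add_weighted(vdoc, text, weight):
    """Add the term counts of `text` to `vdoc`, scaled by `weight`."""
    if weight == 0:
        return  # a zero-weight field contributes nothing
    for term, count in term_counts(text).items():
        vdoc[term] += weight * count


def virtual_document(target_fields, neighbor_fields, weights):
    """Combine fielded text from a target page and its neighbors.

    target_fields:   {field_name: text} for the page being classified
    neighbor_fields: {relation: [{field_name: text}, ...]}, e.g. "parent"
    weights:         {(source, field_name): float}; missing pairs count as 0
    """
    vdoc = defaultdict(float)
    for field, text in target_fields.items():
        add_weighted(vdoc, text, weights.get(("target", field), 0.0))
    for relation, pages in neighbor_fields.items():
        for page in pages:
            for field, text in page.items():
                add_weighted(vdoc, text, weights.get((relation, field), 0.0))
    return dict(vdoc)


# Hypothetical weights: illustrative only, not values reported in the paper.
weights = {
    ("target", "body"): 1.0,
    ("parent", "title"): 2.0,
    ("parent", "anchor"): 0.0,
}
target = {"title": "Jazz", "body": "jazz music jazz"}
neighbors = {"parent": [{"title": "Music directory", "anchor": "click here"}]}

vdoc = virtual_document(target, neighbors, weights)
```

The resulting weighted term vector would then be passed to any standard text classifier in place of the target page's own text. Giving a field a weight of zero, as with the anchor text here, removes it from the virtual document entirely.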