Web page classification: Features and algorithms
Source
ACM Computing Surveys (CSUR)
Volume 41, Issue 2 (February 2009)
Article No. 12
Year of Publication: 2009
ISSN: 0360-0300
Authors
Xiaoguang Qi, Lehigh University, Bethlehem, PA
Brian D. Davison, Lehigh University, Bethlehem, PA
Publisher
ACM, New York, NY, USA
Bibliometrics
Downloads (6 Weeks): 530, Downloads (12 Months): 2570, Citation Count: 3
DOI: http://doi.acm.org/10.1145/1459352.1459357

ABSTRACT

Classification of Web page content is essential to many tasks in Web information retrieval such as maintaining Web directories and focused crawling. The uncontrolled nature of Web content presents additional challenges to Web page classification as compared to traditional text classification, but the interconnected nature of hypertext also provides features that can assist the process.

As we review work in Web page classification, we note the importance of these Web-specific features and algorithms, describe state-of-the-art practices, and track the underlying assumptions behind the use of information from neighboring pages.


