ABSTRACT
Classification of Web page content is essential to many tasks in Web information retrieval such as maintaining Web directories and focused crawling. The uncontrolled nature of Web content presents additional challenges to Web page classification as compared to traditional text classification, but the interconnected nature of hypertext also provides features that can assist the process. As we review work in Web page classification, we note the importance of these Web-specific features and algorithms, describe state-of-the-art practices, and track the underlying assumptions behind the use of information from neighboring pages.