Web-page classification through summarization
Source
Annual ACM Conference on Research and Development in Information Retrieval
Proceedings of the 27th annual international ACM SIGIR conference on Research and development in information retrieval
Sheffield, United Kingdom
SESSION: Text classification
Pages: 242 - 249
Year of Publication: 2004
ISBN: 1-58113-881-4
Authors
Dou Shen, Tsinghua University, Beijing, P.R. China
Zheng Chen, Microsoft Research Asia, Beijing, P.R. China
Qiang Yang, Hong Kong University of Science and Technology, Kowloon, Hong Kong
Hua-Jun Zeng, Microsoft Research Asia, Beijing, P.R. China
Benyu Zhang, Microsoft Research Asia, Beijing, P.R. China
Yuchang Lu, Tsinghua University, Beijing, P.R. China
Wei-Ying Ma, Microsoft Research Asia, Beijing, P.R. China
Bibliometrics
Downloads (6 Weeks): 27, Downloads (12 Months): 235, Citation Count: 18
ABSTRACT
Web-page classification is much more difficult than pure-text classification because of the wide variety of noisy information embedded in Web pages. In this paper, we propose a new Web-page classification algorithm that uses Web summarization to improve accuracy. We first give empirical evidence that ideal Web-page summaries generated by human editors can indeed improve the performance of Web-page classification algorithms. We then propose a new Web summarization-based classification algorithm and evaluate it, along with several other state-of-the-art text summarization algorithms, on the LookSmart Web directory. Experimental results show that our proposed summarization-based classification algorithm achieves an improvement of approximately 8.8% over pure-text-based classification. We further introduce an ensemble classifier that uses the improved summarization algorithm and show that it achieves an improvement of about 12.9% over pure-text-based methods.
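The summarize-then-classify idea in the abstract can be illustrated with a minimal sketch. It assumes a simple word-frequency sentence scorer and a multinomial naive Bayes classifier with add-one smoothing; neither is claimed to be the authors' summarizer or classifier, and the function names, toy pages, and labels are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and return its alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def summarize(page_text, max_sentences=2):
    """Keep the sentences whose words are, on average, most frequent in the page.

    Assumption: a plain frequency heuristic stands in for a real summarizer.
    """
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", page_text.strip()) if s]
    freq = Counter(tokenize(page_text))

    def score(sentence):
        words = tokenize(sentence)
        return sum(freq[w] for w in words) / len(words) if words else 0.0

    top = set(sorted(sentences, key=score, reverse=True)[:max_sentences])
    return " ".join(s for s in sentences if s in top)  # keep original order


class NaiveBayes:
    """Multinomial naive Bayes with add-one smoothing."""

    def fit(self, texts, labels):
        self.class_counts = Counter(labels)
        self.word_counts = {label: Counter() for label in self.class_counts}
        for text, label in zip(texts, labels):
            self.word_counts[label].update(tokenize(text))
        self.vocab = set().union(*self.word_counts.values())
        return self

    def predict(self, text):
        words = [w for w in tokenize(text) if w in self.vocab]  # skip unseen words
        total_docs = sum(self.class_counts.values())

        def log_posterior(label):
            counts = self.word_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            prior = math.log(self.class_counts[label] / total_docs)
            return prior + sum(math.log((counts[w] + 1) / denom) for w in words)

        return max(self.class_counts, key=log_posterior)


# Hypothetical toy pages: topical sentences plus a boilerplate line as "noise".
train_pages = [
    ("Football scores and match reports. The team won the league match. "
     "Click here to subscribe.", "sports"),
    ("Stock market news. Shares and stock prices rose as the market rallied. "
     "Click here to subscribe.", "finance"),
]

# Train on the summaries rather than on the full page text.
summaries = [summarize(text) for text, _ in train_pages]
labels = [label for _, label in train_pages]
clf = NaiveBayes().fit(summaries, labels)

new_page = ("The match ended late. The team celebrated the match win. "
            "Click here to subscribe.")
predicted = clf.predict(summarize(new_page))
print(predicted)
```

In this toy setup the summarizer drops the low-scoring boilerplate sentence before training and prediction, which is the kind of noise reduction the abstract argues for; a real system would need a far stronger summarizer and a held-out evaluation.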
CITED BY 19
Jian-Tao Sun, Dou Shen, Hua-Jun Zeng, Qiang Yang, Yuchang Lu, Zheng Chen, Web-page summarization using clickthrough data, Proceedings of the 28th annual international ACM SIGIR conference on Research and development in information retrieval, August 15-19, 2005, Salvador, Brazil
Aris Anagnostopoulos, Andrei Z. Broder, Evgeniy Gabrilovich, Vanja Josifovski, Lance Riedel, Just-in-time contextual advertising, Proceedings of the sixteenth ACM conference on information and knowledge management, November 06-10, 2007, Lisbon, Portugal
Jaehui Park, Tomohiro Fukuhara, Ikki Ohmukai, Hideaki Takeda, Sang-goo Lee, Web content summarization using social bookmarks: a new approach for social summarization, Proceeding of the 10th ACM workshop on Web information and data management, October 30, 2008, Napa Valley, California, USA
Yi Zhang, Arun C. Surendran, John C. Platt, Mukund Narasimhan, Learning from multi-topic web documents for contextual advertisement, Proceeding of the 14th ACM SIGKDD international conference on Knowledge discovery and data mining, August 24-27, 2008, Las Vegas, Nevada, USA
Liangda Li, Ke Zhou, Gui-Rong Xue, Hongyuan Zha, Yong Yu, Enhancing diversity, coverage and balance for summarization through structure learning, Proceedings of the 18th international conference on World wide web, April 20-24, 2009, Madrid, Spain
Dou Shen, Jian-Tao Sun, Hua Li, Qiang Yang, Zheng Chen, Document summarization using conditional random fields, Proceedings of the 20th international joint conference on Artificial intelligence, p.2862-2867, January 06-12, 2007, Hyderabad, India
REVIEW
Reviewer: Christoph F. Strnadl
Undoubtedly, information retrieval is pivotal to the continued success of the Web. One means of accomplishing this retrieval is by way of a significant categorization of Web pages (which is nontrivial, given the idiosyncratic Hypertext Markup Lang