Web-page classification through summarization
Source: Annual ACM Conference on Research and Development in Information Retrieval
Proceedings of the 27th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval
Sheffield, United Kingdom
Session: Text classification
Pages: 242-249
Year of Publication: 2004
ISBN: 1-58113-881-4
Authors
Dou Shen  Tsinghua University, Beijing, P.R. China
Zheng Chen  Microsoft Research Asia, Beijing, P.R. China
Qiang Yang  Hong Kong University of Science and Technology, Kowloon, Hong Kong
Hua-Jun Zeng  Microsoft Research Asia, Beijing, P.R. China
Benyu Zhang  Microsoft Research Asia, Beijing, P.R. China
Yuchang Lu  Tsinghua University, Beijing, P.R. China
Wei-Ying Ma  Microsoft Research Asia, Beijing, P.R. China
Sponsors
ACM: Association for Computing Machinery
SIGIR: ACM Special Interest Group on Information Retrieval
Publisher
ACM  New York, NY, USA
DOI: http://doi.acm.org/10.1145/1008992.1009035

ABSTRACT

Web-page classification is much more difficult than pure-text classification because of the large variety of noisy information embedded in Web pages. In this paper, we propose a new Web-page classification algorithm based on Web summarization to improve accuracy. We first give empirical evidence that ideal Web-page summaries generated by human editors can indeed improve the performance of Web-page classification algorithms. We then propose a new Web summarization-based classification algorithm and evaluate it along with several other state-of-the-art text summarization algorithms on the LookSmart Web directory. Experimental results show that our proposed summarization-based classification algorithm achieves an improvement of approximately 8.8% over a pure-text-based classification algorithm. We further introduce an ensemble classifier using the improved summarization algorithm and show that it achieves an improvement of about 12.9% over pure-text-based methods.
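The general idea in the abstract, summarizing a page to filter out noise and then classifying the summary, can be illustrated with a minimal sketch. This is not the authors' algorithm: the summarizer below is a simple word-frequency sentence scorer, the classifier is a small multinomial naive Bayes with add-one smoothing, and every name (`summarize`, `NaiveBayes`, `classify_page`, `STOPWORDS`) and every sample page is a hypothetical chosen only for illustration.

```python
import math
import re
from collections import Counter, defaultdict

# Tiny illustrative stopword list (an assumption, not taken from the paper).
STOPWORDS = {"the", "a", "an", "in", "to", "and", "of", "our", "here", "is", "for", "on"}


def tokenize(text):
    """Lowercase the text and keep alphabetic words that are not stopwords."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS]


def summarize(text, max_sentences=2):
    """Keep the sentences whose words are, on average, most frequent in the page."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    freq = Counter(tokenize(text))

    def score(sentence):
        words = tokenize(sentence)
        return sum(freq[w] for w in words) / len(words) if words else 0.0

    ranked = sorted(range(len(sentences)), key=lambda i: score(sentences[i]), reverse=True)
    keep = sorted(ranked[:max_sentences])  # restore original sentence order
    return " ".join(sentences[i] for i in keep)


class NaiveBayes:
    """Multinomial naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, texts, labels):
        self.word_counts = defaultdict(Counter)
        self.class_counts = Counter(labels)
        self.vocab = set()
        for text, label in zip(texts, labels):
            tokens = tokenize(text)
            self.word_counts[label].update(tokens)
            self.vocab.update(tokens)
        return self

    def predict(self, text):
        tokens = tokenize(text)
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_counts.items():
            denom = sum(self.word_counts[label].values()) + len(self.vocab)
            score = math.log(n_docs / total_docs)
            for w in tokens:
                score += math.log((self.word_counts[label][w] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


def classify_page(model, page_text, max_sentences=2):
    """Summarize the page first, then classify the summary instead of the full text."""
    return model.predict(summarize(page_text, max_sentences))


if __name__ == "__main__":
    # Hypothetical pages; the last sentence of each stands in for page noise.
    pages = [
        ("The team won the football match. The football team scored a late goal. "
         "Click here to subscribe to our newsletter.", "sports"),
        ("Bake the bread in a hot oven. The bread recipe needs flour and yeast. "
         "Click here to subscribe to our newsletter.", "cooking"),
    ]
    model = NaiveBayes().fit([summarize(text) for text, _ in pages],
                             [label for _, label in pages])
    print(classify_page(model, "A late goal decided the football match."))
```

Training and prediction both operate on summaries, so boilerplate sentences that appear on every page contribute less to the word statistics. A real system would need a far larger training set and a stronger summarizer than this frequency heuristic.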


REVIEW

Reviewer: Christoph F. Strnadl

Undoubtedly, information retrieval is pivotal to the continued success of the Web. One means of accomplishing this retrieval is by way of a significant categorization of Web pages (which is nontrivial, given the idiosyncratic Hypertext Markup Lang...
