| Using web structure for classifying and describing web pages |
| Full text |
Pdf
(136 KB)
|
| Source
|
International World Wide Web Conference
archive
Proceedings of the 11th international conference on World Wide Web
table of contents
Honolulu, Hawaii, USA
SESSION: Description and Analysis
table of contents
Pages: 562 - 569
Year of Publication: 2002
ISBN:1-58113-449-5
|
|
Authors
|
|
Eric J. Glover
|
NEC Research Institute, Princeton, NJ
|
|
Kostas Tsioutsiouliklis
|
NEC Research Institute, Princeton, NJ and Princeton University, Princeton, NJ
|
|
Steve Lawrence
|
NEC Research Institute, Princeton, NJ
|
|
David M. Pennock
|
NEC Research Institute, Princeton, NJ
|
|
Gary W. Flake
|
NEC Research Institute, Princeton, NJ
|
|
| Sponsors |
|
| Publisher |
|
| Bibliometrics |
Downloads (6 Weeks): 6, Downloads (12 Months): 74, Citation Count: 42
|
|
|
ABSTRACT
The structure of the web is increasingly being used to improve organization, search, and analysis of information on the web. For example, Google uses the text in citing documents (documents that link to the target document) for search. We analyze the relative utility of document text, and the text in citing documents near the citation, for classification and description. Results show that the text in citing documents, when available, often has greater discriminative and descriptive power than the text in the target document itself. The combination of evidence from a document and citing documents can improve on either information source alone. Moreover, by ranking words and phrases in the citing documents according to expected entropy loss, we are able to accurately name clusters of web pages, even with very few positive examples. Our results confirm, quantify, and extend previous research using web structure in these areas, introducing new methods for classification and description of pages.
REFERENCES
Note: OCR errors may be found in this Reference List extracted from the full text article. ACM has opted to expose the complete List rather than only correct and linked references.
| |
1
|
N. Abramson. Information Theory and Coding. McGraw-Hill, New York, 1963.
|
| |
2
|
G. Attardi, A. Gullí, and F. Sebastiani. Automatic Web page categorization by link and context analysis. In C. Hutchison and G. Lanzarone, editors, Proceedings of THAI-99, 1st European Symposium on Telematics, Hypermedia and Artificial Intelligence, pages 105--119, Varese, IT, 1999.
|
 |
3
|
|
| |
4
|
|
 |
5
|
|
| |
6
|
|
| |
7
|
|
 |
8
|
Gary William Flake , Steve Lawrence , C. Lee Giles, Efficient identification of Web communities, Proceedings of the sixth ACM SIGKDD international conference on Knowledge discovery and data mining, p.150-160, August 20-23, 2000, Boston, Massachusetts, United States
[doi> 10.1145/347090.347121]
|
| |
9
|
|
| |
10
|
Eric J. Glover , Gary W. Flake , Steve Lawrence , Andries Kruger , David M. Pennock , William P. Birmingham , C. Lee Giles, Improving Category Specific Web Search by Learning Query Modifications, Proceedings of the 2001 Symposium on Applications and the Internet (SAINT 2001), p.23, January 08-12, 2001
|
| |
11
|
|
| |
12
|
|
 |
13
|
|
| |
14
|
J. T.-Y. Kwok. Automated text categorization using support vector machine. In Proceedings of the International Conference on Neural Information Processing (ICONIP), pages 347--351, Kitakyushu, Japan, 1999.
|
| |
15
|
S. Lawrence and C. L. Giles. Accessibility of information on the web. Nature, 400(July 8):107--109, 1999.
|
| |
16
|
|
| |
17
|
|
| |
18
|
|
| |
19
|
|
| |
20
|
|
CITED BY 42
|
|
|
|
|
Ronald Fagin , Ravi Kumar , Kevin S. McCurley , Jasmine Novak , D. Sivakumar , John A. Tomlin , David P. Williamson, Searching the workplace web, Proceedings of the 12th international conference on World Wide Web, May 20-24, 2003, Budapest, Hungary
|
|
|
Einat Amitay , David Carmel , Adam Darlow , Ronny Lempel , Aya Soffer, The connectivity sonar: detecting site functionality by structural patterns, Proceedings of the fourteenth ACM conference on Hypertext and hypermedia, August 26-30, 2003, Nottingham, UK
|
|
|
Pável Calado , Marco Cristo , Edleno Moura , Nivio Ziviani , Berthier Ribeiro-Neto , Marcos André Gonçalves, Combining link-based and content-based methods for web document classification, Proceedings of the twelfth international conference on Information and knowledge management, November 03-08, 2003, New Orleans, LA, USA
|
|
|
|
|
|
Zheng Chen , Shengping Liu , Liu Wenyin , Geguang Pu , Wei-Ying Ma, Building a web thesaurus from web link structure, Proceedings of the 26th annual international ACM SIGIR conference on Research and development in informaion retrieval, July 28-August 01, 2003, Toronto, Canada
|
|
|
|
|
|
Dou Shen , Zheng Chen , Qiang Yang , Hua-Jun Zeng , Benyu Zhang , Yuchang Lu , Wei-Ying Ma, Web-page classification through summarization, Proceedings of the 27th annual international ACM SIGIR conference on Research and development in information retrieval, July 25-29, 2004, Sheffield, United Kingdom
|
|
|
|
|
|
Tien Nhut Nguyen , Ethan Vincent Munson , Cheng Thao, Fine-grained, structured configuration management for web projects, Proceedings of the 13th international conference on World Wide Web, May 17-20, 2004, New York, NY, USA
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
Baoping Zhang , Yuxin Chen , Weiguo Fan , Edward A. Fox , Marcos Gonçalves , Marco Cristo , Pável Calado, Intelligent GP fusion from multiple sources for text classification, Proceedings of the 14th ACM international conference on Information and knowledge management, October 31-November 05, 2005, Bremen, Germany
|
|
|
Thierson Couto , Marco Cristo , Marcos André Gonçalves , Pável Calado , Nivio Ziviani , Edleno Moura , Berthier Ribeiro-Neto, A comparative study of citations and links in document classification, Proceedings of the 6th ACM/IEEE-CS joint conference on Digital libraries, June 11-15, 2006, Chapel Hill, NC, USA
|
|
|
Eric Glover , David M. Pennock , Steve Lawrence , Robert Krovetz, Inferring hierarchical descriptions, Proceedings of the eleventh international conference on Information and knowledge management, November 04-09, 2002, McLean, Virginia, USA
|
|
|
|
|
|
|
|
|
Jian-Tao Sun , Ben-Yu Zhang , Zheng Chen , Yu-Chang Lu , Chun-Yi Shi , Wei-Ying Ma, GE-CKO: A Method to Optimize Composite Kernels for Web Page Classification, Proceedings of the 2004 IEEE/WIC/ACM International Conference on Web Intelligence, p.299-305, September 20-24, 2004
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
Gui-Rong Xue , Yong Yu , Dou Shen , Qiang Yang , Hua-Jun Zeng , Zheng Chen, Reinforcing Web-object Categorization Through Interrelationships, Data Mining and Knowledge Discovery, v.12 n.2-3, p.229-248, May 2006
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
Xiaoxun Zhang , Xueying Wang , Honglei Guo , Zhili Guo , Xian Wu , Zhong Su, Floatcascade learning for fast imbalanced web mining, Proceeding of the 17th international conference on World Wide Web, April 21-25, 2008, Beijing, China
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|