Entropy-based link analysis for mining web informative structures
Source
Conference on Information and Knowledge Management
Proceedings of the eleventh international conference on Information and knowledge management
McLean, Virginia, USA
SESSION: Web search 2
Pages: 574 - 581
Year of Publication: 2002
ISBN: 1-58113-492-4
Authors
Hung-Yu Kao, National Taiwan University, Taipei, Taiwan, ROC
Ming-Syan Chen, National Taiwan University, Taipei, Taiwan, ROC
Shian-Hua Lin, Academia Sinica, Taipei, Taiwan, ROC
Jan-Ming Ho, Academia Sinica, Taipei, Taiwan, ROC
ABSTRACT
In this paper, we study the problem of mining the informative structure of a news Web site that consists of thousands of hyperlinked documents. We define the informative structure of a news Web site as a set of index pages (also called TOC, i.e., table-of-contents, pages) and a set of article pages linked from TOC pages through informative links. The Hyperlink-Induced Topic Search (HITS) algorithm has been used to analyze the authorities and hubs among pages. However, most content sites contain extra hyperlinks, such as navigation panels, advertisements, and banners, that add value to their Web pages. Because of the structure induced by these extra hyperlinks, HITS does not achieve good precision on this problem. To remedy this, we develop an algorithm that uses entropy-based Link Analysis on Mining Web Informative Structures, referred to as LAMIS. The key idea of LAMIS is to use information entropy to represent the amount of information carried by a link or a page in the link analysis. Experiments on several real news Web sites show that the precision and recall of LAMIS are much superior to those obtained by heuristic methods and conventional link analysis methods.
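The abstract does not give the LAMIS formulas, so the following is only a minimal sketch of the general idea, not the authors' algorithm. It assumes (hypothetically) that a link's information content can be approximated as one minus the normalized Shannon entropy of how its target is distributed across source pages, so a target linked from almost every page, such as a navigation entry, gets a low weight. Those weights then scale the usual HITS hub and authority updates. The function names `link_entropy_weights` and `weighted_hits` and the toy `site` graph are illustrative assumptions.

```python
import math
from collections import Counter, defaultdict


def link_entropy_weights(pages):
    """Give each link target a weight in [0, 1] based on entropy.

    ASSUMPTION (not taken from the paper): weight = 1 - H / log(N), where H is
    the Shannon entropy of the target's occurrences over source pages and N is
    the number of source pages. Targets linked from everywhere score near 0.
    """
    n_pages = len(pages)
    # occurrences[target][source] = how many times `source` links to `target`
    occurrences = defaultdict(Counter)
    for source, targets in pages.items():
        for target in targets:
            occurrences[target][source] += 1

    weights = {}
    for target, per_source in occurrences.items():
        total = sum(per_source.values())
        entropy = -sum((c / total) * math.log(c / total)
                       for c in per_source.values())
        # Normalize by the maximum possible entropy, log(number of pages).
        normalized = entropy / math.log(n_pages) if n_pages > 1 else 0.0
        weights[target] = max(0.0, 1.0 - normalized)
    return weights


def _normalize(scores):
    """Scale a score dict to unit Euclidean length (no-op if all zero)."""
    norm = math.sqrt(sum(v * v for v in scores.values())) or 1.0
    return {node: v / norm for node, v in scores.items()}


def weighted_hits(pages, weights, iterations=50):
    """HITS power iteration where each link is scaled by its target's weight.

    Targets missing from `weights` default to 1.0, which gives plain HITS.
    """
    nodes = set(pages) | {t for targets in pages.values() for t in targets}
    hub = {node: 1.0 for node in nodes}
    auth = {node: 1.0 for node in nodes}
    for _ in range(iterations):
        # Authority update: sum of weighted hub scores of pages linking in.
        new_auth = {node: 0.0 for node in nodes}
        for source, targets in pages.items():
            for target in targets:
                new_auth[target] += weights.get(target, 1.0) * hub[source]
        auth = _normalize(new_auth)
        # Hub update: sum of weighted authority scores of pages linked to.
        new_hub = {node: 0.0 for node in nodes}
        for source, targets in pages.items():
            for target in targets:
                new_hub[source] += weights.get(target, 1.0) * auth[target]
        hub = _normalize(new_hub)
    return hub, auth


# Hypothetical toy news site: one TOC page, three articles, and a "home"
# link that appears on the TOC page and every article (navigation).
site = {
    "toc": ["a1", "a2", "a3", "home"],
    "a1": ["home"],
    "a2": ["home"],
    "a3": ["home"],
    "home": ["toc"],
}

weights = link_entropy_weights(site)
hub, auth = weighted_hits(site, weights)
```

In this toy graph the navigation target `home` receives a much lower weight than the article links, so the weighted run ranks `toc` as the top hub and the articles above `home` as authorities. Passing an empty `weights` dict falls back to plain HITS, where `home` becomes the top authority, which mirrors the precision problem the abstract attributes to extra hyperlinks.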