ABSTRACT
Topic distillation is the analysis of hyperlink graph structure to identify mutually reinforcing authorities (popular pages) and hubs (comprehensive lists of links to authorities). It is becoming common in Web search engines, but the best-known algorithms model the Web graph at a coarse grain, with whole pages as single nodes. Such models may lose vital details in the markup tag structure of the pages, and can thus let a tightly linked irrelevant subgraph win over a relatively sparse relevant subgraph, a phenomenon called topic drift or contamination. The problem becomes especially severe with increasingly complex pages that carry navigation panels and advertisement links. We present an enhanced topic distillation algorithm that analyzes text, the markup tag trees that constitute HTML pages, and hyperlinks between pages. It thereby identifies subtrees that have high text- and hyperlink-based coherence with respect to the query. These subtrees get preferential treatment in the mutual reinforcement process. Using over 50 queries, 28 of them from earlier topic distillation work, we analyzed over 700,000 pages and obtained quantitative and anecdotal evidence that the new algorithm reduces topic drift.
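For orientation, the coarse-grained mutual reinforcement that the abstract contrasts against can be sketched as a plain hub/authority iteration in which every page is a single node. This is a minimal illustration of that page-level baseline, not the enhanced subtree-aware algorithm presented here; the function name `hits`, the `iterations` parameter, and the toy edge list are all assumptions made for the example.

```python
import math

def hits(edges, iterations=50):
    """Page-level hub/authority iteration (hypothetical helper).

    edges: list of (source_page, target_page) hyperlinks.
    Returns two dicts: hub scores and authority scores.
    """
    nodes = sorted({n for edge in edges for n in edge})
    hub = {n: 1.0 for n in nodes}
    auth = {n: 1.0 for n in nodes}
    for _ in range(iterations):
        # Authority score: sum of hub scores of the pages linking in.
        auth = {n: 0.0 for n in nodes}
        for src, dst in edges:
            auth[dst] += hub[src]
        # Hub score: sum of authority scores of the pages linked to.
        new_hub = {n: 0.0 for n in nodes}
        for src, dst in edges:
            new_hub[src] += auth[dst]
        hub = new_hub
        # Normalize both vectors to unit L2 length so scores stay bounded.
        for scores in (auth, hub):
            norm = math.sqrt(sum(v * v for v in scores.values())) or 1.0
            for n in scores:
                scores[n] /= norm
    return hub, auth

# Toy graph (made up): three link lists pointing at two target pages.
edges = [("h1", "a1"), ("h1", "a2"), ("h2", "a1"), ("h2", "a2"), ("h3", "a2")]
hub, auth = hits(edges)
```

Because each page is one node here, every outgoing link on a page shares the same hub score, including links in navigation panels and advertisements. That is the coarseness the abstract identifies as a source of topic drift.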
CITED BY 36
Hung-Yu Kao, Ming-Syan Chen, Shian-Hua Lin, Jan-Ming Ho, Entropy-based link analysis for mining web informative structures, Proceedings of the eleventh international conference on Information and knowledge management, November 04-09, 2002, McLean, Virginia, USA
Gui-Rong Xue, Qiang Yang, Hua-Jun Zeng, Yong Yu, Zheng Chen, Exploiting the hierarchical structure for link analysis, Proceedings of the 28th annual international ACM SIGIR conference on Research and development in information retrieval, August 15-19, 2005, Salvador, Brazil
Mingfang Wu, Gheorghe Muresan, Alistair McLean, Muh-Chyun (Morris) Tang, Ross Wilkinson, Yuelin Li, Hyuk-Jin Lee, Nicholas J. Belkin, Human versus machine in the topic distillation task, Proceedings of the 27th annual international ACM SIGIR conference on Research and development in information retrieval, July 25-29, 2004, Sheffield, United Kingdom
Tao Qin, Tie-Yan Liu, Xu-Dong Zhang, Zheng Chen, Wei-Ying Ma, A study of relevance propagation for web search, Proceedings of the 28th annual international ACM SIGIR conference on Research and development in information retrieval, August 15-19, 2005, Salvador, Brazil
Karane Vieira, Altigran S. da Silva, Nick Pinto, Edleno S. de Moura, João M. B. Cavalcanti, Juliana Freire, A fast and robust method for web page template detection and removal, Proceedings of the 15th ACM international conference on Information and knowledge management, November 06-11, 2006, Arlington, Virginia, USA
Tao Qin, Tie-Yan Liu, Xu-Dong Zhang, Guang Feng, De-Sheng Wang, Wei-Ying Ma, Topic distillation via sub-site retrieval, Information Processing and Management: an International Journal, v.43 n.2, p.445-460, March 2007
David Fernandes, Edleno S. de Moura, Berthier Ribeiro-Neto, Altigran S. da Silva, Marcos André Gonçalves, Computing block importance for searching on web sites, Proceedings of the sixteenth ACM conference on information and knowledge management, November 06-10, 2007, Lisbon, Portugal
Karane Vieira, André Luiz Costa Carvalho, Klessius Berlt, Edleno S. Moura, Altigran S. Silva, Juliana Freire, On Finding Templates on Web Collections, World Wide Web, v.12 n.2, p.171-211, June 2009
K. Selçuk Candan, Mehmet E. Dönderler, Terri Hedgpeth, Jong Wook Kim, Qing Li, Maria Luisa Sapino, SEA: Segment-enrich-annotate paradigm for adapting dialog-based content for improved accessibility, ACM Transactions on Information Systems (TOIS), v.27 n.3, p.1-45, May 2009