Enhanced topic distillation using text, markup tags, and hyperlinks
Source: Annual ACM Conference on Research and Development in Information Retrieval
Proceedings of the 24th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval
New Orleans, Louisiana, United States
Pages: 208 - 216  
Year of Publication: 2001
ISBN:1-58113-331-6
Authors
Soumen Chakrabarti  IIT Bombay, India
Mukul Joshi  IIT Bombay, India
Vivek Tawde  IIT Bombay, India
Sponsor
SIGIR: ACM Special Interest Group on Information Retrieval
Publisher
ACM  New York, NY, USA
DOI: http://doi.acm.org/10.1145/383952.383990

ABSTRACT

Topic distillation is the analysis of hyperlink graph structure to identify mutually reinforcing authorities (popular pages) and hubs (comprehensive lists of links to authorities). Topic distillation is becoming common in Web search engines, but the best-known algorithms model the Web graph at a coarse grain, with whole pages as single nodes. Such models may lose vital details in the markup tag structure of the pages, and thus lead to a tightly linked irrelevant subgraph winning over a relatively sparse relevant subgraph, a phenomenon called topic drift or contamination. The problem gets especially severe in the face of increasingly complex pages with navigation panels and advertisement links. We present an enhanced topic distillation algorithm which analyzes text, the markup tag trees that constitute HTML pages, and hyperlinks between pages. It thereby identifies subtrees which have high text- and hyperlink-based coherence w.r.t. the query. These subtrees get preferential treatment in the mutual reinforcement process. Using over 50 queries, 28 from earlier topic distillation work, we analyzed over 700,000 pages and obtained quantitative and anecdotal evidence that the new algorithm reduces topic drift.
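The mutual reinforcement between hubs and authorities described in the abstract can be illustrated with a minimal sketch of the classic coarse-grained iteration, in which each whole page is a single node. This is not the enhanced algorithm the paper presents (which also uses text and markup tag trees); it only shows the baseline idea. The function names `hits` and `normalize`, the toy edge list, and the iteration count are all assumptions made for illustration.

```python
from math import sqrt


def normalize(scores):
    """Scale scores to unit Euclidean length (no-op if all scores are zero)."""
    norm = sqrt(sum(v * v for v in scores.values())) or 1.0
    return {k: v / norm for k, v in scores.items()}


def hits(edges, iterations=50):
    """Coarse-grained hub/authority iteration with whole pages as nodes.

    edges: list of (source_page, target_page) hyperlinks.
    Returns two dicts: hub scores and authority scores.
    """
    nodes = sorted({n for edge in edges for n in edge})
    hub = {n: 1.0 for n in nodes}
    auth = {n: 1.0 for n in nodes}
    for _ in range(iterations):
        # A page's authority is the sum of hub scores of pages linking to it.
        new_auth = {n: 0.0 for n in nodes}
        for src, dst in edges:
            new_auth[dst] += hub[src]
        # A page's hub score is the sum of authority scores of pages it links to.
        new_hub = {n: 0.0 for n in nodes}
        for src, dst in edges:
            new_hub[src] += new_auth[dst]
        auth = normalize(new_auth)
        hub = normalize(new_hub)
    return hub, auth


# Hypothetical toy graph: two link lists and one stray page.
edges = [
    ("hub1", "a"), ("hub1", "b"),
    ("hub2", "a"), ("hub2", "b"), ("hub2", "c"),
    ("stray", "c"),
]
hub, auth = hits(edges)
best_hub = max(hub, key=hub.get)
```

In this toy graph, pages `a` and `b` are cited by both link lists and so reinforce each other through them, while `c` depends partly on a weak hub. Because every link on a page counts equally here, a page-level model like this one cannot tell navigation panels or advertisement links from on-topic links, which is the coarseness the abstract identifies as a cause of topic drift.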


