Evaluating strategies for similarity search on the web
Source: International World Wide Web Conference
Proceedings of the 11th International Conference on World Wide Web
Honolulu, Hawaii, USA
Session: Search 2
Pages: 432-442
Year of Publication: 2002
ISBN:1-58113-449-5
Authors
Taher H. Haveliwala, Stanford University, Stanford, CA
Aristides Gionis, Stanford University, Stanford, CA
Dan Klein, Stanford University, Stanford, CA
Piotr Indyk, Laboratory of Computer Science, Cambridge, MA
Sponsors
ACM: Association for Computing Machinery
: WWW'02
Publisher
ACM, New York, NY, USA
Bibliometrics
Downloads (6 Weeks): 12,   Downloads (12 Months): 97,   Citation Count: 39
DOI: http://doi.acm.org/10.1145/511446.511502

ABSTRACT

Finding pages on the Web that are similar to a query page (Related Pages) is an important component of modern search engines. A variety of strategies have been proposed for answering Related Pages queries, but comparative evaluation by user studies is expensive, especially when large strategy spaces must be searched (e.g., when tuning parameters). We present a technique for automatically evaluating strategies using Web hierarchies, such as Open Directory, in place of user feedback. We apply this evaluation methodology to a mix of document representation strategies, including the use of text, anchor-text, and links. We discuss the relative advantages and disadvantages of the various approaches examined. Finally, we describe how to efficiently construct a similarity index out of our chosen strategies, and provide sample results from our index.

