ABSTRACT
Finding pages on the Web that are similar to a query page (Related Pages) is an important component of modern search engines. A variety of strategies have been proposed for answering Related Pages queries, but comparative evaluation by user studies is expensive, especially when large strategy spaces must be searched (e.g., when tuning parameters). We present a technique for automatically evaluating strategies using Web hierarchies, such as Open Directory, in place of user feedback. We apply this evaluation methodology to a mix of document representation strategies, including the use of text, anchor-text, and links. We discuss the relative advantages and disadvantages of the various approaches examined. Finally, we describe how to efficiently construct a similarity index out of our chosen strategies, and provide sample results from our index.
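As a rough illustration of evaluating a strategy against a Web hierarchy rather than user feedback, the sketch below treats two pages as more similar the deeper their shared directory path, and scores a strategy by how often its pairwise ordering of candidate pages agrees with that hierarchy-derived ordering. Everything here is an assumption made for illustration: the toy category paths and word sets, the helper names (`shared_prefix_depth`, `ordering_agreement`, `jaccard`), and the choice of a Goodman-Kruskal-style gamma as the agreement measure. It is not presented as the procedure used in the paper.

```python
from itertools import combinations


def shared_prefix_depth(path_a, path_b):
    """Count the leading category components two directory paths share."""
    depth = 0
    for a, b in zip(path_a.split("/"), path_b.split("/")):
        if a != b:
            break
        depth += 1
    return depth


def ordering_agreement(query, candidates, categories, strategy_score):
    """Gamma-style agreement between hierarchy order and strategy order.

    Returns (concordant - discordant) / (concordant + discordant) over all
    candidate pairs, skipping pairs tied under either ordering.
    """
    concordant = discordant = 0
    for p, q in combinations(candidates, 2):
        truth = (shared_prefix_depth(categories[query], categories[p])
                 - shared_prefix_depth(categories[query], categories[q]))
        guess = strategy_score(query, p) - strategy_score(query, q)
        if truth == 0 or guess == 0:
            continue  # tied pairs carry no ordering information
        if (truth > 0) == (guess > 0):
            concordant += 1
        else:
            discordant += 1
    total = concordant + discordant
    return (concordant - discordant) / total if total else 0.0


# Hypothetical directory placements (stand-ins for real hierarchy data).
categories = {
    "q": "Arts/Music/Jazz",
    "a": "Arts/Music/Jazz",
    "b": "Arts/Music/Rock",
    "c": "Science/Physics",
}

# Hypothetical bag-of-words representations of the same pages.
docs = {
    "q": {"jazz", "saxophone", "band"},
    "a": {"jazz", "band", "trumpet"},
    "b": {"rock", "band", "guitar"},
    "c": {"quantum", "physics"},
}


def jaccard(x, y):
    """Text-overlap strategy: Jaccard similarity of the two word sets."""
    return len(docs[x] & docs[y]) / len(docs[x] | docs[y])


def reversed_jaccard(x, y):
    """Deliberately bad strategy that inverts the Jaccard ordering."""
    return -jaccard(x, y)


good = ordering_agreement("q", ["a", "b", "c"], categories, jaccard)
bad = ordering_agreement("q", ["a", "b", "c"], categories, reversed_jaccard)
print(good, bad)
```

Because the score needs only directory placements and the strategy's own similarity values, many strategy variants or parameter settings can be compared in a loop without collecting human judgments for each one.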