|
ABSTRACT
To exploit the similarity information hidden in the hyperlink structure of the web, this paper introduces algorithms scalable to graphs with billions of vertices on a distributed architecture. The similarity of multi-step neighborhoods of vertices are numerically evaluated by similarity functions including SimRank [20], a recursive refinement of cocitation; PSimRank, a novel variant with better theoretical characteristics; and the Jaccard coefficient, extended to multi-step neighborhoods. Our methods are presented in a general framework of Monte Carlo similarity search algorithms that precompute an index database of random fingerprints, and at query time, similarities are estimated from the fingerprints. The performance and quality of the methods were tested on the Stanford Webbase [19] graph of 80M pages by comparing our scores to similarities extracted from the ODP directory [26]. Our experimental results suggest that the hyperlink structure of vertices within four to five steps provide more adequate information for similarity search than single-step neighborhoods.
REFERENCES
Note: OCR errors may be found in this Reference List extracted from the full text article. ACM has opted to expose the complete List rather than only correct and linked references.
| |
1
|
|
| |
2
|
|
 |
3
|
Ziv Bar-Yossef , Andrei Z. Broder , Ravi Kumar , Andrew Tomkins, Sic transit gloria telae: towards an understanding of the web's decay, Proceedings of the 13th international conference on World Wide Web, May 17-20, 2004, New York, NY, USA
[doi> 10.1145/988672.988716]
|
| |
4
|
Andrei Broder , Ravi Kumar , Farzin Maghoul , Prabhakar Raghavan , Sridhar Rajagopalan , Raymie Stata , Andrew Tomkins , Janet Wiener, Graph structure in the Web, Proceedings of the 9th international World Wide Web conference on Computer networks : the international journal of computer and telecommunications netowrking, p.309-320, June 2000, Amsterdam, The Netherlands
|
| |
5
|
|
| |
6
|
|
| |
7
|
Andrei Z. Broder , Steven C. Glassman , Mark S. Manasse , Geoffrey Zweig, Syntactic clustering of the Web, Selected papers from the sixth international conference on World Wide Web, p.1157-1166, September 1997, Santa Clara, California, United States
|
 |
8
|
Soumen Chakrabarti , Byron Dom , Piotr Indyk, Enhanced hypertext categorization using hyperlinks, Proceedings of the 1998 ACM SIGMOD international conference on Management of data, p.307-318, June 01-04, 1998, Seattle, Washington, United States
|
 |
9
|
|
| |
10
|
|
| |
11
|
|
| |
12
|
D. Fogaras and B. Rácz. A scalable randomized method to compute link-based similarity rank on the web graph. Proc. of Clustering Information over the Web (ClustWeb), in conjunction with EDBT, 2004.
|
| |
13
|
D. Fogaras and B. Rácz. Scaling link-based similarity search. Technical report, MTA SZTAKI, 2004. http://www.ilab.sztaki.hu/websearch/Publications/.
|
| |
14
|
D. Fogaras and B. Rácz. Towards scaling fully personalized PageRank. Proc. of Third Workshop on Algorithms and Models for the Web-Graph (WAW) in conjunction with FOCS, 2004.
|
| |
15
|
T. H. Haveliwala. Efficient computation of PageRank. Technical Report 1999-31, Stanford University, 1999.
|
 |
16
|
Taher H. Haveliwala , Aristides Gionis , Dan Klein , Piotr Indyk, Evaluating strategies for similarity search on the web, Proceedings of the 11th international conference on World Wide Web, May 07-11, 2002, Honolulu, Hawaii, USA
[doi> 10.1145/511446.511502]
|
| |
17
|
T. H. Haveliwala, S. Kamvar, and G. Jeh. An analytical comparison of approaches to personalizing PageRank. Technical report, Stanford University, 2003.
|
| |
18
|
|
| |
19
|
|
 |
20
|
|
 |
21
|
|
 |
22
|
|
| |
23
|
Wangzhong Lu , Jeannette Janssen , Evangelos Milios , Nathalie Japkowicz, Node similarity in networked information spaces, Proceedings of the 2001 conference of the Centre for Advanced Studies on Collaborative research, p.11, November 05-07, 2001, Toronto, Ontario, Canada
|
| |
24
|
|
| |
25
|
|
| |
26
|
Open Directory Project (ODP). http://www.dmoz.org.
|
| |
27
|
L. Page, S. Brin, R. Motwani, and T. Winograd. The PageRank citation ranking: Bringing order to the web. Technical report, Stanford Digital Library Technologies Project, 1998.
|
 |
28
|
|
| |
29
|
P. Rusmevichientong, D. M. Pennock, S. Lawrence, and C. L. Giles. Methods for sampling pages uniformly from the world wide web. Proc. of AAAI Fall Symposium on Using Uncertainty Within Computation, pages 121--128, 2001.
|
CITED BY 9
|
|
|
|
|
|
|
|
|
|
|
|
|
|
Tamás Sarlós , Adrás A. Benczúr , Károly Csalogány , Dániel Fogaras , Balázs Rácz, To randomize or not to randomize: space optimal summaries for hyperlink analysis, Proceedings of the 15th international conference on World Wide Web, May 23-26, 2006, Edinburgh, Scotland
|
|
|
|
|
|
|
|
|
Luca Becchetti , Paolo Boldi , Carlos Castillo , Aristides Gionis, Efficient semi-streaming algorithms for local triangle counting in massive graphs, Proceeding of the 14th ACM SIGKDD international conference on Knowledge discovery and data mining, August 24-27, 2008, Las Vegas, Nevada, USA
|
|
|
|
|