|
ABSTRACT
The World Wide Web (WWW) is rapidly becoming important for society as a medium for sharing data, information, and services, and there is a growing interest in tools for understanding collective behavior and emerging phenomena in the WWW. In this article we focus on the problem of searching and classifying communities in the Web. Loosely speaking a community is a group of pages related to a common interest. More formally, communities have been associated in the computer science literature with the existence of a locally dense subgraph of the Web graph (where Web pages are nodes and hyperlinks are arcs of the Web graph). The core of our contribution is a new scalable algorithm for finding relatively dense subgraphs in massive graphs. We apply our algorithm on Web graphs built on three publicly available large crawls of the Web (with raw sizes up to 120M nodes and 1G arcs). The effectiveness of our algorithm in finding dense subgraphs is demonstrated experimentally by embedding artificial communities in the Web graph and counting how many of these are blindly found. Effectiveness increases with the size and density of the communities: it is close to 100% for communities of thirty nodes or more (even at low density). It is still about 80% even for communities of twenty nodes with density over 50% of the arcs present. At the lower extremes the algorithm catches 35% of dense communities made of ten nodes. We also develop some sufficient conditions for the detection of a community under some local graph models and not-too-restrictive hypotheses. We complete our Community Watch system by clustering the communities found in the Web graph into homogeneous groups by topic and labeling each group by representative keywords.
REFERENCES
Note: OCR errors may be found in this Reference List extracted from the full text article. ACM has opted to expose the complete List rather than only correct and linked references.
| |
1
|
|
 |
2
|
Einat Amitay , David Carmel , Adam Darlow , Ronny Lempel , Aya Soffer, The connectivity sonar: detecting site functionality by structural patterns, Proceedings of the fourteenth ACM conference on Hypertext and hypermedia, August 26-30, 2003, Nottingham, UK
[doi> 10.1145/900051.900060]
|
| |
3
|
|
 |
4
|
|
| |
5
|
|
 |
6
|
|
| |
7
|
Andrei Broder , Ravi Kumar , Farzin Maghoul , Prabhakar Raghavan , Sridhar Rajagopalan , Raymie Stata , Andrew Tomkins , Janet Wiener, Graph structure in the Web, Computer Networks: The International Journal of Computer and Telecommunications Networking, v.33 n.1-6, p.309-320, June 2000
|
| |
8
|
|
| |
9
|
Andrei Z. Broder , Steven C. Glassman , Mark S. Manasse , Geoffrey Zweig, Syntactic clustering of the Web, Selected papers from the sixth international conference on World Wide Web, p.1157-1166, September 1997, Santa Clara, California, United States
|
| |
10
|
Capocci, A., Servedio, V. D. P., Caldarelli, G., and Colaiori, F. 2004. Communities detection in large networks. In Proceedings of the Algorithms and Models for the Web graph (WAW'04): Third International Workshop. 181--188.
|
| |
11
|
Soumen Chakrabarti , Byron E. Dom , S. Ravi Kumar , Prabhakar Raghavan , Sridhar Rajagopalan , Andrew Tomkins , David Gibson , Jon Kleinberg, Mining the Web's Link Structure, Computer, v.32 n.8, p.60-67, August 1999
[doi> 10.1109/2.781636]
|
| |
12
|
Cho, J. and Garcia-Molina, H. 2000. WebBase and the Stanford interlib project. In Proceedings of the Kyoto International Conference on Digital Libraries: Research and Practice.
|
| |
13
|
|
 |
14
|
|
| |
15
|
|
 |
16
|
|
| |
17
|
|
| |
18
|
Feige, U., Peleg, D., and Kortsarz, G. 2001. The dense k-subgraph problem. Algorithmica 29, 3, 410--421.
|
 |
19
|
Gary William Flake , Steve Lawrence , C. Lee Giles, Efficient identification of Web communities, Proceedings of the sixth ACM SIGKDD international conference on Knowledge discovery and data mining, p.150-160, August 20-23, 2000, Boston, Massachusetts, United States
[doi> 10.1145/347090.347121]
|
| |
20
|
|
| |
21
|
Geraci, F., Pellegrini, M., Maggini, M., and Sebastiani, F. 2007. Cluster generation and labeling for web snippets:a fast, accurate hierarchical solution. Internet Math. 3, 4, 413--443.
|
 |
22
|
David Gibson , Jon Kleinberg , Prabhakar Raghavan, Inferring Web communities from link topology, Proceedings of the ninth ACM conference on Hypertext and hypermedia : links, objects, time and space---structure in hypermedia systems: links, objects, time and space---structure in hypermedia systems, p.225-234, June 20-24, 1998, Pittsburgh, Pennsylvania, United States
[doi> 10.1145/276627.276652]
|
| |
23
|
|
| |
24
|
Girvan, M. and Newman, M. E. J. 2002. Community structure in social and biological networks. Proceedings of the National Academic Science, 7821--7826.
|
 |
25
|
Eric J. Glover , Kostas Tsioutsiouliklis , Steve Lawrence , David M. Pennock , Gary W. Flake, Using web structure for classifying and describing web pages, Proceedings of the 11th international conference on World Wide Web, May 07-11, 2002, Honolulu, Hawaii, USA
[doi> 10.1145/511446.511520]
|
 |
26
|
|
| |
27
|
Gyöngyi, Z. and Garcia-Molina, H. 2005. Web spam taxonomy. In Proceedings of the 1st International Workshop on Adversarial Information Retrieval on the Web.
|
| |
28
|
Hastad, J. 1999. Clique is hard to approximate within n1−ε. Acta Mathematica 182, 105--142.
|
| |
29
|
Haveliwala, T. H., Gionis, A., and Indyk, P. 2000. Scalable techniques for clustering the web. In Proceedings of the WebDB Workshop. 129--134.
|
| |
30
|
Henzinger, M. 2002. Algorithmic challenges in Web search engines. Internet Math. 1, 1, 115--126.
|
| |
31
|
|
 |
32
|
|
 |
33
|
|
| |
34
|
|
| |
35
|
|
| |
36
|
Kumar, R., Raghavan, P., Rajagopalan, S., and Tomkins, A. 2005. Method and system for trawling the world-wide Web to identify implicitly-defined communities of Web pages. US patent 6886129.
|
| |
37
|
|
| |
38
|
|
 |
39
|
|
| |
40
|
Newman, M. 2003. The structure and function of complex networks. SIAM Rev. 45, 2, 167--256.
|
| |
41
|
|
| |
42
|
Stamou, S., Ntoulas, A., Krikos, V., Kokosis, P., and Christodoulakis, D. 2006. Classifying web data in directory structures. In Frontiers of WWW Research and Development—APWeb, 8th Asia-Pacific Web Conference. Lecture Notes in Computer Science, vol. 3841. Springer, 238--249.
|
 |
43
|
|
|