ACM Home Page
Please provide us with feedback. Feedback
Extraction and classification of dense implicit communities in the Web graph
Full text PdfPdf (680 KB)
Source
ACM Transactions on the Web (TWEB) archive
Volume 3 ,  Issue 2  (April 2009) table of contents
Article No. 7  
Year of Publication: 2009
ISSN:1559-1131
Authors
Yon Dourisboure  LIUPPA - Université de Pau et des Pays de l'Adour, Pau Cedex, France
Filippo Geraci  Istituto di Informatica e Telematica—CNR, Pisa, Italy
Marco Pellegrini  Istituto di Informatica e Telematica—CNR, Pisa, Italy
Publisher
ACM  New York, NY, USA
Bibliometrics
Downloads (6 Weeks): 39,   Downloads (12 Months): 284,   Citation Count: 0
Additional Information:

abstract   references   index terms   collaborative colleagues  

Tools and Actions: Request Permissions Request Permissions    Review this Article  
DOI Bookmark: Use this link to bookmark this Article: http://doi.acm.org/10.1145/1513876.1513879
What is a DOI?

ABSTRACT

The World Wide Web (WWW) is rapidly becoming important for society as a medium for sharing data, information, and services, and there is a growing interest in tools for understanding collective behavior and emerging phenomena in the WWW. In this article we focus on the problem of searching and classifying communities in the Web. Loosely speaking a community is a group of pages related to a common interest. More formally, communities have been associated in the computer science literature with the existence of a locally dense subgraph of the Web graph (where Web pages are nodes and hyperlinks are arcs of the Web graph). The core of our contribution is a new scalable algorithm for finding relatively dense subgraphs in massive graphs. We apply our algorithm on Web graphs built on three publicly available large crawls of the Web (with raw sizes up to 120M nodes and 1G arcs). The effectiveness of our algorithm in finding dense subgraphs is demonstrated experimentally by embedding artificial communities in the Web graph and counting how many of these are blindly found. Effectiveness increases with the size and density of the communities: it is close to 100% for communities of thirty nodes or more (even at low density). It is still about 80% even for communities of twenty nodes with density over 50% of the arcs present. At the lower extremes the algorithm catches 35% of dense communities made of ten nodes. We also develop some sufficient conditions for the detection of a community under some local graph models and not-too-restrictive hypotheses. We complete our Community Watch system by clustering the communities found in the Web graph into homogeneous groups by topic and labeling each group by representative keywords.


REFERENCES

Note: OCR errors may be found in this Reference List extracted from the full text article. ACM has opted to expose the complete List rather than only correct and linked references.

 
1
2
 
3
4
 
5
6
 
7
 
8
 
9
 
10
Capocci, A., Servedio, V. D. P., Caldarelli, G., and Colaiori, F. 2004. Communities detection in large networks. In Proceedings of the Algorithms and Models for the Web graph (WAW'04): Third International Workshop. 181--188.
 
11
 
12
Cho, J. and Garcia-Molina, H. 2000. WebBase and the Stanford interlib project. In Proceedings of the Kyoto International Conference on Digital Libraries: Research and Practice.
 
13
14
 
15
16
 
17
 
18
Feige, U., Peleg, D., and Kortsarz, G. 2001. The dense k-subgraph problem. Algorithmica 29, 3, 410--421.
19
 
20
 
21
Geraci, F., Pellegrini, M., Maggini, M., and Sebastiani, F. 2007. Cluster generation and labeling for web snippets:a fast, accurate hierarchical solution. Internet Math. 3, 4, 413--443.
22
 
23
 
24
Girvan, M. and Newman, M. E. J. 2002. Community structure in social and biological networks. Proceedings of the National Academic Science, 7821--7826.
25
26
 
27
Gyöngyi, Z. and Garcia-Molina, H. 2005. Web spam taxonomy. In Proceedings of the 1st International Workshop on Adversarial Information Retrieval on the Web.
 
28
Hastad, J. 1999. Clique is hard to approximate within n1−ε. Acta Mathematica 182, 105--142.
 
29
Haveliwala, T. H., Gionis, A., and Indyk, P. 2000. Scalable techniques for clustering the web. In Proceedings of the WebDB Workshop. 129--134.
 
30
Henzinger, M. 2002. Algorithmic challenges in Web search engines. Internet Math. 1, 1, 115--126.
 
31
32
33
 
34
 
35
 
36
Kumar, R., Raghavan, P., Rajagopalan, S., and Tomkins, A. 2005. Method and system for trawling the world-wide Web to identify implicitly-defined communities of Web pages. US patent 6886129.
 
37
 
38
39
 
40
Newman, M. 2003. The structure and function of complex networks. SIAM Rev. 45, 2, 167--256.
 
41
 
42
Stamou, S., Ntoulas, A., Krikos, V., Kokosis, P., and Christodoulakis, D. 2006. Classifying web data in directory structures. In Frontiers of WWW Research and Development—APWeb, 8th Asia-Pacific Web Conference. Lecture Notes in Computer Science, vol. 3841. Springer, 238--249.
43

Collaborative Colleagues:
Yon Dourisboure: colleagues
Filippo Geraci: colleagues
Marco Pellegrini: colleagues