High-performance priority queues for parallel crawlers
Workshop On Web Information And Data Management
Proceeding of the 10th ACM workshop on Web information and data management
Napa Valley, California, USA
SESSION: System issues
Pages 47-54
Year of Publication: 2008
ISBN: 978-1-60558-260-3
ABSTRACT
Large-scale data centers for crawlers can maintain a very large number of active HTTP connections in order to download, as fast as possible, the usually huge number of web pages in given sections of the WWW. This generates a continuous stream of new URLs of documents to be downloaded, and the associated workload can only be served efficiently with proper parallel computing techniques. The incoming URLs have to be organized by a priority measure so that the most relevant documents are downloaded first. This paper addresses how to manage these URLs efficiently, together with related synchronization issues such as URLs downloaded by different processing nodes in a cluster of computers. We propose efficient and scalable strategies that combine intra-node multi-core multi-threading with an inter-node distributed-memory environment, including efficient use of secondary memory.
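To make the idea of a priority-ordered URL stream concrete, the following is a minimal single-node sketch and not the strategy proposed in the paper. It assumes a numeric priority where a higher value means a more relevant document, and it uses a hypothetical class name `URLFrontier`, a binary heap, and one lock shared by all threads. The inter-node and secondary-memory aspects mentioned in the abstract are not modeled here.

```python
import heapq
import itertools
import threading


class URLFrontier:
    """Thread-safe priority queue of URLs; the highest priority is served first.

    Illustrative only: the name, the single-lock design and the in-memory
    'seen' set are assumptions, not details taken from the paper.
    """

    def __init__(self):
        self._heap = []                    # entries: (-priority, seq, url)
        self._seen = set()                 # URLs that were already enqueued
        self._counter = itertools.count()  # FIFO tie-breaker for equal priorities
        self._lock = threading.Lock()

    def push(self, url, priority):
        """Enqueue url unless it was seen before; return True if it was added."""
        with self._lock:
            if url in self._seen:
                return False
            self._seen.add(url)
            # heapq is a min-heap, so the priority is negated.
            heapq.heappush(self._heap, (-priority, next(self._counter), url))
            return True

    def pop(self):
        """Return the highest-priority URL, or None if the frontier is empty."""
        with self._lock:
            if not self._heap:
                return None
            _, _, url = heapq.heappop(self._heap)
            return url

    def __len__(self):
        with self._lock:
            return len(self._heap)
```

In this sketch, downloader threads would call `pop()` to obtain the next URL and `push()` for every link they extract. The `seen` set prevents a URL from being queued twice on the same node. Avoiding duplicate downloads across nodes would need additional coordination that this example does not attempt.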