High-performance priority queues for parallel crawlers
Source
Workshop On Web Information And Data Management
Proceeding of the 10th ACM workshop on Web information and data management
Napa Valley, California, USA
SESSION: System issues
Pages 47-54  
Year of Publication: 2008
ISBN: 978-1-60558-260-3
Authors
Mauricio Marin  Yahoo! Research Latin America, Santiago, Chile
Rodrigo Paredes  Yahoo! Research Latin America, Santiago, Chile
Carolina Bonacic  Complutense University of Madrid, Madrid, Spain
Sponsors
SIGWEB: ACM Special Interest Group on Hypertext, Hypermedia, and Web
SIGIR: ACM Special Interest Group on Information Retrieval
ACM: Association for Computing Machinery
Publisher
ACM  New York, NY, USA
Bibliometrics
Downloads (6 Weeks): 5,   Downloads (12 Months): 86,   Citation Count: 0
DOI: http://doi.acm.org/10.1145/1458502.1458511

ABSTRACT

Large-scale data centers for crawlers can maintain a very large number of active HTTP connections in order to download, as fast as possible, the typically huge number of web pages in given sections of the WWW. This generates a continuous stream of new URLs of documents to be downloaded, and the associated workload can only be served efficiently with proper parallel computing techniques. The incoming URLs have to be organized by a priority measure so that the most relevant documents are downloaded first. This paper addresses how to manage these URLs efficiently, along with related synchronization issues such as URLs being downloaded by different processing nodes of a cluster of computers. We propose efficient and scalable strategies that combine intra-node multi-core multi-threading with an inter-node distributed-memory environment, including efficient use of secondary memory.
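
To make the idea of ordering incoming URLs by a priority measure concrete, the following is a minimal single-threaded sketch built on a binary heap. It is not the strategy proposed in the paper, which targets multi-core nodes, distributed memory, and secondary storage. The class name `UrlFrontier`, the numeric `score` values, and the example URLs are all hypothetical and chosen only for illustration.

```python
import heapq
import itertools


class UrlFrontier:
    """Toy priority queue of URLs; a higher score is downloaded first."""

    def __init__(self):
        self._heap = []                    # entries: (-score, seq, url)
        self._seen = set()                 # every URL ever enqueued
        self._counter = itertools.count()  # tie-breaker: FIFO among equal scores

    def push(self, url, score):
        """Enqueue url unless it was seen before; return True if enqueued."""
        if url in self._seen:
            return False
        self._seen.add(url)
        # heapq is a min-heap, so the score is negated to pop the largest first.
        heapq.heappush(self._heap, (-score, next(self._counter), url))
        return True

    def pop(self):
        """Remove and return the highest-priority URL."""
        _, _, url = heapq.heappop(self._heap)
        return url

    def __len__(self):
        return len(self._heap)


# Hypothetical usage: scores stand in for whatever priority measure is used.
frontier = UrlFrontier()
frontier.push("http://example.org/a", 0.2)
frontier.push("http://example.org/b", 0.9)
frontier.push("http://example.org/a", 0.5)  # duplicate URL, ignored
order = [frontier.pop() for _ in range(len(frontier))]
```

The `_seen` set is the simplest possible stand-in for the synchronization concern mentioned in the abstract: within one process it prevents a URL from being queued twice, whereas coordinating this across threads and cluster nodes is the harder problem the paper is concerned with.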


