Sitemaps: above and beyond the crawl of duty
Source
International World Wide Web Conference archive
Proceedings of the 18th international conference on World wide web
Madrid, Spain
SESSION: XML and web data / XML extraction and crawling
Pages 991-1000  
Year of Publication: 2009
ISBN: 978-1-60558-487-4
Authors
Uri Schonfeld, UCLA Computer Science Department, Los Angeles, CA, USA
Narayanan Shivakumar, Google Inc., Mountain View, CA, USA
Sponsor
ACM: Association for Computing Machinery
Publisher
ACM, New York, NY, USA
Bibliometrics
Downloads (6 Weeks): 17,   Downloads (12 Months): 106,   Citation Count: 1
DOI: http://doi.acm.org/10.1145/1526709.1526842

ABSTRACT

Comprehensive coverage of the public web is crucial to web search engines. Search engines use crawlers to retrieve pages and then discover new ones by extracting the pages' outgoing links. However, the set of pages reachable from the publicly linked web is estimated to be significantly smaller than the invisible web: the set of documents that have no incoming links and can only be retrieved through web applications and web forms. The Sitemaps protocol is a fast-growing web protocol, supported jointly by major search engines, that helps content creators unlock this hidden data by making it available to search engines. In this paper, we perform a detailed study of how "classic" discovery crawling compares with Sitemaps on key measures such as coverage and freshness, both over representative websites and over billions of URLs seen at Google. We observe that Sitemaps and discovery crawling complement each other very well and offer different tradeoffs.
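
The contrast the abstract draws between discovery crawling and Sitemaps can be illustrated with a minimal Python sketch. It is not the authors' method or measurement code. The toy link graph `LINKS`, the sample document `SITEMAP_XML`, and the helper names `sitemap_urls` and `discovery_crawl` are all assumptions made up for illustration; only the `urlset`/`url`/`loc` element names and the namespace URI come from the Sitemaps protocol itself.

```python
import xml.etree.ElementTree as ET
from collections import deque

# Namespace defined by the Sitemaps protocol for <urlset> documents.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def sitemap_urls(xml_text):
    """Return the set of <loc> URLs listed in a Sitemaps <urlset> document."""
    root = ET.fromstring(xml_text)
    return {
        loc.text.strip()
        for loc in root.findall("sm:url/sm:loc", SITEMAP_NS)
        if loc.text
    }


def discovery_crawl(seed, links):
    """Breadth-first discovery: start at a seed and follow outgoing links."""
    seen = {seed}
    queue = deque([seed])
    while queue:
        page = queue.popleft()
        for target in links.get(page, ()):
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return seen


# Hypothetical toy site: the search-result page has no incoming links,
# so link-following alone cannot reach it.
LINKS = {
    "http://example.com/": ["http://example.com/a", "http://example.com/b"],
    "http://example.com/a": ["http://example.com/b"],
    "http://example.com/b": [],
    "http://example.com/search?q=x": [],
}

# Hypothetical sitemap: lists the hidden page but omits /b.
SITEMAP_XML = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>http://example.com/</loc></url>
  <url><loc>http://example.com/a</loc></url>
  <url><loc>http://example.com/search?q=x</loc></url>
</urlset>"""

crawled = discovery_crawl("http://example.com/", LINKS)
listed = sitemap_urls(SITEMAP_XML)

only_in_sitemap = listed - crawled   # pages with no incoming links
only_in_crawl = crawled - listed     # pages the sitemap omits
combined = crawled | listed          # coverage when both sources are used
```

In this toy setup each source reaches a page the other misses, which is one simple way to picture the complementary coverage the abstract reports. Freshness is not modeled here.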



