Sitemaps: above and beyond the crawl of duty
Source: Proceedings of the 18th International Conference on World Wide Web (International World Wide Web Conference), Madrid, Spain
Session: XML and web data / XML extraction and crawling
Pages: 991-1000
Year of Publication: 2009
ISBN: 978-1-60558-487-4
ABSTRACT
Comprehensive coverage of the public web is crucial to web search engines. Search engines use crawlers to retrieve pages and then discover new ones by extracting the pages' outgoing links. However, the set of pages reachable from the publicly linked web is estimated to be significantly smaller than the invisible web, the set of documents that have no incoming links and can only be retrieved through web applications and web forms. The Sitemaps protocol is a fast-growing web protocol supported jointly by major search engines to help content creators and search engines unlock this hidden data by making it available to search engines. In this paper, we perform a detailed study of how "classic" discovery crawling compares with Sitemaps, in key measures such as coverage and freshness over key representative websites as well as over billions of URLs seen at Google. We observe that Sitemaps and discovery crawling complement each other very well, and offer different tradeoffs.
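To make the comparison in the abstract concrete, the following is a minimal sketch, not the paper's method. It assumes a Sitemap file in the standard sitemaps.org XML format and a hypothetical set of URLs found by following links; the `example.com` URLs, the function names, and the definition of coverage as "fraction of all known URLs found by one method" are illustrative assumptions.

```python
import xml.etree.ElementTree as ET

# Namespace used by the sitemaps.org 0.9 schema.
SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"

# Hypothetical Sitemap for an illustrative site.
EXAMPLE_SITEMAP = """
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>http://example.com/</loc></url>
  <url><loc>http://example.com/about</loc></url>
  <url><loc>http://example.com/search?q=widgets</loc></url>
</urlset>
"""


def sitemap_urls(xml_text):
    """Return the set of URLs listed in <loc> elements of a Sitemap."""
    root = ET.fromstring(xml_text.strip())
    return {loc.text.strip() for loc in root.iter(SITEMAP_NS + "loc") if loc.text}


def coverage(found, reference):
    """Fraction of the reference URL set that `found` contains (assumed metric)."""
    if not reference:
        return 0.0
    return len(found & reference) / len(reference)


# Hypothetical result of a link-following ("discovery") crawl.
discovered = {
    "http://example.com/",
    "http://example.com/about",
    "http://example.com/blog/old-post",
}

listed = sitemap_urls(EXAMPLE_SITEMAP)
all_known = listed | discovered

sitemap_coverage = coverage(listed, all_known)
discovery_coverage = coverage(discovered, all_known)
only_in_sitemap = listed - discovered      # pages with no incoming links found
only_in_discovery = discovered - listed    # pages the Sitemap omits
```

In this toy setup each method finds one URL the other misses, which is the kind of complementarity the abstract describes; real measurements would need the definitions and data used in the paper itself.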