Stanford WebBase components and applications
Source: ACM Transactions on Internet Technology (TOIT)
Volume 6, Issue 2 (May 2006)
Pages: 153-186
Year of Publication: 2006
ISSN: 1533-5399
Authors
Junghoo Cho  Stanford University, Los Angeles, CA
Hector Garcia-Molina  Stanford University, Stanford, CA
Taher Haveliwala  Stanford University, Mountain View, CA
Wang Lam  Stanford University, Mountain View, CA
Andreas Paepcke  Stanford University, Stanford, CA
Sriram Raghavan  Stanford University, San Jose, CA
Gary Wesley  Stanford University, Stanford, CA
Publisher
ACM  New York, NY, USA
DOI: http://doi.acm.org/10.1145/1149121.1149124

ABSTRACT

We describe the design and performance of WebBase, a tool for Web research. The system includes a highly customizable crawler, a repository for collected Web pages, an indexer for both text and link-related page features, and a high-speed content distribution facility. The distribution module enables researchers world-wide to retrieve pages from WebBase, and stream them across the Internet at high speed. The advantage for the researchers is that they need not all crawl the Web before beginning their research. WebBase has been used by scores of research and teaching organizations world-wide, mostly for investigations into Web topology and linguistic content analysis. After describing the system's architecture, we explain our engineering decisions for each of the WebBase components, and present respective performance measurements.





REVIEW

Reviewer: Jie Tang

Stanford WebBase, a Web search and retrieval tool, has been used by scores of research and teaching organizations, mostly for investigations into Web topology and linguistic content analysis. This paper describes the WebBase system, presenting its ...
