Stanford WebBase components and applications
Source
ACM Transactions on Internet Technology (TOIT)
Volume 6, Issue 2 (May 2006)
Pages: 153-186
Year of Publication: 2006
ISSN: 1533-5399
Authors
Junghoo Cho, Stanford University, Los Angeles, CA
Hector Garcia-Molina, Stanford University, Stanford, CA
Taher Haveliwala, Stanford University, Mountain View, CA
Wang Lam, Stanford University, Mountain View, CA
Andreas Paepcke, Stanford University, Stanford, CA
Sriram Raghavan, Stanford University, San Jose, CA
Gary Wesley, Stanford University, Stanford, CA
Bibliometrics
Downloads (6 Weeks): 4, Downloads (12 Months): 126, Citation Count: 13
ABSTRACT
We describe the design and performance of WebBase, a tool for Web research. The system includes a highly customizable crawler, a repository for collected Web pages, an indexer for both text and link-related page features, and a high-speed content distribution facility. The distribution module enables researchers world-wide to retrieve pages from WebBase, and stream them across the Internet at high speed. The advantage for the researchers is that they need not all crawl the Web before beginning their research. WebBase has been used by scores of research and teaching organizations world-wide, mostly for investigations into Web topology and linguistic content analysis. After describing the system's architecture, we explain our engineering decisions for each of the WebBase components, and present respective performance measurements.
CITED BY 13
Aydin Buluç, Jeremy T. Fineman, Matteo Frigo, John R. Gilbert, Charles E. Leiserson, Parallel sparse matrix-vector and matrix-transpose-vector multiplication using compressed sparse blocks, Proceedings of the twenty-first annual symposium on Parallelism in algorithms and architectures, August 11-13, 2009, Calgary, AB, Canada
Lan Nie, Brian D. Davison, Baoning Wu, From whence does your authority come?: utilizing community relevance in ranking, Proceedings of the 22nd national conference on Artificial intelligence, p.1421-1426, July 22-26, 2007, Vancouver, British Columbia, Canada
REVIEW
Reviewer: Jie Tang
Stanford WebBase, a Web search and retrieval tool, has been used by scores of research and teaching organizations, mostly for investigations into Web topology and linguistic content analysis. This paper describes the WebBase system, presenting its ...