ABSTRACT
The Web has been rapidly "deepened" by the prevalence of databases online. With the potentially unlimited information hidden behind their query interfaces, this "deep Web" of searchable databases is clearly an important frontier for data access. This paper surveys this relatively unexplored frontier, measuring characteristics pertinent to both exploring and integrating structured Web sources. On one hand, our "macro" study surveys the deep Web at large, in April 2004, adopting the random IP-sampling approach, with one million samples. (How large is the deep Web? How is it covered by current directory services?) On the other hand, our "micro" study surveys source-specific characteristics over 441 sources in eight representative domains, in December 2002. (How "hidden" are deep-Web sources? How do search engines cover their data? How complex and expressive are query forms?) We report our observations and publish the resulting datasets to the research community. We conclude with several implications (of our own) which, while necessarily subjective, might help shape research directions and solutions.