| A random walk approach to sampling hidden databases |
| Full text |
Pdf
(470 KB)
|
Source
|
International Conference on Management of Data
archive
Proceedings of the 2007 ACM SIGMOD international conference on Management of data
table of contents
Beijing, China
Pages: 629 - 640
Year of Publication: 2007
ISBN:978-1-59593-686-8
|
|
Authors
|
|
Arjun Dasgupta
|
University of Texas at Arlington, Arlington, TX
|
|
Gautam Das
|
University of Texas at Arlington, Arlington, TX
|
|
Heikki Mannila
|
Helsinki University of Technology and University of Helsinki, Helsinki, Finland
|
|
| Sponsors |
|
| Publisher |
|
| Bibliometrics |
Downloads (6 Weeks): 28, Downloads (12 Months): 189, Citation Count: 4
|
|
|
ABSTRACT
A large part of the data on the World Wide Web is hidden behind form-like interfaces. These interfaces interact with a hidden back-end database to provide answers to user queries. Generating a uniform random sample of this hidden database by using only the publicly available interface gives us access to the underlying data distribution. In this paper, we propose a random walk scheme over the query space provided by the interface to sample such databases. We discuss variants where the query space is visualized as a fixed and random ordering of attributes. We also propose techniques to further improve the sample quality by using a probabilistic rejection based approach. We conduct extensive experiments to illustrate the accuracy and efficiency of our techniques.
REFERENCES
Note: OCR errors may be found in this Reference List extracted from the full text article. ACM has opted to expose the complete List rather than only correct and linked references.
 |
1
|
|
| |
2
|
|
 |
3
|
Jamie Callan , Margaret Connell , Aiqun Du, Automatic discovery of language models for text databases, Proceedings of the 1999 ACM SIGMOD international conference on Management of data, p.479-490, May 31-June 03, 1999, Philadelphia, Pennsylvania, United States
|
 |
4
|
|
| |
5
|
|
| |
6
|
|
| |
7
|
W. Hastings. Monte Carlo Sampling Methods using Markov Chains and their Applications. Biometrika, 57 (1), 1970, 97--109.
|
 |
8
|
|
 |
9
|
Panagiotis G. Ipeirotis , Luis Gravano , Mehran Sahami, Probe, count, and classify: categorizing hidden web databases, Proceedings of the 2001 ACM SIGMOD international conference on Management of data, p.67-78, May 21-24, 2001, Santa Barbara, California, United States
|
| |
10
|
S. Lawrence and C. L. Giles. Accessibility of Information on the Web. Nature, 400, 1999. 107--109.
|
| |
11
|
S. Lawrence and C. L. Giles. Searching the World Wide Web. Science, 280 (5360), 1998.
|
| |
12
|
N. Metropolis, A. Rosenbluth, M. Rosenbluth, A. Teller, and E. Teller. Equations of State Calculations by Fast Computing Machines. J. of Chemical Physics, 21, 1953, 1087--1091.
|
| |
13
|
J. Von Neumann. Various Techniques used in Connection with Random Digits. In John von Neumann, Collected Works, Volume V. Oxford, 1963.
|
| |
14
|
F. Olken. Random Sampling from Databases. PhD Thesis, University of California, Berkeley, 1993.
|
 |
15
|
|
| |
16
|
|
 |
17
|
|
CITED BY 4
|
|
|
|
|
Anirban Maiti , Arjun Dasgupta , Nan Zhang , Gautam Das, HDSampler: revealing data behind web form interfaces, Proceedings of the 35th SIGMOD international conference on Management of data, June 29-July 02, 2009, Providence, Rhode Island, USA
|
|
|
|
|
|
Arjun Dasgupta , Nan Zhang , Gautam Das , Surajit Chaudhuri, Privacy preservation of aggregates in hidden databases: why and how?, Proceedings of the 35th SIGMOD international conference on Management of data, June 29-July 02, 2009, Providence, Rhode Island, USA
|
|