ACM Home Page
Please provide us with feedback. Feedback
A random walk approach to sampling hidden databases
Full text PdfPdf (470 KB)
Source
International Conference on Management of Data archive
Proceedings of the 2007 ACM SIGMOD international conference on Management of data table of contents
Beijing, China
SESSION: Search table of contents
Pages: 629 - 640  
Year of Publication: 2007
ISBN:978-1-59593-686-8
Authors
Arjun Dasgupta  University of Texas at Arlington, Arlington, TX
Gautam Das  University of Texas at Arlington, Arlington, TX
Heikki Mannila  Helsinki University of Technology and University of Helsinki, Helsinki, Finland
Sponsors
ACM: Association for Computing Machinery
SIGMOD: ACM Special Interest Group on Management of Data
Publisher
ACM  New York, NY, USA
Bibliometrics
Downloads (6 Weeks): 28,   Downloads (12 Months): 189,   Citation Count: 4
Additional Information:

abstract   references   cited by   index terms   collaborative colleagues  

Tools and Actions: Request Permissions Request Permissions    Review this Article  
DOI Bookmark: Use this link to bookmark this Article: http://doi.acm.org/10.1145/1247480.1247550
What is a DOI?

ABSTRACT

A large part of the data on the World Wide Web is hidden behind form-like interfaces. These interfaces interact with a hidden back-end database to provide answers to user queries. Generating a uniform random sample of this hidden database by using only the publicly available interface gives us access to the underlying data distribution. In this paper, we propose a random walk scheme over the query space provided by the interface to sample such databases. We discuss variants where the query space is visualized as a fixed and random ordering of attributes. We also propose techniques to further improve the sample quality by using a probabilistic rejection based approach. We conduct extensive experiments to illustrate the accuracy and efficiency of our techniques.


REFERENCES

Note: OCR errors may be found in this Reference List extracted from the full text article. ACM has opted to expose the complete List rather than only correct and linked references.

1
 
2
3
4
 
5
 
6
 
7
W. Hastings. Monte Carlo Sampling Methods using Markov Chains and their Applications. Biometrika, 57 (1), 1970, 97--109.
8
9
 
10
S. Lawrence and C. L. Giles. Accessibility of Information on the Web. Nature, 400, 1999. 107--109.
 
11
S. Lawrence and C. L. Giles. Searching the World Wide Web. Science, 280 (5360), 1998.
 
12
N. Metropolis, A. Rosenbluth, M. Rosenbluth, A. Teller, and E. Teller. Equations of State Calculations by Fast Computing Machines. J. of Chemical Physics, 21, 1953, 1087--1091.
 
13
J. Von Neumann. Various Techniques used in Connection with Random Digits. In John von Neumann, Collected Works, Volume V. Oxford, 1963.
 
14
F. Olken. Random Sampling from Databases. PhD Thesis, University of California, Berkeley, 1993.
15
 
16
17


Collaborative Colleagues:
Arjun Dasgupta: colleagues
Gautam Das: colleagues
Heikki Mannila: colleagues