Design of a next generation sampling service for large scale data analysis applications
Source: International Conference on Supercomputing
Proceedings of the 19th annual international conference on Supercomputing
Cambridge, Massachusetts
SESSION: Session 3: sampling
Pages: 91-100
Year of Publication: 2005
ISBN: 1-59593-167-8
Authors
H. Wang, The Ohio State University, Columbus, OH
S. Parthasarathy, The Ohio State University, Columbus, OH
A. Ghoting, The Ohio State University, Columbus, OH
S. Tatikonda, The Ohio State University, Columbus, OH
G. Buehrer, The Ohio State University, Columbus, OH
T. Kurc, The Ohio State University, Columbus, OH
J. Saltz, The Ohio State University, Columbus, OH
Sponsor
SIGARCH: ACM Special Interest Group on Computer Architecture
Publisher
ACM, New York, NY, USA
DOI: http://doi.acm.org/10.1145/1088149.1088162

ABSTRACT

Advances in data collection and storage technologies have resulted in large and dynamically growing data sets at many organizations. Database and data mining researchers often use sampling with great effect to scale up performance on these data sets with small cost to accuracy. However, existing techniques often ignore the cost of computing a sample. This cost is often linear in the size of the data set, not the sample, which is expensive. Furthermore, for data mining applications that leverage progressive sampling or bootstrapping-based techniques, this cost can be prohibitive, since they require the generation of multiple samples.

To address this problem, we present a solution in the context of a state-of-the-art data analysis center. Specifically, we propose a scalable service that supports sample generation with cost linear in the size of the sample. We then present an efficient parallelization of this service. Our solution leverages high-speed interconnects (e.g., Myrinet, InfiniBand) for parallel I/O operations with pipelined data transfers. We export an interface that supports both ad-hoc SQL-like querying for database applications and a stand-alone service for data mining applications. We then evaluate our work using queries abstracted from a network monitoring and analysis application, which uses both database and progressive sampling queries. We demonstrate that our implementation achieves good load balance and realizes up to an order of magnitude speedup when compared with extant approaches.
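The distinction the abstract draws between sampling cost linear in the data set and cost linear in the sample can be illustrated with a small sketch. This is not the service described in the paper; it is a minimal single-machine illustration that assumes fixed-width binary records, so that record i can be located by a seek instead of a scan. The names `RECORD`, `write_records`, and `sample_records` are hypothetical and chosen only for this example.

```python
import os
import random
import struct
import tempfile

# Assumption for illustration: each record is one little-endian 64-bit integer.
RECORD = struct.Struct("<q")


def write_records(path, values):
    """Write values as fixed-width records so record i starts at byte i * RECORD.size."""
    with open(path, "wb") as f:
        for v in values:
            f.write(RECORD.pack(v))


def sample_records(path, k, seed=None):
    """Draw k distinct records uniformly at random, reading only those k records."""
    n = os.path.getsize(path) // RECORD.size
    if k > n:
        raise ValueError("sample size exceeds number of records")
    rng = random.Random(seed)
    # Pick k distinct indices without enumerating all n records;
    # sorting keeps the seeks in ascending file order.
    indices = sorted(rng.sample(range(n), k))
    out = []
    with open(path, "rb") as f:
        for i in indices:
            f.seek(i * RECORD.size)
            out.append(RECORD.unpack(f.read(RECORD.size))[0])
    return out


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        path = os.path.join(d, "data.bin")
        write_records(path, range(100_000))
        print(sample_records(path, 5, seed=0))
```

In this sketch the number of reads depends on k rather than on the file size, whereas a scan-based sampler would touch every record. Repeated calls, as progressive sampling or bootstrapping would require, each cost only their own sample size.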


