| Efficient data reduction with EASE |
| Full text |
Pdf
(420 KB)
|
| Source
|
International Conference on Knowledge Discovery and Data Mining
archive
Proceedings of the ninth ACM SIGKDD international conference on Knowledge discovery and data mining
table of contents
Washington, D.C.
SESSION: Research track
table of contents
Pages: 59 - 68
Year of Publication: 2003
ISBN:1-58113-737-0
|
|
Authors
|
|
Hervé Brönnimann
|
Comp & Info Sci Polytechnic Univ., Brooklyn, NY
|
|
Bin Chen
|
Exelixis Inc., San Francisco, CA
|
|
Manoranjan Dash
|
Elect & Comp Engg, Evanston, IL, Northwestern Univ.
|
|
Peter Haas
|
IBM Almaden, San Jose, CA
|
|
Peter Scheuermann
|
Elect & Comp Engg, Evanston, IL, Northwestern Univ.
|
|
| Sponsors |
|
| Publisher |
|
| Bibliometrics |
Downloads (6 Weeks): 7, Downloads (12 Months): 46, Citation Count: 10
|
|
|
ABSTRACT
A variety of mining and analysis problems --- ranging from association-rule discovery to contingency table analysis to materialization of certain approximate datacubes --- involve the extraction of knowledge from a set of categorical count data. Such data can be viewed as a collection of "transactions," where a transaction is a fixed-length vector of counts. Classical algorithms for solving count-data problems require one or more computationally intensive passes over the entire database and can be prohibitively slow. One effective method for dealing with this ever-worsening scalability problem is to run the algorithms on a small sample of the data. We present a new data-reduction algorithm, called EASE, for producing such a sample. Like the FAST algorithm introduced by Chen et al., EASE is especially designed for count data applications. Both EASE and FAST take a relatively large initial random sample and then deterministically produce a subsample whose "distance" --- appropriately defined --- from the complete database is minimal. Unlike FAST, which obtains the final subsample by quasi-greedy descent, EASE uses epsilon-approximation methods to obtain the final subsample by a process of repeated halving. Experiments both in the context of association rule mining and classical χ2 contingency-table analysis show that EASE outperforms both FAST and simple random sampling, sometimes dramatically.
REFERENCES
Note: OCR errors may be found in this Reference List extracted from the full text article. ACM has opted to expose the complete List rather than only correct and linked references.
 |
1
|
Rakesh Agrawal , Tomasz Imieliński , Arun Swami, Mining association rules between sets of items in large databases, Proceedings of the 1993 ACM SIGMOD international conference on Management of data, p.207-216, May 25-28, 1993, Washington, D.C., United States
|
| |
2
|
|
| |
3
|
N. Alon and J. H. Spencer. The probabilistic method. Wiley Interscience, New York, 1992.
|
| |
4
|
|
| |
5
|
H. Bronnimann, B. Chen, M. Dash, and P. Haas, P. Scheuermann. Efficient data reduction method for on-line association rule discovery. In Proceedings of NSF Workshop on New Generation of Data Mining, 2002.
|
| |
6
|
|
 |
7
|
|
| |
8
|
Jim Gray , Adam Bosworth , Andrew Layman , Hamid Pirahesh, Data Cube: A Relational Aggregation Operator Generalizing Group-By, Cross-Tab, and Sub-Total, Proceedings of the Twelfth International Conference on Data Engineering, p.152-159, February 26-March 01, 1996
|
 |
9
|
Jiawei Han , Jian Pei , Yiwen Yin, Mining frequent patterns without candidate generation, Proceedings of the 2000 ACM SIGMOD international conference on Management of data, p.1-12, May 15-18, 2000, Dallas, Texas, United States
|
| |
10
|
G. S. Manku and R. Motwani. Approximate frequency counts over data streams. In Proceedings of International Conference on Very Large Databases (VLDB), 2002.
|
| |
11
|
|
| |
12
|
|
 |
13
|
|
| |
14
|
|
CITED BY 10
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
Surong Wang , Manoranjan Dash , Liang-Tien Chia , Min Xu, Efficient sampling of training set in large and noisy multimedia data, ACM Transactions on Multimedia Computing, Communications, and Applications (TOMCCAP), v.3 n.3, p.14-es, August 2007
|
|
|
|
|
|
|
|
|
|
|
|
|
|