ACM Home Page
Please provide us with feedback. Feedback
Efficient data reduction with EASE
Full text PdfPdf (420 KB)
Source International Conference on Knowledge Discovery and Data Mining archive
Proceedings of the ninth ACM SIGKDD international conference on Knowledge discovery and data mining table of contents
Washington, D.C.
SESSION: Research track table of contents
Pages: 59 - 68  
Year of Publication: 2003
ISBN:1-58113-737-0
Authors
Hervé Brönnimann  Comp & Info Sci Polytechnic Univ., Brooklyn, NY
Bin Chen  Exelixis Inc., San Francisco, CA
Manoranjan Dash  Elect & Comp Engg, Evanston, IL, Northwestern Univ.
Peter Haas  IBM Almaden, San Jose, CA
Peter Scheuermann  Elect & Comp Engg, Evanston, IL, Northwestern Univ.
Sponsors
SIGKDD: ACM Special Interest Group on Knowledge Discovery in Data
SIGMOD: ACM Special Interest Group on Management of Data
Publisher
ACM  New York, NY, USA
Bibliometrics
Downloads (6 Weeks): 7,   Downloads (12 Months): 46,   Citation Count: 10
Additional Information:

abstract   references   cited by   index terms   collaborative colleagues  

Tools and Actions: Request Permissions Request Permissions    Review this Article  
DOI Bookmark: Use this link to bookmark this Article: http://doi.acm.org/10.1145/956750.956761
What is a DOI?

ABSTRACT

A variety of mining and analysis problems --- ranging from association-rule discovery to contingency table analysis to materialization of certain approximate datacubes --- involve the extraction of knowledge from a set of categorical count data. Such data can be viewed as a collection of "transactions," where a transaction is a fixed-length vector of counts. Classical algorithms for solving count-data problems require one or more computationally intensive passes over the entire database and can be prohibitively slow. One effective method for dealing with this ever-worsening scalability problem is to run the algorithms on a small sample of the data. We present a new data-reduction algorithm, called EASE, for producing such a sample. Like the FAST algorithm introduced by Chen et al., EASE is especially designed for count data applications. Both EASE and FAST take a relatively large initial random sample and then deterministically produce a subsample whose "distance" --- appropriately defined --- from the complete database is minimal. Unlike FAST, which obtains the final subsample by quasi-greedy descent, EASE uses epsilon-approximation methods to obtain the final subsample by a process of repeated halving. Experiments both in the context of association rule mining and classical χ2 contingency-table analysis show that EASE outperforms both FAST and simple random sampling, sometimes dramatically.


REFERENCES

Note: OCR errors may be found in this Reference List extracted from the full text article. ACM has opted to expose the complete List rather than only correct and linked references.

1
 
2
 
3
N. Alon and J. H. Spencer. The probabilistic method. Wiley Interscience, New York, 1992.
 
4
 
5
H. Bronnimann, B. Chen, M. Dash, and P. Haas, P. Scheuermann. Efficient data reduction method for on-line association rule discovery. In Proceedings of NSF Workshop on New Generation of Data Mining, 2002.
 
6
7
 
8
9
 
10
G. S. Manku and R. Motwani. Approximate frequency counts over data streams. In Proceedings of International Conference on Very Large Databases (VLDB), 2002.
 
11
 
12
13
 
14

CITED BY  10

Collaborative Colleagues:
Hervé Brönnimann: colleagues
Bin Chen: colleagues
Manoranjan Dash: colleagues
Peter Haas: colleagues
Peter Scheuermann: colleagues