PROXIMUS: a framework for analyzing very high dimensional discrete-attributed datasets
Source: International Conference on Knowledge Discovery and Data Mining
Proceedings of the ninth ACM SIGKDD international conference on Knowledge discovery and data mining
Washington, D.C.
SESSION: Research track
Pages: 147 - 156
Year of Publication: 2003
ISBN: 1-58113-737-0
Authors
Mehmet Koyutürk  West Lafayette, IN
Ananth Grama  West Lafayette, IN
Sponsors
SIGKDD: ACM Special Interest Group on Knowledge Discovery in Data
SIGMOD: ACM Special Interest Group on Management of Data
Publisher
ACM  New York, NY, USA
DOI: http://doi.acm.org/10.1145/956750.956770

ABSTRACT

This paper presents an efficient framework for error-bounded compression of high-dimensional discrete-attributed datasets. Such datasets arise frequently in a wide variety of applications and pose some of the most significant challenges in data analysis. Subsampling and compression are two key technologies for analyzing them. PROXIMUS provides a technique for reducing large datasets to a much smaller set of representative patterns, on which traditional (expensive) analysis algorithms can be applied with minimal loss of accuracy. We show desirable properties of PROXIMUS in terms of runtime, scalability to large datasets, and ability to represent data in a compact form. We also demonstrate applications of PROXIMUS in association rule mining. In doing so, we establish PROXIMUS as a tool for preprocessing data before applying computationally expensive algorithms or as a tool for directly extracting correlated patterns. Our experimental results show that using the compressed data for association rule mining yields precision and recall values near 100% across a range of support thresholds while drastically reducing the time required for association rule mining.
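
The precision and recall figures mentioned in the abstract can be read as a comparison between two rule sets: the rules mined from the original data and the rules mined from the compressed data. The following is a minimal sketch of that comparison, not the paper's evaluation code. The rule representation, the helper names (`rule`, `precision_recall`), the example rules, and the convention of returning 1.0 for an empty set are all assumptions made for illustration.

```python
from typing import FrozenSet, Iterable, Set, Tuple

# A rule is modeled as (antecedent, consequent); this encoding is an assumption.
Rule = Tuple[FrozenSet[str], FrozenSet[str]]


def rule(antecedent: Iterable[str], consequent: Iterable[str]) -> Rule:
    """Build a hashable rule from two collections of item names."""
    return frozenset(antecedent), frozenset(consequent)


def precision_recall(reference: Set[Rule], mined: Set[Rule]) -> Tuple[float, float]:
    """Compare rules mined from compressed data against a reference rule set.

    precision = |reference & mined| / |mined|
    recall    = |reference & mined| / |reference|
    Returning 1.0 when a denominator is empty is an assumed convention.
    """
    true_positives = len(reference & mined)
    precision = true_positives / len(mined) if mined else 1.0
    recall = true_positives / len(reference) if reference else 1.0
    return precision, recall


# Hypothetical rule sets, invented purely to exercise the function.
rules_from_original = {
    rule({"a"}, {"b"}),
    rule({"b"}, {"c"}),
    rule({"a", "c"}, {"d"}),
    rule({"d"}, {"a"}),
}
rules_from_compressed = {
    rule({"a"}, {"b"}),
    rule({"b"}, {"c"}),
    rule({"a", "c"}, {"d"}),
    rule({"c"}, {"a"}),
}

precision, recall = precision_recall(rules_from_original, rules_from_compressed)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

In this toy example three of the four rules agree, so both measures are 0.75. Values near 100%, as the abstract reports, would mean the two rule sets are almost identical.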


