|
ABSTRACT
With the increased abilities for automated data collection made possible by modern technology, the typical sizes of data collections have continued to grow in recent years. In such cases, it may be desirable to store the data in a reduced format in order to improve the storage, transfer time, and processing requirements on the data. One of the challenges of designing effective data compression techniques is to be able to preserve the ability to use the reduced format directly for a wide range of database and data mining applications. In this paper, we propose the novel idea of hierarchical subspace sampling in order to create a reduced representation of the data. The method is naturally able to estimate the local implicit dimensionalities of each point very effectively, and thereby create a variable dimensionality reduced representation of the data. Such a technique has the advantage that it is very adaptive about adjusting its representation depending upon the behavior of the immediate locality of a data point. An interesting property of the subspace sampling technique is that unlike all other data reduction techniques, the overall efficiency of compression improves with increasing database size. This is a highly desirable property for any data reduction system since the problem itself is motivated by the large size of data sets. Because of its sampling approach, the procedure is extremely fast and scales linearly both with data set size and dimensionality. Furthermore, the subspace sampling technique is able to reveal important local subspace characteristics of high dimensional data which can be harnessed for effective solutions to problems such as selectivity estimation and approximate nearest neighbor search.
REFERENCES
Note: OCR errors may be found in this Reference List extracted from the full text article. ACM has opted to expose the complete List rather than only correct and linked references.
 |
1
|
|
 |
2
|
|
| |
3
|
|
 |
4
|
Shivnath Babu , Minos Garofalakis , Rajeev Rastogi, SPARTAN: a model-based semantic compression system for massive data tables, Proceedings of the 2001 ACM SIGMOD international conference on Management of data, p.283-294, May 21-24, 2001, Santa Barbara, California, United States
|
 |
5
|
Norbert Beckmann , Hans-Peter Kriegel , Ralf Schneider , Bernhard Seeger, The R*-tree: an efficient and robust access method for points and rectangles, Proceedings of the 1990 ACM SIGMOD international conference on Management of data, p.322-331, May 23-26, 1990, Atlantic City, New Jersey, United States
|
 |
6
|
|
| |
7
|
|
 |
8
|
Amol Deshpande , Minos Garofalakis , Rajeev Rastogi, Independence is good: dependency-based histogram synopses for high-dimensional data, Proceedings of the 2001 ACM SIGMOD international conference on Management of data, p.199-210, May 21-24, 2001, Santa Barbara, California, United States
|
 |
9
|
Christos Faloutsos , King-Ip Lin, FastMap: a fast algorithm for indexing, data-mining and visualization of traditional and multimedia datasets, Proceedings of the 1995 ACM SIGMOD international conference on Management of data, p.163-174, May 22-25, 1995, San Jose, California, United States
|
 |
10
|
Dimitrios Gunopulos , George Kollios , Vassilis J. Tsotras , Carlotta Domeniconi, Approximating multi-dimensional aggregate range queries over real attributes, Proceedings of the 2000 ACM SIGMOD international conference on Management of data, p.463-474, May 15-18, 2000, Dallas, Texas, United States
|
| |
11
|
K. Hoffman, R. Kunze. Linear Algebra, Prentice Hall, J, 1998.
|
| |
12
|
|
| |
13
|
H. V. Jagadish , Nick Koudas , S. Muthukrishnan , Viswanath Poosala , Kenneth C. Sevcik , Torsten Suel, Optimal Histograms with Quality Guarantees, Proceedings of the 24rd International Conference on Very Large Data Bases, p.275-286, August 24-27, 1998
|
| |
14
|
W. Johnson, J. Lindenstrauss. Extensions of Lipschitz mapping into a Hilbert space. Conference in modern analysis and probability, pages 189-206, 1984.
|
| |
15
|
I. T. Jolliffee. Principal Component Analysis, Springer-Verlag, New York, 1986.
|
 |
16
|
Nick Roussopoulos , Stephen Kelley , Frédéric Vincent, Nearest neighbor queries, Proceedings of the 1995 ACM SIGMOD international conference on Management of data, p.71-79, May 22-25, 1995, San Jose, California, United States
|
 |
17
|
Yossi Matias , Jeffrey Scott Vitter , Min Wang, Wavelet-based histograms for selectivity estimation, Proceedings of the 1998 ACM SIGMOD international conference on Management of data, p.448-459, June 01-04, 1998, Seattle, Washington, United States
|
| |
18
|
|
| |
19
|
J. Ziv. A. Lempel. A Universal Algorithm for Sequential Data Compression. IEEE Transaction on Information Theory, 23(3):337-343, 1977.
|
CITED BY 6
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
Jing Peng , Chang-jie Tang , Dong-qing Yang , Jing Zhang , Jian-jun Hu, Similarity computing model of high dimension data for symptom classification of Chinese traditional medicine, Applied Soft Computing, v.9 n.1, p.209-218, January, 2009
|
|