ABSTRACT
In a data mining project, a significant portion of time is devoted to building a data set suitable for analysis. In a relational database environment, building such a data set usually requires joining tables and aggregating columns with SQL queries. Existing SQL aggregations are limited because they return a single number per aggregated group, producing one row for each computed number. These aggregations help, but significant effort is still required to build data sets suitable for data mining, where a tabular format is generally required. This work proposes simple yet powerful extensions to SQL aggregate functions that produce aggregations in tabular form, returning a set of numbers instead of one number per row. We call this new class of functions horizontal aggregations. Horizontal aggregations help build answer sets in tabular form (e.g., point-dimension, observation-variable, instance-feature), which is the standard form needed by most data mining algorithms. Two common data preparation tasks are explained: transposition/aggregation and transforming categorical attributes into binary dimensions. We propose two strategies to evaluate horizontal aggregations using standard SQL. The first strategy is based only on relational operators and the second one uses the "case" construct. Experiments with large data sets study the proposed query optimization strategies.
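The contrast between a standard aggregation and a horizontal one, and the two evaluation strategies named in the abstract, can be illustrated with a minimal sketch. It is only an approximation under stated assumptions: the table `sales` and its columns `store`, `day`, and `amount` are hypothetical, SQLite (through Python's `sqlite3` module) stands in for the database system, and the extended aggregate syntax proposed in the paper is not shown because the abstract does not give it. The sketch generates plain standard SQL in two ways, once with CASE expressions and once with outer joins over per-value aggregates, and checks that both produce the same one-row-per-group table.

```python
import sqlite3

# Hypothetical example data: one row per sale (names are assumptions).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sales (store TEXT, day TEXT, amount REAL);
    INSERT INTO sales VALUES
        ('A', 'Mon', 10), ('A', 'Mon', 5), ('A', 'Tue', 7),
        ('B', 'Mon', 3),  ('B', 'Wed', 4);
""")

# Standard ("vertical") aggregation: one row per (store, day) group.
vertical = conn.execute(
    "SELECT store, day, SUM(amount) FROM sales "
    "GROUP BY store, day ORDER BY store, day"
).fetchall()

# The distinct values of the transposed column become the result columns.
# They are interpolated into the SQL text only because this is a sketch
# over data we created ourselves.
days = [row[0] for row in conn.execute(
    "SELECT DISTINCT day FROM sales ORDER BY day")]

# Strategy based on CASE: one conditional SUM per distinct day.
case_cols = ", ".join(
    f"SUM(CASE WHEN day = '{d}' THEN amount END) AS \"{d}\"" for d in days
)
case_sql = (
    f"SELECT store, {case_cols} FROM sales GROUP BY store ORDER BY store"
)
horizontal_case = conn.execute(case_sql).fetchall()

# Strategy based only on relational operators: aggregate each day
# separately, then left outer join the partial results to the stores.
join_cols = ", ".join(
    f"t{i}.total AS \"{d}\"" for i, d in enumerate(days)
)
joins = " ".join(
    f"LEFT OUTER JOIN (SELECT store, SUM(amount) AS total FROM sales "
    f"WHERE day = '{d}' GROUP BY store) AS t{i} ON t{i}.store = s.store"
    for i, d in enumerate(days)
)
join_sql = (
    f"SELECT s.store, {join_cols} "
    f"FROM (SELECT DISTINCT store FROM sales) AS s {joins} "
    f"ORDER BY s.store"
)
horizontal_join = conn.execute(join_sql).fetchall()

print(vertical)         # four rows, one per (store, day) group
print(horizontal_case)  # two rows, one per store, one column per day
print(horizontal_join)  # same table as the CASE version
```

In the horizontal result, a store with no sales on a given day gets NULL in that column (`None` in Python), which is what the outer join produces naturally and what the CASE expression without an ELSE branch yields as well.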