Programming the K-means clustering algorithm in SQL
Source: International Conference on Knowledge Discovery and Data Mining
Proceedings of the tenth ACM SIGKDD international conference on Knowledge discovery and data mining
Seattle, WA, USA
POSTER SESSION: Industry/government track posters
Pages: 823-828
Year of Publication: 2004
ISBN:1-58113-888-1
Author
Carlos Ordonez, Teradata, NCR, San Diego, CA
Sponsors
SIGMOD: ACM Special Interest Group on Management of Data
SIGKDD: ACM Special Interest Group on Knowledge Discovery in Data
ACM: Association for Computing Machinery
Publisher
ACM, New York, NY, USA
DOI: http://doi.acm.org/10.1145/1014052.1016921

ABSTRACT

SQL has not generally been considered an efficient or feasible way to implement data mining algorithms. Although this holds for many data mining, machine learning and statistical algorithms, this work shows that an efficient SQL implementation of the well-known K-means clustering algorithm is feasible on top of a relational DBMS. The article emphasizes both correctness and performance. On correctness, it explains how to compute Euclidean distances, run nearest-cluster queries and update clustering results in SQL. On performance, it explains how to cluster large data sets by defining and indexing tables that store and retrieve intermediate and final results, optimizing and avoiding joins, optimizing and simplifying clustering aggregations, and taking advantage of sufficient statistics. Experiments evaluate scalability on synthetic data sets of varying size and dimensionality. The proposed K-means implementation can cluster large data sets and exhibits linear scalability.
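
The three correctness steps named in the abstract (distance computation, nearest-cluster query, and centroid update) can be illustrated with a small sketch. It is not the paper's implementation: the table names (Y, C, YD, YNN), the fixed two-dimensional layout, the tie-breaking rule, and the use of SQLite through Python's sqlite3 module are all assumptions made for illustration. The centroid update relies on two per-cluster sufficient statistics, the point count and the per-dimension sum.

```python
import sqlite3


def kmeans_sql(points, centroids, iterations=10):
    """Run K-means on 2-D points, doing each step as a SQL statement.

    Hypothetical schema (not taken from the paper):
      Y(i, y1, y2)    data points
      C(j, c1, c2)    current centroids
      YD(i, j, dist)  squared distance from point i to centroid j
      YNN(i, j)       nearest centroid j for point i
    """
    con = sqlite3.connect(":memory:")
    con.executescript("""
        CREATE TABLE Y (i INTEGER PRIMARY KEY, y1 REAL, y2 REAL);
        CREATE TABLE C (j INTEGER PRIMARY KEY, c1 REAL, c2 REAL);
        CREATE TABLE YD (i INTEGER, j INTEGER, dist REAL, PRIMARY KEY (i, j));
        CREATE TABLE YNN (i INTEGER PRIMARY KEY, j INTEGER);
    """)
    con.executemany("INSERT INTO Y VALUES (?, ?, ?)",
                    [(i, x, y) for i, (x, y) in enumerate(points)])
    con.executemany("INSERT INTO C VALUES (?, ?, ?)",
                    [(j, x, y) for j, (x, y) in enumerate(centroids)])

    for _ in range(iterations):
        con.executescript("""
            DELETE FROM YD;
            DELETE FROM YNN;

            -- Squared Euclidean distance from every point to every centroid.
            -- The square root is skipped: it does not change which is nearest.
            INSERT INTO YD
            SELECT Y.i, C.j,
                   (Y.y1 - C.c1) * (Y.y1 - C.c1) + (Y.y2 - C.c2) * (Y.y2 - C.c2)
            FROM Y CROSS JOIN C;

            -- Nearest-cluster query: keep the centroid with minimum distance
            -- for each point; ties go to the smallest centroid id.
            INSERT INTO YNN
            SELECT YD.i, MIN(YD.j)
            FROM YD
            JOIN (SELECT i, MIN(dist) AS mind FROM YD GROUP BY i) AS M
              ON YD.i = M.i AND YD.dist = M.mind
            GROUP BY YD.i;

            -- Centroid update from sufficient statistics: per-cluster count
            -- and per-dimension sum. OR REPLACE leaves the centroid of an
            -- empty cluster unchanged.
            INSERT OR REPLACE INTO C
            SELECT YNN.j, SUM(Y.y1) / COUNT(*), SUM(Y.y2) / COUNT(*)
            FROM Y JOIN YNN ON Y.i = YNN.i
            GROUP BY YNN.j;
        """)

    cents = con.execute("SELECT c1, c2 FROM C ORDER BY j").fetchall()
    labels = [j for (j,) in con.execute("SELECT j FROM YNN ORDER BY i")]
    con.close()
    return cents, labels
```

For example, calling kmeans_sql with the points (0, 0), (0, 2), (10, 10), (10, 12) and initial centroids (0, 0) and (10, 10) assigns the first two points to cluster 0 and the last two to cluster 1, with final centroids (0.0, 1.0) and (10.0, 11.0). Materializing all point-centroid distances in YD is the simplest formulation, not necessarily the fastest; the abstract's performance points (indexing, avoiding joins, simplifying aggregations) are not addressed by this sketch.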

