|
ABSTRACT
The Boolean semantics of SQL queries cannot adequately capture the "fuzzy" preferences and "soft" criteria required in non-traditional data retrieval applications. One way to solve this problem is to add a flavor of "information retrieval" into database queries by allowing fuzzy query conditions and flexibly supporting grouping and ranking of the query results within the DBMS engine. While ranking is already supported by all major commercial DBMSs natively, support of flexibly grouping is still very limited (i.e., group-by). In this paper, we propose to generalize group-by to enable flexible grouping (clustering specifically) of the query results. Different from clustering in data mining applications, our focus is on supporting efficient clustering of Boolean results generated at query time. Moreover, we propose to integrate ranking and clustering with Boolean conditions, forming a new type of ClusterRank query to allow structured data retrieval. Such an integration is nontrivial in terms of both semantics and query processing. We investigate various semantics of this type of queries. To process such queries, a straightforward approach is to simply glue the techniques developed for ranking-only and clustering-only together. This approach is costly since both ranking and clustering are treated as blocking post-processing tasks upon Boolean query results by existing techniques. We propose a summary-based evaluation method that utilizes bitmap index to seamlessly integrate Boolean conditions, clustering, and ranking. Experimental study shows that our approach significantly outperforms the straightforward one and maintains high clustering quality.
REFERENCES
Note: OCR errors may be found in this Reference List extracted from the full text article. ACM has opted to expose the complete List rather than only correct and linked references.
| |
1
|
S. Agrawal, S. Chaudhuri, G. Das, and A. Gionis. Automated ranking of database query results. In CIDR, 2003.
|
| |
2
|
P. Berkhin. Survey of clustering data mining techniques. Technical report, Accrue Software, San Jose, CA, 2002.
|
| |
3
|
M. J. Carey and D. Kossmann. On saying "enough already!" in SQL. In SIGMOD, pages 219--230, 1997.
|
 |
4
|
|
| |
5
|
C. Y. Chan and Y. E. Ioannidis. An efficient bitmap encoding scheme for selection queries. In SIGMOD, 1999.
|
| |
6
|
|
| |
7
|
S. Chaudhuri, R. Ramakrishnan, and G. Weikum. Integrating DB and IR technologies: What is the sound of one hand clapping? In CIDR, pages 1--12, 2005.
|
| |
8
|
D. Donjerkovic and R. Ramakrishnan. Probabilistic optimization of top n queries. In VLDB, 1999.
|
| |
9
|
R. Fagin, A. Lote, and M. Naor. Optimal aggregation algorithms for middleware. In PODS, 2001.
|
 |
10
|
|
 |
11
|
Venkatesh Ganti , Johannes Gehrke , Raghu Ramakrishnan, CACTUS—clustering categorical data using summaries, Proceedings of the fifth ACM SIGKDD international conference on Knowledge discovery and data mining, p.73-83, August 15-18, 1999, San Diego, California, United States
[doi> 10.1145/312129.312201]
|
 |
12
|
Martin Gavrilov , Dragomir Anguelov , Piotr Indyk , Rajeev Motwani, Mining the stock market (extended abstract): which measure is best?, Proceedings of the sixth ACM SIGKDD international conference on Knowledge discovery and data mining, p.487-496, August 20-23, 2000, Boston, Massachusetts, United States
[doi> 10.1145/347090.347189]
|
| |
13
|
J. Han and M. Kamber. Data Mining: Concepts and Techniques. Morgan Kaufmann, New York, 2000.
|
| |
14
|
V. Hristidis, N. Koudas, and Y. Papakonstantinou. PREFER: A system for the efficient execution of multi-parametric ranked queries. SIGMOD, 2001.
|
| |
15
|
I. F. Ilyas, W. G. Aref, and A. K. Elmagarmid. Supporting top-k join queries in relational databases. In VLDB, pages 754--765, 2003.
|
 |
16
|
|
| |
17
|
K. Kerdprasop, N. Kerdprasop, and P. Sattayatham. Weighted k-means for density-biased clustering. In DaWaK, pages 488--497, 2005.
|
 |
18
|
|
| |
19
|
A. Leuski and J. Allan. Improving interactive retrieval by combining ranked lists and clustering. In RIAO, pages 665--681, 2000.
|
 |
20
|
|
| |
21
|
|
| |
22
|
|
 |
23
|
|
| |
24
|
P. E. O'Neil and D. Quass. Improved query performance with variant indexes. In SIGMOD, pages 38--49, 1997.
|
| |
25
|
|
| |
26
|
R. R. Sinha, S. Mitra, and M. Winslett. Bitmap indexes for large scientific data sets: A case study. In IPDPS, 2006.
|
| |
27
|
|
 |
28
|
|
| |
29
|
|
| |
30
|
F. Zemke, K. Kulkarni, A. Witkowski, and B. Lyle. Introduction to OLAP function. Change proposal. ANS-NCTS H2-99-14 (April), 1999.
|
| |
31
|
T. Zhang, R. Ramakrishnan, and M. Livny. BIRCH: An efficient data clustering method for very large databases. In SIGMOD, pages 103--114, 1996.
|
CITED BY 4
|
|
|
|
|
|
|
|
Ka Cheung Sia , Junghoo Cho , Yun Chi , Belle L. Tseng, Efficient computation of personal aggregate queries on blogs, Proceeding of the 14th ACM SIGKDD international conference on Knowledge discovery and data mining, August 24-27, 2008, Las Vegas, Nevada, USA
|
|
|
|
|