|
ABSTRACT
How to mine massive datasets is a challenging problem with great potential value. Motivated by this challenge, much effort has concentrated on developing scalable versions of machine learning algorithms. However, the cost of mining large datasets is not just computational; preparing the datasets into the “right form” so that learning algorithms can be applied is usually costly, due to the human labor that is typically required and a large number of choices in data preparation, which include selecting different subsets of data and aggregating data at different granularities. We make the key observation that, for a number of practically motivated problems, these choices can be defined using database queries and analyzed in an automatic and systematic manner. Specifically, we propose a new class of data-mining problem, called bellwether analysis, in which the goal is to find a few query-defined predictors (e.g., first week sales of Peoria, IL of an item) that can be used to accurately predict the result of a target query (e.g., first year worldwide sales of the item) from a large number of queries that define candidate predictors. To make a prediction for a new item, the data needed to generate such predictors has to be collected (e.g., selling the new item in Peoria, IL for a week and collecting the sales data). A useful predictor is one that has high prediction accuracy and a low data-collection cost. We call such a cost-effective predictor a bellwether. This article introduces bellwether analysis, which integrates database query processing and predictive modeling into a single framework, and provides scalable algorithms for large datasets that cannot fit in main memory. Through a series of extensive experiments, we show that bellwethers do exist in real-world databases, and that our computation techniques achieve good efficiency on large datasets.
REFERENCES
Note: OCR errors may be found in this Reference List extracted from the full text article. ACM has opted to expose the complete List rather than only correct and linked references.
| |
1
|
Sameet Agarwal , Rakesh Agrawal , Prasad Deshpande , Ashish Gupta , Jeffrey F. Naughton , Raghu Ramakrishnan , Sunita Sarawagi, On the Computation of Multidimensional Aggregates, Proceedings of the 22th International Conference on Very Large Data Bases, p.506-521, September 03-06, 1996
|
| |
2
|
|
| |
3
|
Benjamini, Y. and Yekutieli, D. 1995. Controlling the false discovery rate: A practical and powerful approach to multiple testing. J. Royal Statist. Soc. B 57, 1, 289--300.
|
| |
4
|
Benjamini, Y. and Yekutieli, D. 2001. The control of the false discovery rate in multiple testing under dependency. Ann. Statist. 29, 4, 1165--1188.
|
 |
5
|
|
| |
6
|
Breiman, L., Friedman, J., Stone, C. J., and Olshen, R. A. 1984. Classification and Regression Trees. Chapman & Hall.
|
| |
7
|
|
| |
8
|
|
| |
9
|
|
| |
10
|
|
| |
11
|
|
| |
12
|
Lei Chen , Raghu Ramakrishnan , Paul Barford , Bee-Chung Chen , Vinod Yegneswaran, Composite subset measures, Proceedings of the 32nd international conference on Very large data bases, September 12-15, 2006, Seoul, Korea
|
| |
13
|
Chvatal, V. 1979. A greedy heuristic for the set-covering problem. Math. Oper. Res. 4, 3, 233--235.
|
| |
14
|
|
| |
15
|
|
| |
16
|
|
| |
17
|
|
| |
18
|
Efron, B. 2005. Local false discovery rates. Tech. rep. 2005-20B/234, Department of Statistics, Stanford Univerity, California.
|
| |
19
|
|
| |
20
|
|
| |
21
|
Jim Gray , Surajit Chaudhuri , Adam Bosworth , Andrew Layman , Don Reichart , Murali Venkatrao , Frank Pellow , Hamid Pirahesh, Data Cube: A Relational Aggregation Operator Generalizing Group-By, Cross-Tab, and Sub-Totals, Data Mining and Knowledge Discovery, v.1 n.1, p.29-53, 1997
[doi> 10.1023/A:1009726021843]
|
| |
22
|
|
| |
23
|
|
| |
24
|
Hamilton, J. 1994. Time Series Analysis. Princeton University Press.
|
 |
25
|
Jiawei Han , Jian Pei , Guozhu Dong , Ke Wang, Efficient computation of Iceberg cubes with complex measures, Proceedings of the 2001 ACM SIGMOD international conference on Management of data, p.1-12, May 21-24, 2001, Santa Barbara, California, United States
|
| |
26
|
|
| |
27
|
Kapoor, A. and Greiner, R. 2005. Learning and classifying under hard budgets. In Proceedings of the 16th European Conference on Machine Learning (ECML'05). Lecture Notes in Computer Science, vol. 3720, 170--181.
|
| |
28
|
|
 |
29
|
Charles X. Ling , Qiang Yang , Jianning Wang , Shichao Zhang, Decision trees with minimal costs, Proceedings of the twenty-first international conference on Machine learning, p.69, July 04-08, 2004, Banff, Alberta, Canada
[doi> 10.1145/1015330.1015369]
|
| |
30
|
Lizotte, D., Madani, O., and Greiner, R. 2003. Budgeted learning of naïve-Bayes classifiers. In Proceedings of the 19th Conference of Uncertainty in Artificial Intelligence (UAI'03).
|
| |
31
|
Mehta, M., Rissanen, J., and Agrawal, R. 1995. MDL-Based decision tree pruning. In Proceedings of the International Conference on Knowledge Discovery and Data Mining (KDD'95), 216--221.
|
| |
32
|
|
| |
33
|
|
| |
34
|
|
 |
35
|
Raymond T. Ng , Alan Wagner , Yu Yin, Iceberg-cube computation with PC clusters, Proceedings of the 2001 ACM SIGMOD international conference on Management of data, p.25-36, May 21-24, 2001, Santa Barbara, California, United States
|
 |
36
|
|
| |
37
|
|
| |
38
|
|
| |
39
|
|
| |
40
|
|
 |
41
|
|
| |
42
|
|
| |
43
|
|
| |
44
|
Seber, G. and Lee, A. 2003. Linear Regression Analysis. John Wiley & Sons.
|
| |
45
|
|
 |
46
|
|
 |
47
|
|
| |
48
|
Turney, P. D. 1995. Cost-Sensitive classification: Empirical evaluation of a hybrid genetic decision tree induction algorithm. J. Artif. Intell. Res. 2, 369--409.
|
| |
49
|
|
 |
50
|
|
| |
51
|
|
| |
52
|
Dong Xin , Jiawei Han , Xiaolei Li , Benjamin W. Wah, Star-cubing: computing iceberg cubes by top-down and bottom-up integration, Proceedings of the 29th international conference on Very large data bases, p.476-487, September 09-12, 2003, Berlin, Germany
|
| |
53
|
|
|