ACM Home Page
Please provide us with feedback. Feedback
Bellwether analysis: Searching for cost-effective query-defined predictors in large databases
Full text PdfPdf (798 KB)
Source
ACM Transactions on Knowledge Discovery from Data (TKDD) archive
Volume 3 ,  Issue 1  (March 2009) table of contents
Article No. 5  
Year of Publication: 2009
ISSN:1556-4681
Authors
Bee-Chung Chen  Yahoo! Research, Santa Clara, CA
Raghu Ramakrishnan  Yahoo! Research, Santa Clara, CA
Jude W. Shavlik  University of Wisconsin—Madison, WI
Pradeep Tamma  Microsoft, Redmond, WA
Publisher
ACM  New York, NY, USA
Bibliometrics
Downloads (6 Weeks): 30,   Downloads (12 Months): 190,   Citation Count: 0
Additional Information:

abstract   references   index terms   collaborative colleagues  

Tools and Actions: Request Permissions Request Permissions    Review this Article  
DOI Bookmark: Use this link to bookmark this Article: http://doi.acm.org/10.1145/1497577.1497582
What is a DOI?

ABSTRACT

How to mine massive datasets is a challenging problem with great potential value. Motivated by this challenge, much effort has concentrated on developing scalable versions of machine learning algorithms. However, the cost of mining large datasets is not just computational; preparing the datasets into the “right form” so that learning algorithms can be applied is usually costly, due to the human labor that is typically required and a large number of choices in data preparation, which include selecting different subsets of data and aggregating data at different granularities. We make the key observation that, for a number of practically motivated problems, these choices can be defined using database queries and analyzed in an automatic and systematic manner. Specifically, we propose a new class of data-mining problem, called bellwether analysis, in which the goal is to find a few query-defined predictors (e.g., first week sales of Peoria, IL of an item) that can be used to accurately predict the result of a target query (e.g., first year worldwide sales of the item) from a large number of queries that define candidate predictors. To make a prediction for a new item, the data needed to generate such predictors has to be collected (e.g., selling the new item in Peoria, IL for a week and collecting the sales data). A useful predictor is one that has high prediction accuracy and a low data-collection cost. We call such a cost-effective predictor a bellwether.

This article introduces bellwether analysis, which integrates database query processing and predictive modeling into a single framework, and provides scalable algorithms for large datasets that cannot fit in main memory. Through a series of extensive experiments, we show that bellwethers do exist in real-world databases, and that our computation techniques achieve good efficiency on large datasets.


REFERENCES

Note: OCR errors may be found in this Reference List extracted from the full text article. ACM has opted to expose the complete List rather than only correct and linked references.

 
1
 
2
 
3
Benjamini, Y. and Yekutieli, D. 1995. Controlling the false discovery rate: A practical and powerful approach to multiple testing. J. Royal Statist. Soc. B 57, 1, 289--300.
 
4
Benjamini, Y. and Yekutieli, D. 2001. The control of the false discovery rate in multiple testing under dependency. Ann. Statist. 29, 4, 1165--1188.
5
 
6
Breiman, L., Friedman, J., Stone, C. J., and Olshen, R. A. 1984. Classification and Regression Trees. Chapman & Hall.
 
7
 
8
 
9
 
10
 
11
 
12
 
13
Chvatal, V. 1979. A greedy heuristic for the set-covering problem. Math. Oper. Res. 4, 3, 233--235.
 
14
 
15
 
16
 
17
 
18
Efron, B. 2005. Local false discovery rates. Tech. rep. 2005-20B/234, Department of Statistics, Stanford Univerity, California.
 
19
 
20
 
21
 
22
 
23
 
24
Hamilton, J. 1994. Time Series Analysis. Princeton University Press.
25
 
26
 
27
Kapoor, A. and Greiner, R. 2005. Learning and classifying under hard budgets. In Proceedings of the 16th European Conference on Machine Learning (ECML'05). Lecture Notes in Computer Science, vol. 3720, 170--181.
 
28
29
 
30
Lizotte, D., Madani, O., and Greiner, R. 2003. Budgeted learning of naïve-Bayes classifiers. In Proceedings of the 19th Conference of Uncertainty in Artificial Intelligence (UAI'03).
 
31
Mehta, M., Rissanen, J., and Agrawal, R. 1995. MDL-Based decision tree pruning. In Proceedings of the International Conference on Knowledge Discovery and Data Mining (KDD'95), 216--221.
 
32
 
33
 
34
35
36
 
37
 
38
 
39
 
40
41
 
42
 
43
 
44
Seber, G. and Lee, A. 2003. Linear Regression Analysis. John Wiley & Sons.
 
45
46
47
 
48
Turney, P. D. 1995. Cost-Sensitive classification: Empirical evaluation of a hybrid genetic decision tree induction algorithm. J. Artif. Intell. Res. 2, 369--409.
 
49
50
 
51
 
52
 
53

Collaborative Colleagues:
Bee-Chung Chen: colleagues
Raghu Ramakrishnan: colleagues
Jude W. Shavlik: colleagues
Pradeep Tamma: colleagues