Estimating rates of rare events at multiple resolutions
Source
International Conference on Knowledge Discovery and Data Mining
Proceedings of the 13th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining
San Jose, California, USA
SESSION: Research track papers
Pages: 16 - 25  
Year of Publication: 2007
ISBN: 978-1-59593-609-7
Authors
Deepak Agarwal  Yahoo! Research
Andrei Zary Broder  Yahoo! Research
Deepayan Chakrabarti  Yahoo! Research
Dejan Diklic  Yahoo! Research
Vanja Josifovski  Yahoo! Research
Mayssam Sayyadian  Yahoo! Research
Sponsors
ACM: Association for Computing Machinery
SIGKDD: ACM Special Interest Group on Knowledge Discovery in Data
SIGMOD: ACM Special Interest Group on Management of Data
Publisher
ACM  New York, NY, USA
Bibliometrics
Downloads (6 Weeks): 11,   Downloads (12 Months): 89,   Citation Count: 0
DOI: http://doi.acm.org/10.1145/1281192.1281198

ABSTRACT

We consider the problem of estimating occurrence rates of rare events for extremely sparse data, using pre-existing hierarchies to perform inference at multiple resolutions. In particular, we focus on the problem of estimating click rates for (webpage, advertisement) pairs (called impressions) where both the pages and the ads are classified into hierarchies that capture broad contextual information at different levels of granularity. Typically the click rates are low and the coverage of the hierarchies is sparse. To overcome these difficulties we devise a sampling method whereby we analyze a specially chosen sample of pages in the training set, and then estimate click rates using a two-stage model. The first stage imputes the number of (webpage, ad) pairs at all resolutions of the hierarchy to adjust for the sampling bias. The second stage estimates click rates at all resolutions after incorporating correlations among sibling nodes through a tree-structured Markov model. Both models are scalable and suited to large-scale data mining applications. On a real-world dataset consisting of half a billion impressions, we demonstrate that even with 95% negative (non-clicked) events in the training set, our method can effectively discriminate extremely rare events in terms of their click propensity.
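
The idea of estimating rates at multiple resolutions of a hierarchy can be illustrated with a small sketch. This is not the two-stage model or the tree-structured Markov model from the paper; it is a much simpler pseudo-count shrinkage, shown only to make concrete how a sparse node can borrow strength from its parent. The function name `smoothed_rates`, the `prior_strength` parameter, and the toy hierarchy are all assumptions made for this illustration.

```python
from collections import defaultdict


def smoothed_rates(parent, clicks, impressions, prior_strength=100.0):
    """Estimate a click rate at every node of a hierarchy (illustrative only).

    parent:         dict mapping each node to its parent (the root maps to None)
    clicks:         dict of leaf-level click counts
    impressions:    dict of leaf-level impression counts
    prior_strength: hypothetical pseudo-count; larger values pull a node's
                    rate more strongly toward its parent's rate
    """
    agg_clicks = defaultdict(float)
    agg_imps = defaultdict(float)

    # Roll the leaf counts up to every ancestor, so each node carries the
    # totals of its whole subtree.
    for leaf, imps in impressions.items():
        node = leaf
        while node is not None:
            agg_clicks[node] += clicks.get(leaf, 0)
            agg_imps[node] += imps
            node = parent[node]

    rates = {}

    def rate(node):
        if node in rates:
            return rates[node]
        if parent[node] is None:
            # Root: plain ratio (assumes at least one impression overall).
            r = agg_clicks[node] / agg_imps[node]
        else:
            # Non-root: blend the node's own counts with the parent's rate.
            p = rate(parent[node])
            r = (agg_clicks[node] + prior_strength * p) / (
                agg_imps[node] + prior_strength
            )
        rates[node] = r
        return r

    for node in parent:
        rate(node)
    return rates


# Toy hierarchy: one sparse leaf with no observed clicks.
parent = {
    "root": None,
    "sports": "root",
    "sports/golf": "sports",
    "sports/tennis": "sports",
}
clicks = {"sports/golf": 2, "sports/tennis": 0}
impressions = {"sports/golf": 1000, "sports/tennis": 10}
rates = smoothed_rates(parent, clicks, impressions)
```

In this toy example the sparse leaf `sports/tennis` has no observed clicks, yet its estimate is nonzero because it is pulled toward the rate of `sports`. The paper's second stage instead models correlations among sibling nodes, which this sketch does not attempt.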


REFERENCES

Note: OCR errors may be found in this Reference List extracted from the full text article. ACM has opted to expose the complete List rather than only correct and linked references.

 
1. A. Dempster, N. Laird, and D. Rubin. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B, 39(1):1--39, 1977.

3. D. B. Rubin. Multiple Imputation for Nonresponse in Surveys. Wiley, New York, US, 2004.

4. D. K. Agarwal, J. Silander, A. E. Gelfand, R. E. Dewar, and J. Mickelson. Tropical deforestation in Madagascar: Analysis using hierarchical, spatially explicit, Bayesian regression models. Ecological Modelling, 185(1):105--131, 2005.

5. H. C. Huang and N. Cressie. Multiscale graphical modeling in space: applications to command and control. In Proceedings of the Spatial Statistics Workshop, New York: Springer Lecture Notes in Statistics, Springer Verlag Publishers, 2000.

6. J. N. Darroch and D. Ratcliff. Generalized iterative scaling for log-linear models. The Annals of Mathematical Statistics, 43(5):1470--1480, 1972.

7. K. C. Chou, A. S. Willsky, and R. Nikoukhah. Multiscale systems, Kalman filters, and Riccati equations. IEEE Transactions on Automatic Control, 39:479--492, 1994.

8. N. Cressie. Statistics for Spatial Data. John Wiley, New York, 1990.

9. N. V. Chawla, K. W. Bowyer, L. O. Hall, and W. P. Kegelmeyer. SMOTE: Synthetic minority over-sampling technique. Journal of Artificial Intelligence Research, 16:321--357, 2002.

11. S. Hill, D. Agarwal, R. Bell, and C. Volinsky. Building an effective representation for dynamic graphs. Journal of Computational and Graphical Statistics, 15:584--608, 2006.

12. S. Openshaw and P. Taylor. A million or so correlation coefficients. In N. Wrigley (Ed.), Statistical Methods in the Spatial Sciences, London, pages 127--144, 1979.

13. S. Pandey, D. Agarwal, D. Chakrabarti, and V. Josifovski. Bandits for taxonomies: A model based approach. In SIAM International Conference on Data Mining, Minnesota (to appear), 2007.

14. C. Wang, P. Zhang, R. Choi, and M. D. Eredita. Understanding consumers attitude toward advertising. pages 1143--1148, 2002.


REVIEW

Reviewer: Apostolos N Papadopoulos

Statistics are very important in understanding the way users behave. Many Web pages contain ads, and judging the effectiveness of these ads is an important issue. Toward this goal, the authors present a method to estimate the rate of click-through ...
