A preprocessing scheme for high-cardinality categorical attributes in classification and prediction problems
Source: ACM SIGKDD Explorations Newsletter
Volume 3, Issue 1 (July 2001)
Column: Contributed articles
Pages: 27-32
Year of Publication: 2001
ISSN: 1931-0145
Author
Daniele Micci-Barreca, ClearCommerce Corporation, Austin, TX
Publisher
ACM, New York, NY, USA
DOI: http://doi.acm.org/10.1145/507533.507538

ABSTRACT

Categorical data fields characterized by a large number of distinct values represent a serious challenge for many classification and regression algorithms that require numerical inputs. On the other hand, these types of data fields are quite common in real-world data mining applications and often contain potentially relevant information that is difficult to represent for modeling purposes.

This paper presents a simple preprocessing scheme for high-cardinality categorical data that allows this class of attributes to be used in predictive models such as neural networks and linear and logistic regression. The proposed method is based on a well-established statistical method (empirical Bayes) that is straightforward to implement as an in-database procedure. Furthermore, for categorical attributes with an inherent hierarchical structure, like ZIP codes, the preprocessing scheme can directly leverage the hierarchy by blending statistics at the various levels of aggregation.

While the statistical methods discussed in this paper were first introduced in the mid-1950s, the use of these methods as a preprocessing step for complex models, like neural networks, has not been previously discussed in any literature.
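
The general idea in the abstract, replacing each category value with a numeric estimate that blends the category's own target statistic with a broader one, can be sketched as follows. This is an illustrative sketch only, not the paper's procedure: the abstract does not state a weighting formula, so the shrinkage weight n / (n + m), the smoothing parameter m, and the function names fit_shrunken_means and encode are all assumptions made for this example.

```python
from collections import defaultdict


def fit_shrunken_means(categories, targets, m=5.0):
    """Map each category to a blend of its own target mean and the global mean.

    ASSUMPTION: the weight n / (n + m) is a common shrinkage choice picked
    for illustration; it is not taken from the abstract.
    """
    if len(categories) != len(targets):
        raise ValueError("categories and targets must have the same length")
    if not targets:
        raise ValueError("at least one training row is required")

    # Global target mean, used as the prior estimate.
    prior = sum(targets) / len(targets)

    # Per-category target sums and row counts.
    sums = defaultdict(float)
    counts = defaultdict(int)
    for category, y in zip(categories, targets):
        sums[category] += y
        counts[category] += 1

    encoding = {}
    for category, n in counts.items():
        # Categories with many rows trust their own mean; rare ones
        # are pulled toward the prior.
        weight = n / (n + m)
        encoding[category] = weight * (sums[category] / n) + (1.0 - weight) * prior
    return encoding, prior


def encode(values, encoding, prior):
    """Replace category values with their estimates; unseen values get the prior."""
    return [encoding.get(value, prior) for value in values]


if __name__ == "__main__":
    zip_codes = ["78701", "78701", "78701", "78701", "10001"]
    labels = [1, 1, 1, 1, 0]
    enc, prior = fit_shrunken_means(zip_codes, labels, m=1.0)
    print(encode(["78701", "10001", "99999"], enc, prior))
```

For a hierarchical attribute such as ZIP codes, one plausible extension under the same assumptions is to use the estimate from the next coarser level (for example, a ZIP prefix) as the prior for each fine-grained value instead of the global mean.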

