Feature selection, L1 vs. L2 regularization, and rotational invariance
Source: ACM International Conference Proceeding Series; Vol. 69
Proceedings of the twenty-first international conference on Machine learning
Banff, Alberta, Canada
Page: 78  
Year of Publication: 2004
ISBN:1-58113-828-5
Author
Andrew Y. Ng  Stanford University, Stanford, CA
Publisher
ACM  New York, NY, USA
Bibliometrics
Downloads (6 Weeks): 26,   Downloads (12 Months): 178,   Citation Count: 21
DOI: http://doi.acm.org/10.1145/1015330.1015435

ABSTRACT

We consider supervised learning in the presence of very many irrelevant features, and study two different regularization methods for preventing overfitting. Focusing on logistic regression, we show that using L1 regularization of the parameters, the sample complexity (i.e., the number of training examples required to learn "well") grows only logarithmically in the number of irrelevant features. This logarithmic rate matches the best known bounds for feature selection, and indicates that L1 regularized logistic regression can be effective even if the number of irrelevant features is exponential in the number of training examples. We also give a lower bound showing that any rotationally invariant algorithm (including logistic regression with L2 regularization, SVMs, and neural networks trained by backpropagation) has a worst-case sample complexity that grows at least linearly in the number of irrelevant features.
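
The contrast between the two regularizers can be illustrated with a small sketch. It is not taken from the paper: the toy data (one relevant feature, the rest noise), the penalty strength, the step size, and the use of proximal gradient descent for the L1 case are all assumptions made for illustration. It only shows that an L1 penalty tends to set the weights of irrelevant features exactly to zero while an L2 penalty merely shrinks them; it does not reproduce the sample complexity bounds.

```python
import math
import random


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def make_data(n, d, rng):
    # Hypothetical toy data: only feature 0 determines the label;
    # the remaining d - 1 features are pure noise.
    X = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(n)]
    y = [1.0 if row[0] > 0 else 0.0 for row in X]
    return X, y


def gradient(X, y, w):
    # Gradient of the average logistic loss with labels in {0, 1}.
    n, d = len(X), len(w)
    g = [0.0] * d
    for row, label in zip(X, y):
        err = sigmoid(sum(wj * xj for wj, xj in zip(w, row))) - label
        for j in range(d):
            g[j] += err * row[j] / n
    return g


def soft_threshold(v, t):
    # Proximal operator of t * |v|; maps small values exactly to zero.
    if v > t:
        return v - t
    if v < -t:
        return v + t
    return 0.0


def fit(X, y, penalty, lam=0.15, step=0.5, iters=200):
    # Assumed hyperparameters; chosen for illustration only.
    w = [0.0] * len(X[0])
    for _ in range(iters):
        g = gradient(X, y, w)
        if penalty == "l1":
            # Proximal gradient step for the penalty lam * ||w||_1.
            w = [soft_threshold(wj - step * gj, step * lam)
                 for wj, gj in zip(w, g)]
        else:
            # Plain gradient step for the penalty (lam / 2) * ||w||_2^2.
            w = [wj - step * (gj + lam * wj) for wj, gj in zip(w, g)]
    return w


rng = random.Random(0)
X, y = make_data(n=40, d=60, rng=rng)
w_l1 = fit(X, y, "l1")
w_l2 = fit(X, y, "l2")
zeros_l1 = sum(1 for v in w_l1 if v == 0.0)
zeros_l2 = sum(1 for v in w_l2 if v == 0.0)
print("exact zeros with L1:", zeros_l1, "of", len(w_l1))
print("exact zeros with L2:", zeros_l2, "of", len(w_l2))
```

Here there are more features (60) than training examples (40), which is the regime the abstract is concerned with. Comparing held-out accuracy as the number of noise features grows would be the natural next step, but that experiment is not sketched here.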


