ACM Home Page
Please provide us with feedback. Feedback
Training conditional random fields via gradient tree boosting
Full text PdfPdf (183 KB)
Source ACM International Conference Proceeding Series; Vol. 69 archive
Proceedings of the twenty-first international conference on Machine learning table of contents
Banff, Alberta, Canada
Page: 28  
Year of Publication: 2004
ISBN:1-58113-828-5
Authors
Thomas G. Dietterich  Oregon State University, Corvallis, OR
Adam Ashenfelter  Oregon State University, Corvallis, OR
Yaroslav Bulatov  Oregon State University, Corvallis, OR
Publisher
ACM  New York, NY, USA
Bibliometrics
Downloads (6 Weeks): 11,   Downloads (12 Months): 55,   Citation Count: 7
Additional Information:

abstract   references   cited by   collaborative colleagues  

Tools and Actions: Review this Article  
DOI Bookmark: Use this link to bookmark this Article: http://doi.acm.org/10.1145/1015330.1015428
What is a DOI?

ABSTRACT

Conditional Random Fields (CRFs; Lafferty, McCallum, & Pereira, 2001) provide a flexible and powerful model for learning to assign labels to elements of sequences in such applications as part-of-speech tagging, text-to-speech mapping, protein and DNA sequence analysis, and information extraction from web pages. However, existing learning algorithms are slow, particularly in problems with large numbers of potential input features. This paper describes a new method for training CRFs by applying Friedman's (1999) gradient tree boosting method. In tree boosting, the CRF potential functions are represented as weighted sums of regression trees. Regression trees are learned by stage-wise optimizations similar to Adaboost, but with the objective of maximizing the conditional likelihood P(Y|X) of the CRF model. By growing regression trees, interactions among features are introduced only as needed, so although the parameter space is potentially immense, the search algorithm does not explicitly consider the large space. As a result, gradient tree boosting scales linearly in the order of the Markov model and in the order of the feature interactions, rather than exponentially like previous algorithms based on iterative scaling and gradient descent.


REFERENCES

Note: OCR errors may be found in this Reference List extracted from the full text article. ACM has opted to expose the complete List rather than only correct and linked references.

 
1
Besag, J. (1974). Spatial interaction and the statistical analysis of lattice systems. Journal of the Royal Statistical Society B, 36, 192--236.
 
2
Besag, J. (1977). Efficiency of pseudolikelihood estimation for simple Gaussian fields. Biometrika, 64, 616--618.
 
3
Breiman, L., Friedman, J. H., Olshen, R. A., & Stone, C. J. (1984). Classification and regression trees. Wadsworth International Group.
 
4
 
5
Freund, Y., & Schapire, R. E. (1996). Experiments with a new boosting algorithm. ICML-96 (pp. 148--156). Morgan Kaufmann.
 
6
Friedman, J. H. (2001). Greedy function approximation: A gradient boosting machine. Annals of Statistics, 29.
 
7
Geman, D. (1998). Random fields and inverse problems in imaging. In A. Ancona, D. Geman and N. Ikeda (Eds.), École d'Été de probabilités de saintflour xviii, Lecture Notes in Mathematics 1427, 117--196. Berlin: Springer-Verlag.
 
8
Hammersley, J. M., & Clifford, P. (1971). Markov fields on finite graphs and lattices (Technical Report). Unpublished.
 
9
 
10
McCallum, A. (2003). Efficiently inducing features of conditional random fields. UAI-2003 (pp. 403--410). San Francisco, CA: Morgan Kaufmann.
 
11
 
12
Qian, N., & Sejnowski, T. J. (1988). Predicting the secondary structure of globular proteins using neural network models. J. Mol. Bio., 202, 865--884.
 
13
 
14
Rabiner, L. R. (1989). A tutorial on hidden Markov models and selected applications in speech recognition. Proc. IEEE, 77, 257--286.
 
15
Ratnaparkhi, A. (1996). A maximum entropy model for part-of-speech tagging. Proceedings of the conference on empirical methods in natural language processing (pp. 133--142). Somerset, NJ: ACL.
 
16
Sejnowski, T. J., & Rosenberg, C. R. (1987). Parallel networks that learn to pronouce English text. Complex Systems, 1, 145--168.

CITED BY  7
Collaborative Colleagues:
Thomas G. Dietterich: colleagues
Adam Ashenfelter: colleagues
Yaroslav Bulatov: colleagues