Scaling to very very large corpora for natural language disambiguation
Source: Annual Meeting of the ACL
Proceedings of the 39th Annual Meeting on Association for Computational Linguistics
Toulouse, France
Pages: 26 - 33  
Year of Publication: 2001
Authors
Michele Banko  Microsoft Research, Redmond, WA
Eric Brill  Microsoft Research, Redmond, WA
Publisher
Association for Computational Linguistics  Morristown, NJ, USA
DOI Bookmark: 10.3115/1073012.1073017



ABSTRACT

The amount of readily available on-line text has reached hundreds of billions of words and continues to grow. Yet for most core natural language tasks, algorithms continue to be optimized, tested and compared after training on corpora consisting of only one million words or less. In this paper, we evaluate the performance of different learning methods on a prototypical natural language disambiguation task, confusion set disambiguation, when trained on orders of magnitude more labeled data than has previously been used. We are fortunate that for this particular application, correctly labeled training data is free. Since this will often not be the case, we examine methods for effectively exploiting very large corpora when labeled data comes at a cost.
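
The claim that correctly labeled training data is free for confusion set disambiguation can be illustrated with a minimal sketch. It assumes well-edited text in which each occurrence of a confusion-set word is used correctly, so the word itself serves as the label and its neighbors serve as features. The function name `extract_examples`, the window size, and the positional feature format are illustrative assumptions, not the feature set or learners evaluated in the paper.

```python
import re
from typing import List, Set, Tuple

# One training example: (label, context features)
Example = Tuple[str, Tuple[str, ...]]


def extract_examples(text: str, confusion_set: Set[str], window: int = 2) -> List[Example]:
    """Turn raw text into labeled examples for one confusion set.

    Every occurrence of a confusion-set word yields one example: the word
    is the label, and the surrounding words, tagged with their relative
    position, are the features. No manual annotation is involved.
    """
    tokens = re.findall(r"[a-z']+", text.lower())
    examples: List[Example] = []
    for i, tok in enumerate(tokens):
        if tok not in confusion_set:
            continue
        feats = []
        for offset in range(-window, window + 1):
            j = i + offset
            # Skip the target word itself and positions outside the text.
            if offset != 0 and 0 <= j < len(tokens):
                feats.append(f"{offset:+d}:{tokens[j]}")
        examples.append((tok, tuple(feats)))
    return examples


# Hypothetical sample text, chosen only for illustration.
examples = extract_examples(
    "They went to the store. It was too late to buy two apples.",
    {"to", "too", "two"},
)
for label, feats in examples:
    print(label, feats)
```

Because the labels come from the text itself, the number of examples grows with the size of the corpus at no annotation cost, which is what makes training on very large corpora practical for this task.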

