Get another label? Improving data quality and data mining using multiple, noisy labelers
Source
International Conference on Knowledge Discovery and Data Mining
Proceedings of the 14th ACM SIGKDD international conference on Knowledge discovery and data mining
Las Vegas, Nevada, USA
SESSION: Research papers
Pages 614-622
Year of Publication: 2008
ISBN: 978-1-60558-193-4
Authors
Victor S. Sheng, Leonard N. Stern School of Business, New York University, New York, NY, USA
Foster Provost, Leonard N. Stern School of Business, New York University, New York, NY, USA
Panagiotis G. Ipeirotis, Leonard N. Stern School of Business, New York University, New York, NY, USA
Sponsors
ACM: Association for Computing Machinery
SIGKDD: ACM Special Interest Group on Knowledge Discovery in Data
SIGMOD: ACM Special Interest Group on Management of Data
Publisher
ACM, New York, NY, USA
DOI: http://doi.acm.org/10.1145/1401890.1401965

ABSTRACT

This paper addresses the repeated acquisition of labels for data items when the labeling is imperfect. We examine the improvement (or lack thereof) in data quality via repeated labeling, and focus especially on the improvement of training labels for supervised induction. With the outsourcing of small tasks becoming easier, for example via Rent-A-Coder or Amazon's Mechanical Turk, it is often possible to obtain less-than-expert labeling at low cost. With low-cost labeling, preparing the unlabeled part of the data can become considerably more expensive than labeling. We present repeated-labeling strategies of increasing complexity, and show several main results. (i) Repeated labeling can improve label quality and model quality, but not always. (ii) When labels are noisy, repeated labeling can be preferable to single labeling even in the traditional setting where labels are not particularly cheap. (iii) As soon as processing the unlabeled data is not free, even the simple strategy of labeling everything multiple times can give considerable advantage. (iv) Repeatedly labeling a carefully chosen set of points is generally preferable, and we present a robust technique that combines different notions of uncertainty to select data points for which quality should be improved. The bottom line: the results show clearly that when labeling is not perfect, selective acquisition of multiple labels is a strategy that data miners should have in their repertoire; for certain label-quality/cost regimes, the benefit is substantial.
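
Result (i) above can be illustrated with a small calculation. The sketch below is not taken from the paper's text as shown here; it assumes a binary labeling task, labelers who err independently, and a single shared per-label accuracy p, and it combines the repeated labels by simple majority vote. The function name majority_vote_quality is hypothetical.

```python
from math import comb


def majority_vote_quality(p: float, n: int) -> float:
    """Probability that the majority of n labels is correct.

    Assumptions (illustrative only): binary labels, each label is
    correct independently with probability p, and n is odd so that
    a majority always exists.
    """
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be a probability in [0, 1]")
    if n < 1 or n % 2 == 0:
        raise ValueError("n must be a positive odd integer")
    k_min = n // 2 + 1  # smallest number of correct labels forming a majority
    # Binomial tail: sum over all outcomes where correct labels are the majority.
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(k_min, n + 1))


if __name__ == "__main__":
    for p in (0.4, 0.5, 0.7):
        row = ", ".join(f"n={n}: {majority_vote_quality(p, n):.3f}" for n in (1, 3, 5))
        print(f"p={p}: {row}")
```

Under these assumptions, three labels at p = 0.7 give a majority-vote quality of 0.784, better than a single label; at p = 0.5 extra labels change nothing; and at p = 0.4 three labels give 0.352, worse than a single label. This matches the "but not always" qualification in result (i).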


