Semi-supervised approach to rapid and reliable labeling of large data sets
International Conference on Knowledge Discovery and Data Mining
Proceeding of the 14th ACM SIGKDD international conference on Knowledge discovery and data mining
Las Vegas, Nevada, USA
SESSION: Research papers
Pages 641-649
Year of Publication: 2008
ISBN: 978-1-60558-193-4
ABSTRACT
In this paper, we propose a method in which a data set is labeled in a semi-supervised manner with user-specified guarantees about the quality of the labeling. In our scheme, we assume that for each class some heuristics are available, each of which can identify instances of one particular class. The heuristics are assumed to perform reasonably well, but they need not cover all instances of the class, nor do they need to be perfectly reliable. We further assume that we have an infallible expert who is willing to manually label a few instances. The aim of the algorithm is to exploit the cluster structure of the problem, the predictions of the imperfect heuristics, and the limited perfect labels provided by the expert to classify (label) the instances of the data set with a guaranteed precision (specified by the user) with regard to each class. The specified precision is not always attainable, so the algorithm is allowed to classify some instances as dontknow. The algorithm is evaluated by the number of instances labeled by the expert, the number of dontknow instances (global coverage), and the achieved quality of the labeling. On the KDD Cup Network Intrusion data set containing 500,000 instances, we managed to label 96.6% of the instances while guaranteeing a nominal precision of 90% (with 95% confidence) by having the expert label 630 instances; by having the expert label 1200 instances, we managed to guarantee 95% nominal precision while labeling 96.4% of the data. We also provide a case study of applying our scheme to label the network traffic collected at a large campus network.
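To make the idea of a precision guarantee "with 95% confidence" more concrete, the following is a minimal sketch and not the algorithm from the paper. It assumes that, for one group of instances that a heuristic assigns to a class, the expert has checked a small random sample, and it uses a one-sided Wilson score lower bound on the binomial proportion to decide whether the heuristic's label can be accepted or whether the group should be marked dontknow. The function names, the choice of the Wilson bound, and the example counts are all assumptions made for illustration.

```python
import math


def wilson_lower_bound(successes: int, trials: int, z: float = 1.645) -> float:
    """One-sided Wilson score lower bound for a binomial proportion.

    z = 1.645 is the normal quantile for a one-sided 95% confidence level
    (an assumption; the paper may use a different interval).
    """
    if trials == 0:
        return 0.0  # no expert labels yet, so nothing can be guaranteed
    p_hat = successes / trials
    denom = 1.0 + z * z / trials
    centre = p_hat + z * z / (2.0 * trials)
    margin = z * math.sqrt(
        p_hat * (1.0 - p_hat) / trials + z * z / (4.0 * trials * trials)
    )
    return (centre - margin) / denom


def decide(heuristic_label: str, successes: int, trials: int,
           target_precision: float = 0.90, z: float = 1.645) -> str:
    """Accept the heuristic's label only if the lower bound on its
    precision, estimated from expert-checked samples, meets the target."""
    if wilson_lower_bound(successes, trials, z) >= target_precision:
        return heuristic_label
    return "dontknow"


if __name__ == "__main__":
    # Hypothetical counts: the expert confirmed every sampled instance.
    print(decide("attack", 50, 50))  # 50 confirmations clear the 90% target
    print(decide("attack", 20, 20))  # 20 confirmations are not yet enough
```

In this sketch, even a perfect agreement rate does not justify a label until enough expert labels have been collected, which illustrates the trade-off in the abstract between the number of expert-labeled instances and the number of instances left as dontknow.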