| Improving classification accuracy using automatically extracted training data |
Source
Proceedings of the 15th ACM SIGKDD international conference on Knowledge discovery and data mining
Paris, France
SESSION: Industrial track papers
Pages 1145-1154
Year of Publication: 2009
ISBN: 978-1-60558-495-9
Authors
Ariel Fuxman, Microsoft Research, Mountain View, CA, USA
Anitha Kannan, Microsoft Research, Mountain View, CA, USA
Andrew B. Goldberg, Univ. of Wisconsin-Madison, Madison, WI, USA
Rakesh Agrawal, Microsoft Research, Mountain View, CA, USA
Panayiotis Tsaparas, Microsoft Research, Mountain View, CA, USA
John Shafer, Microsoft Research, Mountain View, CA, USA
ABSTRACT
Classification is a core task in knowledge discovery and data mining, and there has been substantial research effort in developing sophisticated classification models. In a parallel thread, recent work from the NLP community suggests that for tasks such as natural language disambiguation, even a simple algorithm can outperform a sophisticated one if it is provided with large quantities of high-quality training data. In those applications, training data occurs naturally in text corpora, and high-quality training data sets running into billions of words have reportedly been used. We explore how we can apply the lessons from the NLP community to KDD tasks. Specifically, we investigate how to identify data sources that can yield training data at low cost and study whether the quantity of the automatically extracted training data can compensate for its lower quality. We carry out this investigation for the specific task of inferring whether a search query has commercial intent. We mine toolbar and click logs to extract queries from sites that are predominantly commercial (e.g., Amazon) and non-commercial (e.g., Wikipedia). We compare the accuracy obtained using such training data against the accuracy obtained using manually labeled training data. Our results show that we can achieve large accuracy gains using automatically extracted training data at much lower cost.
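The extraction step described in the abstract can be illustrated with a minimal sketch. This is not the authors' pipeline: the log format (pairs of query and URL), the seed site sets, the majority-vote rule, and the names `site_of` and `label_queries` are all assumptions made for illustration. The idea shown is only that a query associated with a predominantly commercial site gets a noisy "commercial" label, and a query associated with a predominantly non-commercial site gets a noisy "non-commercial" label.

```python
from collections import Counter
from urllib.parse import urlparse

# Hypothetical seed lists; the abstract names only Amazon and Wikipedia as examples.
COMMERCIAL_SITES = {"amazon.com"}
NONCOMMERCIAL_SITES = {"wikipedia.org"}


def site_of(url: str) -> str:
    """Return the last two labels of the host name (a crude site guess)."""
    host = urlparse(url).hostname or ""
    return ".".join(host.split(".")[-2:])


def label_queries(log_rows):
    """Give each query the majority label of the seed sites it was seen with.

    log_rows: iterable of (query, url) pairs, an assumed log format.
    Returns a dict mapping query -> 1 (commercial) or 0 (non-commercial).
    Queries with no seed-site evidence, or with a tied vote, are left out.
    """
    votes = {}
    for query, url in log_rows:
        site = site_of(url)
        if site in COMMERCIAL_SITES:
            label = 1
        elif site in NONCOMMERCIAL_SITES:
            label = 0
        else:
            continue  # site is not in either seed list; no label evidence
        votes.setdefault(query.strip().lower(), Counter())[label] += 1

    labels = {}
    for query, counts in votes.items():
        if counts[1] != counts[0]:
            labels[query] = 1 if counts[1] > counts[0] else 0
    return labels
```

Labels produced this way are noisy by construction, which is the trade-off the abstract studies: whether a large quantity of such labels can compensate for their lower quality compared with manual labels.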