Improving classification accuracy using automatically extracted training data
Source
International Conference on Knowledge Discovery and Data Mining
Proceedings of the 15th ACM SIGKDD international conference on Knowledge discovery and data mining
Paris, France
SESSION: Industrial track papers
Pages 1145-1154
Year of Publication: 2009
ISBN: 978-1-60558-495-9
Authors
Ariel Fuxman  Microsoft Research, Mountain View, CA, USA
Anitha Kannan  Microsoft Research, Mountain View, CA, USA
Andrew B. Goldberg  Univ. of Wisconsin-Madison, Madison, WI, USA
Rakesh Agrawal  Microsoft Research, Mountain View, CA, USA
Panayiotis Tsaparas  Microsoft Research, Mountain View, CA, USA
John Shafer  Microsoft Research, Mountain View, CA, USA
Sponsors
ACM: Association for Computing Machinery
SIGKDD: ACM Special Interest Group on Knowledge Discovery in Data
SIGMOD: ACM Special Interest Group on Management of Data
Publisher
ACM  New York, NY, USA
DOI: http://doi.acm.org/10.1145/1557019.1557143

ABSTRACT

Classification is a core task in knowledge discovery and data mining, and substantial research effort has gone into developing sophisticated classification models. In a parallel thread, recent work from the NLP community suggests that for tasks such as natural language disambiguation, even a simple algorithm can outperform a sophisticated one if it is provided with large quantities of high-quality training data. In those applications, training data occurs naturally in text corpora, and high-quality training sets running into billions of words have reportedly been used.

We explore how to apply these lessons from the NLP community to KDD tasks. Specifically, we investigate how to identify data sources that can yield training data at low cost, and we study whether the quantity of automatically extracted training data can compensate for its lower quality. We carry out this investigation for the specific task of inferring whether a search query has commercial intent. We mine toolbar and click logs to extract queries from sites that are predominantly commercial (e.g., Amazon) and non-commercial (e.g., Wikipedia). We compare the accuracy obtained using such training data against that obtained using manually labeled training data. Our results show that automatically extracted training data can yield large accuracy gains at much lower cost.
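
The labeling idea in the abstract can be illustrated with a minimal sketch. This is not the authors' pipeline: the log format (pairs of query and clicked URL), the seed site lists, and the names `site_of`, `matches`, and `auto_label` are all assumptions made for illustration. A query is labeled commercial or non-commercial according to the site that the click landed on, and clicks on unlisted sites are skipped.

```python
from urllib.parse import urlparse

# Hypothetical seed lists; the abstract names only Amazon and Wikipedia as examples.
COMMERCIAL_SITES = {"amazon.com"}
NONCOMMERCIAL_SITES = {"wikipedia.org"}


def site_of(url):
    """Return the lowercased hostname of a URL, without a leading 'www.'."""
    host = (urlparse(url).hostname or "").lower()
    return host[4:] if host.startswith("www.") else host


def matches(host, sites):
    """True if host equals a listed site or is a subdomain of one."""
    return any(host == s or host.endswith("." + s) for s in sites)


def auto_label(click_log):
    """Turn (query, clicked_url) pairs into (query, label) training examples.

    Label 1 means commercial, 0 means non-commercial.
    Clicks on sites in neither list produce no example.
    """
    examples = []
    for query, url in click_log:
        host = site_of(url)
        if matches(host, COMMERCIAL_SITES):
            examples.append((query, 1))
        elif matches(host, NONCOMMERCIAL_SITES):
            examples.append((query, 0))
    return examples


# Made-up log entries, for illustration only.
log = [
    ("digital camera", "https://www.amazon.com/s?k=digital+camera"),
    ("french revolution", "https://en.wikipedia.org/wiki/French_Revolution"),
    ("weather today", "https://example.com/weather"),
]
print(auto_label(log))
```

Labels produced this way are noisy, since not every query that leads to a commercial site has commercial intent; that is the lower quality the abstract weighs against quantity.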


