Mining in Large Noisy Domains
Source
Journal of Data and Information Quality (JDIQ)
Volume 1, Issue 2 (September 2009)
Article No. 8
Year of Publication: 2009
ISSN: 1936-1955
Authors
Manoranjan Dash, Nanyang Technological University, Singapore
Ayush Singhania, Nanyang Technological University, Singapore
Publisher
ACM, New York, NY, USA
DOI: http://doi.acm.org/10.1145/1577840.1577843

ABSTRACT

In this article we address the problem of mining efficiently in large, noisy data. We propose an efficient sampling algorithm, Concise, for such data. Concise is far superior to Simple Random Sampling (SRS) at selecting a representative sample, and its gain over SRS is greatest when the data is very large and noisy. The comparison is made in terms of the impact of each method on subsequent data mining tasks, specifically classification, clustering, and association rule mining. We also compared Concise with a few existing noise removal algorithms followed by SRS. Although the accuracy of the mining results is similar, Concise takes far less time than the existing algorithms because it has linear time complexity.

