DiMaC: a disguised missing data cleaning tool
Source
International Conference on Knowledge Discovery and Data Mining
Proceeding of the 14th ACM SIGKDD international conference on Knowledge discovery and data mining
Las Vegas, Nevada, USA
DEMONSTRATION SESSION: Demonstrations
Pages 1077-1080
Year of Publication: 2008
ISBN:978-1-60558-193-4
Authors
Ming Hua  Simon Fraser University, Burnaby, BC, Canada
Jian Pei  Simon Fraser University, Burnaby, BC, Canada
Sponsors
ACM: Association for Computing Machinery
SIGKDD: ACM Special Interest Group on Knowledge Discovery in Data
SIGMOD: ACM Special Interest Group on Management of Data
Publisher
ACM  New York, NY, USA
DOI: http://doi.acm.org/10.1145/1401890.1402023

ABSTRACT

In some applications, such as filling in a customer information form on the web, missing values may not be explicitly represented as such but instead appear as potentially valid data values. Such missing values are known as disguised missing data, and they may severely impair the quality of data analysis. The few previous studies on cleaning disguised missing data rely heavily on domain background knowledge in specific applications and may not work well when the disguise values are inliers.
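
To make the notion of disguised missing data concrete, the following is a minimal illustrative sketch, not the method used by DiMaC. It assumes a hypothetical web form whose state drop-down defaults to "Alabama" and whose birth-date field defaults to "1970-01-01"; the records, field names, and helper functions are invented for illustration. It shows that a check for explicit missing values finds nothing, while the default values are heavily over-represented.

```python
from collections import Counter

# Hypothetical form submissions (assumed data, not from the paper).
# Users who skipped a field left the form's default value in place.
records = [
    {"state": "Alabama", "birth_date": "1970-01-01"},
    {"state": "Nevada", "birth_date": "1982-06-14"},
    {"state": "Alabama", "birth_date": "1970-01-01"},
    {"state": "Oregon", "birth_date": "1975-11-02"},
    {"state": "Alabama", "birth_date": "1991-03-30"},
    {"state": "Alabama", "birth_date": "1970-01-01"},
]


def explicit_missing(rows, column):
    """Count values that are explicitly missing (None or empty string)."""
    return sum(1 for row in rows if row[column] in (None, ""))


def most_common_share(rows, column):
    """Return the most frequent value in a column and its share of all rows."""
    counts = Counter(row[column] for row in rows)
    value, count = counts.most_common(1)[0]
    return value, count / len(rows)


for column in ("state", "birth_date"):
    value, share = most_common_share(records, column)
    print(column, "explicitly missing:", explicit_missing(records, column),
          "| most common value:", value, "| share:", round(share, 2))
```

A high share alone does not prove that a value is a disguise, since a value can be legitimately popular. This is also why such values are hard to catch: they are valid inliers rather than outliers.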

Recently, we studied the problem of cleaning disguised missing data systematically and proposed an effective heuristic approach [2]. In this paper, we present a demonstration of DiMaC, a Disguised Missing Data Cleaning tool that can find the frequently used disguise values in data sets without any domain background knowledge. In this demo, we will show (1) the critical techniques for finding suspicious disguise values; (2) the architecture and user interface of the DiMaC system; (3) an empirical case study on both real and synthetic data sets, which verifies the effectiveness and efficiency of the techniques; and (4) some challenges arising from real applications and several directions for future work.


REFERENCES

Note: OCR errors may be found in this Reference List extracted from the full text article. ACM has opted to expose the complete List rather than only correct and linked references.

 
1
D. DesJardins. Outliers, inliers, and just plain liars - new graphical EDA+ (EDA Plus) techniques for understanding data. In Proc. SAS User's Group International Conference (SUGI26), Long Beach, CA, 2001.
2
 
3
B. Kégl and L. Wang. Boosting on manifolds: Adaptive regularization of base classifiers. In Lawrence K. Saul, Yair Weiss, and Leon Bottou, editors, Advances in Neural Information Processing Systems 17, pages 665--672, Cambridge, MA, 2005. MIT Press.
 
4
 
5
 
6
R. Pearson. Mining imperfect data: Dealing with contamination and incomplete records. In Proc. 2005 SIAM Int. Conf. Data Mining, Newport Beach, CA, April 2005.
 
7
R. K. Pearson. Data mining in the face of contaminated and incomplete records. In Proc. 2002 SIAM Int. Conf. Data Mining, Arlington, VA, April 2002.
8
 
9
G. Webb. Further experimental evidence against the utility of Occam's razor. The Journal of Artificial Intelligence Research, 4:397--417, 1996.