Generation of synthetic data sets for evaluating the accuracy of knowledge discovery systems
Source: International Conference on Knowledge Discovery and Data Mining
Proceedings of the eleventh ACM SIGKDD international conference on Knowledge discovery in data mining
Chicago, Illinois, USA
POSTER SESSION: Industry/government track poster
Pages: 756 - 762  
Year of Publication: 2005
ISBN:1-59593-135-X
Authors
Daniel R. Jeske  University of California, Riverside, CA
Behrokh Samadi  Lucent Technologies, Holmdel, NJ
Pengyue J. Lin  University of California, Riverside, CA
Lan Ye  University of California, Riverside, CA
Sean Cox  University of California, Riverside, CA
Rui Xiao  University of California, Riverside, CA
Ted Younglove  University of California, Riverside, CA
Minh Ly  University of California, Riverside, CA
Douglas Holt  University of California, Riverside, CA
Ryan Rich  University of California, Riverside, CA
Sponsors
SIGKDD: ACM Special Interest Group on Knowledge Discovery in Data
ACM: Association for Computing Machinery
Publisher
ACM  New York, NY, USA
DOI: http://doi.acm.org/10.1145/1081870.1081969

ABSTRACT

Information Discovery and Analysis Systems (IDAS) are designed to correlate multiple sources of data and use data mining techniques to identify potentially significant events. Application domains for IDAS are numerous and include the emerging area of homeland security.

Developing test cases for an IDAS requires background data sets into which hypothetical future scenarios can be overlaid. The IDAS can then be measured in terms of its false positive and false negative error rates. Obtaining the test data sets can be an obstacle because of privacy issues and the time and cost associated with collecting a diverse set of data sources.

In this paper, we give an overview of the design and architecture of an IDAS Data Set Generator (IDSG) that enables a fast and comprehensive test of an IDAS. The IDSG generates data using statistical and rule-based algorithms together with semantic graphs that represent interdependencies between attributes. A credit card transaction application is used to illustrate the approach.
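
The workflow described in the abstract can be illustrated with a minimal sketch. This is not the IDSG itself: the attribute names, the dependency graph, the distributions, the overlay rule, and the threshold detector below are all hypothetical, invented only to show the idea. Background records are generated attribute by attribute in an order that respects the dependencies, a rule-based scenario is overlaid on some of them, and a detector is then scored by its false positive and false negative rates.

```python
import random

# Hypothetical dependency graph: each attribute maps to the attributes it depends on.
DEPENDS_ON = {
    "income_band": [],
    "merchant_type": ["income_band"],
    "amount": ["merchant_type"],
}

# Assumed conditional distributions, invented for illustration only.
MERCHANT_WEIGHTS = {
    "low": {"grocery": 0.6, "fuel": 0.3, "travel": 0.1},
    "high": {"grocery": 0.3, "fuel": 0.2, "travel": 0.5},
}
MEAN_AMOUNT = {"grocery": 40.0, "fuel": 30.0, "travel": 400.0}


def topological_order(graph):
    """Return the attributes so that every attribute follows its parents."""
    order, seen = [], set()

    def visit(node):
        if node in seen:
            return
        for parent in graph[node]:
            visit(parent)
        seen.add(node)
        order.append(node)

    for node in graph:
        visit(node)
    return order


def generate_record(rng):
    """Generate one background transaction, sampling parents before children."""
    record = {}
    for attr in topological_order(DEPENDS_ON):
        if attr == "income_band":
            record[attr] = rng.choice(["low", "high"])
        elif attr == "merchant_type":
            weights = MERCHANT_WEIGHTS[record["income_band"]]
            record[attr] = rng.choices(list(weights), weights=list(weights.values()))[0]
        elif attr == "amount":
            mean = MEAN_AMOUNT[record["merchant_type"]]
            record[attr] = round(rng.expovariate(1.0 / mean), 2)
    record["is_scenario"] = False
    return record


def overlay_scenario(records, rng, count):
    """Rule-based overlay: turn `count` distinct records into a planted scenario."""
    for record in rng.sample(records, count):
        record.update(merchant_type="travel", amount=5000.0, is_scenario=True)


def error_rates(records, detector):
    """Return (false positive rate, false negative rate) for a detector."""
    fp = fn = pos = neg = 0
    for record in records:
        flagged = bool(detector(record))
        if record["is_scenario"]:
            pos += 1
            fn += int(not flagged)
        else:
            neg += 1
            fp += int(flagged)
    return (fp / neg if neg else 0.0, fn / pos if pos else 0.0)


rng = random.Random(0)
records = [generate_record(rng) for _ in range(1000)]
overlay_scenario(records, rng, 20)
fpr, fnr = error_rates(records, lambda r: r["amount"] > 1000.0)
print(f"false positive rate={fpr:.3f}, false negative rate={fnr:.3f}")
```

Because the ground truth is planted by the generator, both error rates can be computed exactly, which is the advantage of synthetic test data that the abstract points to.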


