ABSTRACT
In this work we concentrate on the categorization of relational attributes by data type. Assuming that attribute types and characteristics are unknown or unidentifiable, we analyze and compare a variety of type-based signatures for classifying attributes by the semantic type of the data they contain (e.g., router identifiers, social security numbers, email addresses). The signatures can subsequently be used for other applications as well, such as clustering and index optimization or compression. Such classification is useful when very large data collections, generated in a distributed, ungoverned fashion, end up with unknown, incomplete, inconsistent, or very complex schemata and schema-level metadata. We concentrate on heuristically generating type-based attribute signatures using both local and global computation approaches. We show experimentally that decomposing data into q-grams and building signatures from the q-gram distributions yields very good classification accuracy, provided that a large sample of the data is available for building the signatures. We then turn to cases where only a very small sample of the data is available, so that accurately capturing the q-gram distribution of a given data type is almost impossible. For these cases, we propose techniques based on dimensionality reduction and soft clustering that exploit correlations between attributes to improve classification accuracy.
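To make the q-gram idea concrete, the following is a minimal sketch, not the authors' implementation. It assumes a signature is the normalized frequency distribution of character q-grams over an attribute's values, and that an unlabeled attribute is assigned the label of the most similar known signature under cosine similarity. The function names, the choice of cosine similarity, and the toy data are all illustrative assumptions.

```python
from collections import Counter
from math import sqrt


def qgram_signature(values, q=2):
    """Normalized q-gram frequency distribution over all values of one attribute."""
    counts = Counter()
    for value in values:
        # Slide a window of length q over the string (no padding; an assumption).
        for i in range(len(value) - q + 1):
            counts[value[i:i + q]] += 1
    total = sum(counts.values())
    return {gram: c / total for gram, c in counts.items()} if total else {}


def cosine(a, b):
    """Cosine similarity between two sparse distributions stored as dicts."""
    dot = sum(w * b.get(gram, 0.0) for gram, w in a.items())
    norm_a = sqrt(sum(w * w for w in a.values()))
    norm_b = sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def classify(column, labeled_signatures, q=2):
    """Return the label whose signature is most similar to the column's signature."""
    sig = qgram_signature(column, q)
    return max(labeled_signatures, key=lambda label: cosine(sig, labeled_signatures[label]))


# Hypothetical toy training data: one signature per known semantic type.
signatures = {
    "email": qgram_signature(["alice@example.com", "bob@example.org"]),
    "phone": qgram_signature(["555-0100", "555-0199"]),
}

email_label = classify(["carol@example.net"], signatures)
phone_label = classify(["555-0142"], signatures)
```

With samples this small the signatures are only rough estimates of each type's q-gram distribution, which is the small-sample regime the abstract singles out as hard.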