Type-based categorization of relational attributes
Source: Extending Database Technology; Vol. 360
Proceedings of the 12th International Conference on Extending Database Technology: Advances in Database Technology
Saint Petersburg, Russia
SESSION: Research sessions: Database summarization
Pages 84-95  
Year of Publication: 2009
ISBN:978-1-60558-422-5
Authors
Babak Ahmadi  Fraunhofer IAIS
Marios Hadjieleftheriou  AT&T Labs Research
Thomas Seidl  RWTH Aachen University
Divesh Srivastava  AT&T Labs Research
Suresh Venkatasubramanian  University of Utah
Publisher
ACM  New York, NY, USA
DOI: http://doi.acm.org/10.1145/1516360.1516372

ABSTRACT

In this work we concentrate on categorization of relational attributes based on their data type. Assuming that attribute type/characteristics are unknown or unidentifiable, we analyze and compare a variety of type-based signatures for classifying the attributes based on the semantic type of the data contained therein (e.g., router identifiers, social security numbers, email addresses). The signatures can subsequently be used for other applications as well, like clustering and index optimization/compression. This application is useful in cases where very large data collections that are generated in a distributed, ungoverned fashion end up having unknown, incomplete, inconsistent or very complex schemata and schema level meta-data. We concentrate on heuristically generating type-based attribute signatures based on both local and global computation approaches. We show experimentally that by decomposing data into q-grams and then considering signatures based on q-gram distributions, we achieve very good classification accuracy under the assumption that a large sample of the data is available for building the signatures. Then, we turn our attention to cases where a very small sample of the data is available, and hence accurately capturing the q-gram distribution of a given data type is almost impossible. We propose techniques based on dimensionality reduction and soft-clustering that exploit correlations between attributes to improve classification accuracy.
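
The q-gram idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not specify how signatures are built or compared, so the padding characters, the normalized frequency vector used as a signature, the cosine similarity measure, and the function names `qgrams`, `signature`, `cosine`, and `classify` are all assumptions made for illustration. The sample values are invented.

```python
from collections import Counter
import math


def qgrams(value, q=3):
    """Return the overlapping q-grams of a string.

    The value is padded with '#' and '$' (an assumed convention) so that
    prefixes and suffixes are captured and short values still yield grams.
    """
    padded = "#" * (q - 1) + value + "$" * (q - 1)
    return [padded[i:i + q] for i in range(len(padded) - q + 1)]


def signature(values, q=3):
    """Build a normalized q-gram frequency distribution for one attribute."""
    counts = Counter(g for v in values for g in qgrams(v, q))
    total = sum(counts.values())
    return {g: c / total for g, c in counts.items()}


def cosine(a, b):
    """Cosine similarity between two sparse q-gram distributions."""
    dot = sum(w * b.get(g, 0.0) for g, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def classify(values, labeled_signatures, q=3):
    """Assign the label whose signature is most similar to the sample's."""
    sig = signature(values, q)
    return max(labeled_signatures,
               key=lambda label: cosine(sig, labeled_signatures[label]))


# Hypothetical training samples for three semantic types.
training = {
    "email": ["alice@example.com", "bob@test.org", "eve@example.org"],
    "ssn": ["123-45-6789", "987-65-4321", "555-12-3456"],
    "ip": ["10.0.0.1", "192.168.1.1", "172.16.0.254"],
}
labeled = {label: signature(vals) for label, vals in training.items()}

print(classify(["carol@example.net", "dave@mail.org"], labeled))
print(classify(["555-12-9876"], labeled))
```

With large samples, a distribution like this captures the characteristic substrings of a type fairly well. With only a handful of values per attribute the distribution becomes sparse and noisy, which is the small-sample setting the abstract addresses with dimensionality reduction and soft clustering.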

