Discovering topical structures of databases
Source
International Conference on Management of Data
Proceedings of the 2008 ACM SIGMOD International Conference on Management of Data
Vancouver, Canada
Session: Research Session 21: Provenance, Integration and Extraction
Pages: 1019-1030
Year of Publication: 2008
ISBN: 978-1-60558-102-6
Authors
Wensheng Wu, IBM Almaden Research Center, San Jose, CA, USA
Berthold Reinwald, IBM Almaden Research Center, San Jose, CA, USA
Yannis Sismanis, IBM Almaden Research Center, San Jose, CA, USA
Rajesh Manjrekar, IBM Almaden Research Center, San Jose, CA, USA
Sponsors
ACM: Association for Computing Machinery
SIGMOD: ACM Special Interest Group on Management of Data
Publisher
ACM, New York, NY, USA
DOI: http://doi.acm.org/10.1145/1376616.1376717

ABSTRACT

The increasing complexity of enterprise databases and the prevalent lack of documentation make both understanding and integrating these databases costly. Existing solutions address the mining of keys and foreign keys but pay little attention to higher-level database structures. In this paper, we consider the problem of discovering topical structures of databases to support semantic browsing and large-scale data integration. We describe iDisc, a novel discovery system based on a multi-strategy learning framework. iDisc exploits varied evidence in the database schema and instance values to construct multiple kinds of database representations. It employs a set of base clusterers to discover preliminary topical clusters of tables from these representations, and then aggregates them into final clusters via meta-clustering. To further improve accuracy, we extend iDisc with novel multiple-level aggregation and clusterer boosting techniques. We introduce a new measure of table importance and propose an approach to discovering cluster representatives to facilitate semantic browsing. An important feature of our framework is that it is highly extensible: additional database representations and base clusterers can be easily incorporated. We have extensively evaluated iDisc using large real-world databases, and the results show that it discovers topical structures with a high degree of accuracy.
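
The idea of aggregating preliminary clusters from several base clusterers into final clusters can be illustrated with a small sketch. The abstract does not specify how iDisc performs meta-clustering, so the code below swaps in a common ensemble technique, co-association (evidence accumulation) voting, purely for illustration. The table names, the three base clusterings, the 0.5 threshold, and the function names co_association and meta_cluster are all hypothetical assumptions, not part of the paper.

```python
from itertools import combinations


def co_association(clusterings, tables):
    """Fraction of base clusterings that place each pair of tables together."""
    counts = {pair: 0 for pair in combinations(sorted(tables), 2)}
    for labels in clusterings:
        for a, b in counts:
            if labels[a] == labels[b]:
                counts[(a, b)] += 1
    return {pair: n / len(clusterings) for pair, n in counts.items()}


def meta_cluster(clusterings, tables, threshold=0.5):
    """Merge tables whose co-association score exceeds the threshold.

    Illustrative stand-in for meta-clustering; not the iDisc algorithm.
    """
    parent = {t: t for t in tables}

    def find(t):
        # Union-find lookup with path halving.
        while parent[t] != t:
            parent[t] = parent[parent[t]]
            t = parent[t]
        return t

    for (a, b), score in co_association(clusterings, tables).items():
        if score > threshold:
            parent[find(a)] = find(b)

    groups = {}
    for t in tables:
        groups.setdefault(find(t), set()).add(t)
    return sorted(sorted(g) for g in groups.values())


# Hypothetical tables and base clusterings (table -> preliminary cluster id),
# e.g. one from table names, one from instance values, one from key links.
tables = ["customer", "address", "order", "order_item", "product"]
name_based = {"customer": 0, "address": 0, "order": 1, "order_item": 1, "product": 1}
value_based = {"customer": 0, "address": 0, "order": 1, "order_item": 1, "product": 2}
link_based = {"customer": 0, "address": 1, "order": 1, "order_item": 1, "product": 2}

clusters = meta_cluster([name_based, value_based, link_based], tables)
print(clusters)
# [['address', 'customer'], ['order', 'order_item'], ['product']]
```

In this toy input, two of the three base clusterers agree that customer and address belong together, and all three agree on order and order_item, so those pairs are merged; product is grouped with order by only one clusterer and stays on its own. A majority threshold is just one possible aggregation rule.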



