Discovering topical structures of databases
International Conference on Management of Data
Proceedings of the 2008 ACM SIGMOD international conference on Management of data
Vancouver, Canada
SESSION: Research Session 21: Provenance, Integration and Extraction
Pages 1019-1030
Year of Publication: 2008
ISBN: 978-1-60558-102-6
Authors
Wensheng Wu, IBM Almaden Research Center, San Jose, CA, USA
Berthold Reinwald, IBM Almaden Research Center, San Jose, CA, USA
Yannis Sismanis, IBM Almaden Research Center, San Jose, CA, USA
Rajesh Manjrekar, IBM Almaden Research Center, San Jose, CA, USA
ABSTRACT
The increasing complexity of enterprise databases and the prevalent lack of documentation incur significant cost in both understanding and integrating the databases. Existing solutions addressed mining for keys and foreign keys, but paid little attention to higher-level structures of databases. In this paper, we consider the problem of discovering topical structures of databases to support semantic browsing and large-scale data integration. We describe iDisc, a novel discovery system based on a multi-strategy learning framework. iDisc exploits varied evidence in database schema and instance values to construct multiple kinds of database representations. It employs a set of base clusterers to discover preliminary topical clusters of tables from the database representations, and then aggregates them into final clusters via meta-clustering. To further improve accuracy, we extend iDisc with novel multiple-level aggregation and clusterer boosting techniques. We introduce a new measure of table importance and propose an approach to discovering cluster representatives to facilitate semantic browsing. An important feature of our framework is that it is highly extensible: additional database representations and base clusterers can be easily incorporated. We have extensively evaluated iDisc using large real-world databases, and the results show that it discovers topical structures with a high degree of accuracy.
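The aggregation step described in the abstract can be illustrated with a small sketch. This is not iDisc's actual algorithm, which the abstract does not specify. It is a generic voting-based ensemble, assuming each base clusterer returns a table-to-label mapping and that two tables belong together when more than a chosen fraction of the clusterers agree. The function name `meta_cluster`, the `threshold` parameter, the table names, and the three example clusterings are all hypothetical.

```python
from itertools import combinations


def meta_cluster(tables, base_clusterings, threshold=0.5):
    """Merge tables that more than `threshold` of the base clusterers co-cluster.

    Hypothetical helper for illustration only; not the method from the paper.
    Each element of `base_clusterings` maps a table name to a cluster label.
    """
    n = len(base_clusterings)
    parent = {t: t for t in tables}  # union-find forest over tables

    def find(t):
        # Follow parent links to the root, compressing the path on the way.
        while parent[t] != t:
            parent[t] = parent[parent[t]]
            t = parent[t]
        return t

    for a, b in combinations(tables, 2):
        # Count the base clusterers that put a and b in the same cluster.
        votes = sum(1 for labels in base_clusterings if labels[a] == labels[b])
        if votes / n > threshold:
            parent[find(a)] = find(b)

    clusters = {}
    for t in tables:
        clusters.setdefault(find(t), set()).add(t)
    return sorted(sorted(c) for c in clusters.values())


# Made-up output of three base clusterers over four tables.
tables = ["customer", "address", "order", "order_item"]
base_clusterings = [
    {"customer": 0, "address": 0, "order": 1, "order_item": 1},  # e.g. schema-based
    {"customer": 0, "address": 1, "order": 2, "order_item": 2},  # e.g. name-based
    {"customer": 0, "address": 0, "order": 0, "order_item": 1},  # e.g. value-based
]

result = meta_cluster(tables, base_clusterings)
print(result)  # [['address', 'customer'], ['order', 'order_item']]
```

Raising `threshold` demands more agreement and yields smaller clusters, while lowering it merges more aggressively. A real system would likely weight clusterers by reliability instead of counting votes equally.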