|
ABSTRACT
XML is widely recognized as the data interchange standard of tomorrow because of its ability to represent data from a variety of sources. Hence, XML is likely to be the format through which data from multiple sources is integrated. In this article, we study the problem of integrating XML data sources through correlations realized as join operations. A challenging aspect of this operation is the XML document structure. Two documents might convey approximately or exactly the same information but may be quite different in structure. Consequently, an approximate match in structure, in addition to content, has to be folded into the join operation. We quantify an approximate match in structure and content for pairs of XML documents using well defined notions of distance. We show how notions of distance that have metric properties can be incorporated in a framework for joins between XML data sources and introduce the idea of reference sets to facilitate this operation. Intuitively, a reference set consists of data elements used to project the data space. We characterize what constitutes a good choice of a reference set, and we propose sampling-based algorithms to identify them. We then instantiate our join framework using the tree edit distance between a pair of trees. We next turn our attention to utilizing well known index structures to improve the performance of approximate XML join operations. We present a methodology enabling adaptation of index structures for this problem, and we instantiate it in terms of the R-tree. We demonstrate the practical utility of our solutions using large collections of real and synthetic XML data sets, varying parameters of interest, and highlighting the performance benefits of our approach.
REFERENCES
Note: OCR errors may be found in this Reference List extracted from the full text article. ACM has opted to expose the complete List rather than only correct and linked references.
| |
1
|
|
| |
2
|
|
 |
3
|
Norbert Beckmann , Hans-Peter Kriegel , Ralf Schneider , Bernhard Seeger, The R*-tree: an efficient and robust access method for points and rectangles, Proceedings of the 1990 ACM SIGMOD international conference on Management of data, p.322-331, May 23-26, 1990, Atlantic City, New Jersey, United States
|
| |
4
|
Bille, P. 2003. Tree Edit Distance, Alignment Distance and Inclusion. Tech. rep. TR-2003-23 IT, University of Copenhagen.
|
 |
5
|
Thomas Brinkhoff , Hans-Peter Kriegel , Bernhard Seeger, Efficient processing of spatial joins using R-trees, Proceedings of the 1993 ACM SIGMOD international conference on Management of data, p.237-246, May 25-28, 1993, Washington, D.C., United States
|
 |
6
|
|
 |
7
|
|
 |
8
|
|
 |
9
|
Sudarshan S. Chawathe , Anand Rajaraman , Hector Garcia-Molina , Jennifer Widom, Change detection in hierarchically structured information, Proceedings of the 1996 ACM SIGMOD international conference on Management of data, p.493-504, June 04-06, 1996, Montreal, Quebec, Canada
|
| |
10
|
|
| |
11
|
|
| |
12
|
Faloutsos, C. 1996. Indexing Multimedia Databases. Kluwer, Academic Publishing.
|
 |
13
|
Christos Faloutsos , M. Ranganathan , Yannis Manolopoulos, Fast subsequence matching in time-series databases, Proceedings of the 1994 ACM SIGMOD international conference on Management of data, p.419-429, May 24-27, 1994, Minneapolis, Minnesota, United States
|
 |
14
|
|
| |
15
|
|
| |
16
|
Garhaldas, H., Florescu, D., Shasha, D., Simon, E., and Saita, E. 2001. Declerative data cleaning. Proceedings of VLDB. 133--145.
|
| |
17
|
Luis Gravano , Panagiotis G. Ipeirotis , H. V. Jagadish , Nick Koudas , S. Muthukrishnan , Divesh Srivastava, Approximate String Joins in a Database (Almost) for Free, Proceedings of the 27th International Conference on Very Large Data Bases, p.491-500, September 11-14, 2001
|
| |
18
|
|
 |
19
|
|
 |
20
|
Sudipto Guha , H. V. Jagadish , Nick Koudas , Divesh Srivastava , Ting Yu, Approximate XML joins, Proceedings of the 2002 ACM SIGMOD international conference on Management of data, June 03-06, 2002, Madison, Wisconsin
[doi> 10.1145/564691.564725]
|
| |
21
|
Guha, S., Koudas, N., Srivastava, D., and Yu, T. 2003. Index based approximate XML joins. International Conference on Data Engineering (ICDE). 303--306.
|
 |
22
|
Sudipto Guha , Rajeev Rastogi , Kyuseok Shim, CURE: an efficient clustering algorithm for large databases, Proceedings of the 1998 ACM SIGMOD international conference on Management of data, p.73-84, June 01-04, 1998, Seattle, Washington, United States
|
| |
23
|
|
| |
24
|
|
 |
25
|
|
| |
26
|
|
| |
27
|
|
| |
28
|
|
| |
29
|
Levenshtein, V. 1966. Binary codes capable of correcting insertions, deletions and reversals. Cybernetics and Control Theory. 707--710.
|
| |
30
|
|
 |
31
|
|
| |
32
|
|
| |
33
|
Sankoff, D. and Kruskal, J. 1983. Time Warps, String Edits and Macromolecules: The Theory and Practice of Sequence Comparison. Addison-Wesley, Reading, MA.
|
| |
34
|
Sarawagi, S. 2000. Special issue on data cleaning. IEEE Data Engin. Bull. 23, 4.
|
| |
35
|
|
| |
36
|
|
 |
37
|
Tian Zhang , Raghu Ramakrishnan , Miron Livny, BIRCH: an efficient data clustering method for very large databases, Proceedings of the 1996 ACM SIGMOD international conference on Management of data, p.103-114, June 04-06, 1996, Montreal, Quebec, Canada
|
|