|
ABSTRACT
We propose the first known solution to the problem of correlating, in small space, continuous streams of XML data through approximate (structure and content) matching, as defined by a general tree-edit distance metric. The key element of our solution is a novel algorithm for obliviously embedding tree-edit distance metrics into an L1 vector space while guaranteeing a (worst-case) upper bound of O(log2n log*n) on the distance distortion between any data trees with at most n nodes. We demonstrate how our embedding algorithm can be applied in conjunction with known random sketching techniques to (1) build a compact synopsis of a massive, streaming XML data tree that can be used as a concise surrogate for the full tree in approximate tree-edit distance computations; and (2) approximate the result of tree-edit-distance similarity joins over continuous XML document streams. Experimental results from an empirical study with both synthetic and real-life XML data trees validate our approach, demonstrating that the average-case behavior of our embedding techniques is much better than what would be predicted from our theoretical worst-case distortion bounds. To the best of our knowledge, these are the first algorithmic results on low-distortion embeddings for tree-edit distance metrics, and on correlating (e.g., through similarity joins) XML data in the streaming model.
REFERENCES
Note: OCR errors may be found in this Reference List extracted from the full text article. ACM has opted to expose the complete List rather than only correct and linked references.
 |
1
|
Swarup Acharya , Phillip B. Gibbons , Viswanath Poosala , Sridhar Ramaswamy, Join synopses for approximate query answering, Proceedings of the 1999 ACM SIGMOD international conference on Management of data, p.275-286, May 31-June 03, 1999, Philadelphia, Pennsylvania, United States
|
 |
2
|
Noga Alon , Phillip B. Gibbons , Yossi Matias , Mario Szegedy, Tracking join and self-join sizes in limited storage, Proceedings of the eighteenth ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems, p.10-20, May 31-June 03, 1999, Philadelphia, Pennsylvania, United States
[doi> 10.1145/303976.303978]
|
 |
3
|
Noga Alon , Yossi Matias , Mario Szegedy, The space complexity of approximating the frequency moments, Proceedings of the twenty-eighth annual ACM symposium on Theory of computing, p.20-29, May 22-24, 1996, Philadelphia, Pennsylvania, United States
[doi> 10.1145/237814.237823]
|
| |
4
|
|
| |
5
|
|
 |
6
|
Arvind Arasu , Brian Babcock , Shivnath Babu , Jon McAlister , Jennifer Widom, Characterizing memory requirements for queries over continuous data streams, Proceedings of the twenty-first ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems, June 03-05, 2002, Madison, Wisconsin
[doi> 10.1145/543613.543642]
|
 |
7
|
Brian Babcock , Shivnath Babu , Mayur Datar , Rajeev Motwani , Jennifer Widom, Models and issues in data stream systems, Proceedings of the twenty-first ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems, June 03-05, 2002, Madison, Wisconsin
[doi> 10.1145/543613.543615]
|
| |
8
|
|
| |
9
|
|
| |
10
|
|
| |
11
|
|
| |
12
|
|
| |
13
|
Cormode, G., Datar, M., Indyk, P., and Muthukrishnan, S. 2002a. Comparing data streams using hamming norms (how to zero in). In Proceedings of the 28th International Conference on Very Large Data Bases (Hong Kong, China). 335--345.
|
| |
14
|
|
| |
15
|
|
| |
16
|
|
 |
17
|
|
| |
18
|
Diao, Y. and Franklin, M. 2003. Query processing for high-volume XML message brokering. In Proceedings of the 29th International Conference on Very Large Data Bases (Berlin, Germany).
|
 |
19
|
|
| |
20
|
Dobra, A., Garofalakis, M., Gehrke, J., and Rastogi, R. 2004. Sketch-based multi-query processing over data streams. In Proceedings of the 9th International Conference on Extending Database Technology (EDBT'2004, Heraklion-Crete, Greece).
|
| |
21
|
|
| |
22
|
|
| |
23
|
Ganguly, S., Garofalakis, M., and Rastogi, R. 2004. Processing data-stream join aggregates using skimmed sketches. In Proceedings of the 9th International Conference on Extending Database Technology (EDBT'2004, Heraklion-Crete, Greece).
|
| |
24
|
Garofalakis, M., Gehrke, J., and Rastogi, R. 2002. Querying and mining data streams: you only get one look (Tutorial). In Proceedings of the 28th International Conference on Very Large Data Bases (Hong Kong, China).
|
| |
25
|
|
 |
26
|
|
 |
27
|
Anna C. Gilbert , Sudipto Guha , Piotr Indyk , Yannis Kotidis , S. Muthukrishnan , Martin J. Strauss, Fast, small-space algorithms for approximate histogram maintenance, Proceedings of the thiry-fourth annual ACM symposium on Theory of computing, May 19-21, 2002, Montreal, Quebec, Canada
[doi> 10.1145/509907.509966]
|
| |
28
|
|
| |
29
|
Gilbert, A. C., Kotidis, Y., Muthukrishnan, S., and Strauss, M. J. 2002b. How to summarize the universe: Dynamic maintenance of quantiles. In Proceedings of the 28th International Conference on Very Large Data Bases (Hong Kong, China). 454--465.
|
| |
30
|
Luis Gravano , Panagiotis G. Ipeirotis , H. V. Jagadish , Nick Koudas , S. Muthukrishnan , Divesh Srivastava, Approximate String Joins in a Database (Almost) for Free, Proceedings of the 27th International Conference on Very Large Data Bases, p.491-500, September 11-14, 2001
|
 |
31
|
|
 |
32
|
Sudipto Guha , H. V. Jagadish , Nick Koudas , Divesh Srivastava , Ting Yu, Approximate XML joins, Proceedings of the 2002 ACM SIGMOD international conference on Management of data, June 03-06, 2002, Madison, Wisconsin
[doi> 10.1145/564691.564725]
|
 |
33
|
|
| |
34
|
|
| |
35
|
|
| |
36
|
|
| |
37
|
|
| |
38
|
Johnson, W. B. and Lindenstrauss, J. 1984. Extensions of lipschitz mappings into Hilbert space. Contemp. Math. 26, 189--206.
|
| |
39
|
|
| |
40
|
Knuth, D. E. 1973. The Art of Computer Programming (Vol. 1/Fundamental Algorithms). Addison-Wesley, Reading, MA.
|
| |
41
|
|
| |
42
|
Manku, G. S. and Motwani, R. 2002. Approximate frequency counts over data streams. In Proceedings of the 28th International Conference on Very Large Data Bases (Hong Kong, China). 346--357.
|
| |
43
|
|
| |
44
|
Nolan, J. P. 2004. Stable distributions: Models for heavy-tailed data. Available online at http://academic2.american.edu/~jpnolan/stable/stable.html.
|
 |
45
|
|
 |
46
|
|
| |
47
|
Schmidt, A., Waas, F., Kersten, M., Carey, M. J., Manolescu, I., and Busse, R. 2002. XMark: A benchmark for XML data management. In Proceedings of the 28th International Conference on Very Large Data Bases (Hong Kong, China).
|
| |
48
|
|
 |
49
|
|
| |
50
|
Uchaikin, V. V. and Zolotarev, V. M. 1999. Chance and Stability : Stable Distributions and their Applications. VSP, Utrecht, The Netherland.
|
| |
51
|
|
 |
52
|
|
| |
53
|
|
|