ABSTRACT
The availability of summary data for XML documents has many applications, from providing users with quick feedback about their queries to cost-based storage design and query optimization. StatiX is a novel XML Schema-aware statistics framework that exploits the structure derived from the regular expressions that define elements in an XML Schema to pinpoint places in the schema that are likely sources of structural skew. As we discuss in the paper, this information can be used to build concise, yet accurate, statistical summaries for XML data. StatiX leverages standard XML technology for gathering statistics, notably XML Schema validators, and it uses histograms to summarize both the structure and the values in an XML document. In this paper we describe the StatiX system. We develop algorithms that decompose schemas to obtain statistics at different granularities, and we discuss how statistics can be gathered as documents are validated. We also present an experimental evaluation that demonstrates the accuracy and scalability of our approach, and we show an application of these statistics to cost-based XML storage design.
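To make the idea of a structural histogram concrete, the following is a minimal sketch and not the StatiX implementation. It assumes a plain `xml.etree.ElementTree` parse in place of a schema validator, and the helper names (`fanout_counts`, `equi_width_histogram`), the sample document, and its `show`/`aka` tags are all hypothetical. It counts how many children of one tag each parent element has, then summarizes those counts in an equi-width histogram, which is one simple way to expose structural skew.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def fanout_counts(xml_text, parent_tag, child_tag):
    """For each parent_tag element (document order), count its direct child_tag children."""
    root = ET.fromstring(xml_text)
    return [len(parent.findall(child_tag)) for parent in root.iter(parent_tag)]


def equi_width_histogram(values, num_buckets):
    """Summarize values as num_buckets equal-width ranges: a list of (low, high, count)."""
    if not values:
        return []
    lo, hi = min(values), max(values)
    # Fall back to width 1 when all values are equal, to avoid dividing by zero.
    width = (hi - lo) / num_buckets or 1
    # The maximum value would land one past the last bucket, so clamp it.
    buckets = Counter(min(int((v - lo) / width), num_buckets - 1) for v in values)
    return [(lo + i * width, lo + (i + 1) * width, buckets.get(i, 0))
            for i in range(num_buckets)]


# Hypothetical sample document: some shows have many alternate titles, most have none.
DOC = """<imdb>
  <show><title>A</title><aka>A1</aka><aka>A2</aka><aka>A3</aka></show>
  <show><title>B</title></show>
  <show><title>C</title><aka>C1</aka></show>
  <show><title>D</title></show>
</imdb>"""

counts = fanout_counts(DOC, "show", "aka")
hist = equi_width_histogram(counts, 2)
print(counts)
print(hist)
```

In this sketch the histogram is built in a separate pass over the parsed tree. A validator-driven collector, as the abstract describes, would instead update counts while the document is being validated.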