|
ABSTRACT
Applications that manage semi-structured data are becoming increasingly commonplace. Current approaches for storing semi-structured data use existing storage machinery; they either map the data to relational databases, or use a combination of flat files and indexes. While employing these existing storage mechanisms provides readily available solutions, there is a need to more closely examine their suitability to this class of data. Particularly, retrofitting existing solutions for semi-structured data can result in a mismatch between the tree structure of the data and the access characteristics of the underlying storage device (disk drive). This study explores various possibilities in the design space of native storage solutions for semi-structured data by exploring alternative approaches that match application data access characteristics to those of the underlying disk drive. For evaluating the effectiveness of the proposed native techniques in relation to the existing solution, we experiment with XML data using the XPathMark benchmark. Extensive evaluation reveals the strengths and weaknesses of the proposed native data layout techniques. While the existing solutions work really well for deep-focused queries into a semi-structured document (those that result in retrieving entire subtrees), the proposed native solutions substantially outperform for the non-deep-focused queries, which we demonstrate are at least as important as the deep-focused. We believe that native data layout techniques offer a unique direction for improving the performance of semi-structured data stores for a variety of important workloads. However, given that the proposed native techniques require circumventing current storage stack abstractions, further investigation is warranted before they can be applied to general-purpose storage systems.
REFERENCES
Note: OCR errors may be found in this Reference List extracted from the full text article. ACM has opted to expose the complete List rather than only correct and linked references.
 |
1
|
Serge Abiteboul , Rakesh Agrawal , Phil Bernstein , Mike Carey , Stefano Ceri , Bruce Croft , David DeWitt , Mike Franklin , Hector Garcia Molina , Dieter Gawlick , Jim Gray , Laura Haas , Alon Halevy , Joe Hellerstein , Yannis Ioannidis , Martin Kersten , Michael Pazzani , Mike Lesk , David Maier , Jeff Naughton , Hans Schek , Timos Sellis , Avi Silberschatz , Mike Stonebraker , Rick Snodgrass , Jeff Ullman , Gerhard Weikum , Jennifer Widom , Stan Zdonik, The Lowell database research self-assessment, Communications of the ACM, v.48 n.5, p.111-118, May 2005
[doi> 10.1145/1060710.1060718]
|
| |
2
|
Afanasiev, L., Manolescu, I., and Michiels, P. 2005. Member: A micro-benchmark repository for XQuery. In Proceedings of the 3rd International XML Database Symposium on Database and XML Technologies (XSym'05). S. Bressan et al., Eds. Lecture Notes in Computer Science, vol. 3671. Springer, 144--161.
|
| |
3
|
Afanasiev, L. and Marx, M. 2006. An analysis of the current xquery benchmarks. In ExpDB. 9--20.
|
| |
4
|
Altschul, S. F., Gish, W., Miller, W., Myers, E. W., and Lipman, D. J. 1990. Basic local alignment search tool. J. Mol. Biol. 215, 3, 403--410.
|
| |
5
|
Barbosa, D., Barta, A., Mendelzon, A. O., Mihaila, G. A., Rizzolo, F., and Rodriguez-Guianolli, P. 2001. Tox - The Toronto XML engine. In Proceedings of the Workshop on Information Integration on the Web. 66--73.
|
| |
6
|
Bedathur, S. and Haritsa, J. 2006. Search-Optimized suffix-tree storage for biological applications. In Proceedings of the12th IEEE International Conference on High Performance Computing (HiPC). D. A. Bader et al., Eds. Lecture Notes in Computer Science, vol. 3769, 29--39.
|
 |
7
|
Kevin Beyer , Roberta J. Cochrane , Vanja Josifovski , Jim Kleewein , George Lapis , Guy Lohman , Bob Lyle , Fatma Özcan , Hamid Pirahesh , Normen Seemann , Tuong Truong , Bert Van der Linden , Brian Vickery , Chun Zhang, System RX: one part relational, one part XML, Proceedings of the 2005 ACM SIGMOD international conference on Management of data, June 14-16, 2005, Baltimore, Maryland
[doi> 10.1145/1066157.1066197]
|
| |
8
|
Bhadkamkar, M., Farfan, F., Hristidis, V., and Rangaswami, R. 2006. Efficient native storage for semi-structured data (extended paper version). http://www.cis.fiu.edu/SSS/NativeXMLextended.pdf.
|
| |
9
|
|
| |
10
|
|
| |
11
|
|
 |
12
|
Ying Guang Li , Stéphane Bressan , Gillian Dobbie , Zoé Lacroix , Mong Li Lee , Ullas Nambiar , Bimlesh Wadhwa, XOO7: applying OO7 benchmark to XML query processing tool, Proceedings of the tenth international conference on Information and knowledge management, October 05-10, 2001, Atlanta, Georgia, USA
[doi> 10.1145/502585.502614]
|
| |
13
|
Bucy, J., Ganger, G., and Contributors. 2003. The DiskSim simulation environment version 3.0 reference manual. Tech. rep. CMU-CS-03-102, Carnegie Mellon University.
|
 |
14
|
Michael J. Carey , David J. DeWitt , Michael J. Franklin , Nancy E. Hall , Mark L. McAuliffe , Jeffrey F. Naughton , Daniel T. Schuh , Marvin H. Solomon , C. K. Tan , Odysseas G. Tsatalos , Seth J. White , Michael J. Zwilling, Shoring up persistent applications, Proceedings of the 1994 ACM SIGMOD international conference on Management of data, p.383-394, May 24-27, 1994, Minneapolis, Minnesota, United States
|
| |
15
|
CDA. 2007. HL7 clinical document architecture, release 2.0. http://lists.hl7.org/read/attachment/61225/1/CDA-doc 20version.pdf. 2007.
|
| |
16
|
Delcher, A., Kasif, S., Fleischmann, R., Peterson, J., White, O., and Salzberg, S. 1999. Alignment of whole genomes. Nucleic Acids Res. 27, 11, 2369--2376.
|
 |
17
|
Alin Deutsch , Mary Fernandez , Dan Suciu, Storing semistructured data with STORED, Proceedings of the 1999 ACM SIGMOD international conference on Management of data, p.431-442, May 31-June 03, 1999, Philadelphia, Pennsylvania, United States
|
| |
18
|
Dimitrijevic, Z., Rangaswami, R., Chang, E., Watson, D., and Acharya, A. 2004. Diskbench: User-Level disk feature extraction tool. Tech. rep. TR-2004-18, University of California at Santa Barbara.
|
| |
19
|
Dolin, R. H., Alschuler, L., Boyer, S., Beebe, C., Behlen, F. M., Biron, P. V., and Shabo Shvo, A. 2006. HL7 clinical document architecture release 2. J. Amer. Med. Inf. Assoc. 13, 1.
|
| |
20
|
|
| |
21
|
Farfan, F., Hristidis, V., and Rangaswami, R. 2007. Beyond lazy XML parsing. In Proceedings of International Conference on Database and Expert Systems Applications (DEXA).
|
 |
22
|
|
| |
23
|
Franceschet, M. 2004. XPathMark: An XPath benchmark for XMark. Tech. rep. PP-2004-04, University of Amsterdam.
|
| |
24
|
Franceschet, M. 2005. XPathMark: An XPath benchmark for the XMark generated data. Lecture Notes in Computer Science, vol. 3671. Springer, 129--143.
|
| |
25
|
Galax. 2007. Galax. 2007. Galax homepage. http://www.galaxquery.org.
|
| |
26
|
Ganger, G. R. 2001. Blurring the line between OSes and storage devices. Tech. rep. CMU-CS-01-166, Carnegie Mellon University.
|
 |
27
|
Garth A. Gibson , David F. Nagle , Khalil Amiri , Jeff Butler , Fay W. Chang , Howard Gobioff , Charles Hardin , Erik Riedel , David Rochberg , Jim Zelenka, A cost-effective, high-bandwidth storage architecture, Proceedings of the eighth international conference on Architectural support for programming languages and operating systems, p.92-103, October 02-07, 1998, San Jose, California, United States
|
| |
28
|
GML. 2008. Geography markup language. http://opengis.net/gml/.
|
| |
29
|
|
| |
30
|
HL7. 2008. Health level seven XML. http://www.hl7.org/special/Committees/xml/xml.htm.
|
| |
31
|
Larry Huston , Rahul Sukthankar , Rajiv Wickremesinghe , M. Satyanarayanan , Gregory R. Ganger , Erik Riedel , Anastassia Ailamaki, Diamond: A Storage Architecture for Early Discard in Interactive Search, Proceedings of the 3rd USENIX Conference on File and Storage Technologies, March 31-31, 2004, San Francisco, CA
|
 |
32
|
|
| |
33
|
H. V. Jagadish , S. Al-Khalifa , A. Chapman , L. V. S. Lakshmanan , A. Nierman , S. Paparizos , J. M. Patel , D. Srivastava , N. Wiwatwattana , Y. Wu , C. Yu, TIMBER: A native XML database, The VLDB Journal — The International Journal on Very Large Data Bases, v.11 n.4, p.274-291, December 2002
[doi> 10.1007/s00778-002-0081-x]
|
| |
34
|
Kanne, C. and Moerkotte, G. 1999. Efficient storage of XML data. Tech. rep., Universitaet Mannheim.
|
 |
35
|
|
| |
36
|
|
 |
37
|
|
 |
38
|
|
| |
39
|
Kundu, S. and Misra, J. 1977. A linear tree partition algorithm. SIAM J. Comput. 6, 1,151--154.
|
| |
40
|
Li, Q. and Moon, B. 2001. Indexing and querying XML data for regular path expressions. VLDB J.
|
| |
41
|
Manolescu, I., Miachon, C., and Michiels, P. 2006. Towards micro-benchmarking Xquery. In Proceedings of the International Workshop on Performance and Evaluation of Data Management Systems (ExpDB), 28--39.
|
 |
42
|
|
| |
43
|
Xiaofeng Meng , Daofeng Luo , Mong Li Lee , Jing An, OrientStore: a schema based native XML storage system, Proceedings of the 29th international conference on Very large data bases, p.1057-1060, September 09-12, 2003, Berlin, Germany
|
| |
44
|
Mergen, S. L. S. and Heuser, C. A. 2004. Matching of XML schemas and relational schemas. In Proceedings of the Brazilian Symposium on Databases (SBBD).
|
| |
45
|
MML. 2008. Medical markup language. http://www.ncbi.nlm.nih.gov/.
|
| |
46
|
Nambiar, U., Lacroix, Z., Bressan, S., Lee, M., and Li, Y. 2001. XML benchmarks put to the test. http://www.comp.nus.edu.sg/~liyg/publication/iiwas01.pdf.
|
| |
47
|
|
 |
48
|
|
 |
49
|
|
 |
50
|
|
| |
51
|
ODS. 2008. Open document specification v1.0. http://www.oasis-open.org/committees/download.php/12572/OpenDocument-v1.0-os.pdf.
|
| |
52
|
OOX. 2008. Openoffice XML file format v1.0.
|
| |
53
|
|
| |
54
|
Ramanath, M., Freire, J., Haritsa, J., and Roy, P. 2003. Searching for efficient XML to relational mappings. Tech. rep. TR-2003-01, DSL/SERC.
|
| |
55
|
|
| |
56
|
Rokhsar, D. 2007. Computational analysis of genomic sequence data. http://www.nersc.gov/news/annual reports/annrep01/sh BER 06.html.
|
| |
57
|
|
| |
58
|
|
| |
59
|
Jiri Schindler , Steven W. Schlosser , Minglong Shao , Anastassia Ailamaki , Gregory R. Ganger, Atropos: A Disk Array Volume Manager for Orchestrated Use of Disks, Proceedings of the 3rd USENIX Conference on File and Storage Technologies, March 31-31, 2004, San Francisco, CA
|
| |
60
|
Steven W. Schlosser , Jiri Schindler , Stratos Papadomanolakis , Minglong Shao , Anastassia Ailamaki , Christos Faloutsos , Gregory R. Ganger, On multidimensional data and modern disks, Proceedings of the 4th conference on USENIX Conference on File and Storage Technologies, p.17-17, December 13-16, 2005, San Francisco, CA
|
| |
61
|
Albrecht Schmidt , Florian Waas , Martin Kersten , Michael J. Carey , Ioana Manolescu , Ralph Busse, XMark: a benchmark for XML data management, Proceedings of the 28th international conference on Very Large Data Bases, p.974-985, August 20-23, 2002, Hong Kong, China
|
| |
62
|
Jayavel Shanmugasundaram , Kristin Tufte , Chun Zhang , Gang He , David J. DeWitt , Jeffrey F. Naughton, Relational Databases for Querying XML Documents: Limitations and Opportunities, Proceedings of the 25th International Conference on Very Large Data Bases, p.302-314, September 07-10, 1999
|
| |
63
|
Muthian Sivathanu , Vijayan Prabhakaran , Florentina I. Popovici , Timothy E. Denehy , Andrea C. Arpaci-Dusseau , Remzi H. Arpaci-Dusseau, Semantically-Smart Disk Systems, Proceedings of the 2nd USENIX Conference on File and Storage Technologies, March 31-31, 2003, San Francisco, CA
|
| |
64
|
SVG. 2008. Scalable vector graphics. http://www.w3.org/Graphics/SVG/.
|
| |
65
|
|
 |
66
|
Bruce L. Worthington , Gregory R. Ganger , Yale N. Patt , John Wilkes, On-line extraction of SCSI disk drive parameters, Proceedings of the 1995 ACM SIGMETRICS joint international conference on Measurement and modeling of computer systems, p.146-156, May 15-19, 1995, Ottawa, Ontario, Canada
|
| |
67
|
Xalan. 2007. Xalan-Java. http://xml.apache.org/xalan-j.
|
| |
68
|
XPath. 2007. XML path language (XPath) version 1.0. http://www.w3.org/TR/xpath.
|
| |
69
|
XT. 2007. XT homepage. http://www.blnz.com/xt/index.html.
|
| |
70
|
|
|