ACM Home Page
Please provide us with feedback. Feedback
Schema-conscious filtering of XML documents
Full text PdfPdf (579 KB)
Source Extending Database Technology; Vol. 360 archive
Proceedings of the 12th International Conference on Extending Database Technology: Advances in Database Technology table of contents
Saint Petersburg, Russia
SESSION: Research sessions: Provenance table of contents
Pages 970-981  
Year of Publication: 2009
ISBN:978-1-60558-422-5
Authors
Panu Silvasti  Helsinki University of Technology
Seppo Sippu  University of Helsinki
Eljas Soisalon-Soininen  Helsinki University of Technology
Publisher
ACM  New York, NY, USA
Bibliometrics
Downloads (6 Weeks): 7,   Downloads (12 Months): 55,   Citation Count: 0
Additional Information:

abstract   references   collaborative colleagues  

Tools and Actions: Request Permissions Request Permissions    Review this Article  
DOI Bookmark: Use this link to bookmark this Article: http://doi.acm.org/10.1145/1516360.1516471
What is a DOI?

ABSTRACT

In a publish-subscribe system based on filtering of XML documents, subscribers specify their interests with profiles expressed in the XPath language. The system processes a stream of XML documents and delivers to subscribers a notification or content of documents that match the profiles. For filtering with profiles expressed as linear XPath queries, automaton-based approaches exist where the intractable size growth of a preconstructed deterministic finite automaton is avoided by using a nondeterministic automaton. In this article we examine how these general approaches, which do not assume the existence of any specific schema or document type definition (DTD), might benefit from the knowledge that all the XML documents to be filtered obey a given DTD.

We present an algorithm that utilizes the DTD in the preprocessing phase of the filtering automaton to prune out descendant operators (//) and wildcards (*) from the linear XPath filters. Experiments with data obtained from the XML Data Repository of the Univ. of Washington indicate that filter pruning can increase the throughput of the nondeterministic YFilter automaton by Diao et al. by a factor of 2 to 20. We also present a new filtering algorithm that is based on a backtracking deterministic finite automaton derived from the classic Aho--Corasick pattern-matching automaton. This automaton has a size linear in the sum of the sizes of the filters. For our algorithm, we obtained a throughput of 15 MB/sec for filters pruned from one million original filters (with all wildcards and non-leading descendant operators eliminated), representing an improvement by a factor of 2 to 3 upon the throughput of YFilter.


REFERENCES

Note: OCR errors may be found in this Reference List extracted from the full text article. ACM has opted to expose the complete List rather than only correct and linked references.

1
 
2
 
3
4
 
5
6
7
 
8
9
 
10
 
11
E. Soisalon-Soininen and T. Ylönen. On classification of strings. In String Processing and Information Retrieval, 11th Internat. Conf., SPIRE 2004, Proceedings, pages 321--330, 2004.
 
12
D. Suciu. XML data repository. The Database Research Group, University of Washington, 2006. www.cs.washington.edu/research/xmldatasets/.
Collaborative Colleagues:
Panu Silvasti: colleagues
Seppo Sippu: colleagues
Eljas Soisalon-Soininen: colleagues