ACM Home Page
Please provide us with feedback. Feedback
A data structure for a sequence of string accesses in external memory
Full text PdfPdf (230 KB)
Source
ACM Transactions on Algorithms (TALG) archive
Volume 3 ,  Issue 1  (February 2007) table of contents
SECTION: 2 - Regular Papers table of contents
Article No. 6  
Year of Publication: 2007
ISSN:1549-6325
Authors
Valentina Ciriani  University of Milano, Via Bramante, Crema
Paolo Ferragina  University of Pisa, Largo Pontecorro, Pisa
Fabrizio Luccio  University of Pisa, Largo Pontecorro, Pisa
S. Muthukrishnan  Rutgers University, Piscataway, NJ
Publisher
ACM  New York, NY, USA
Bibliometrics
Downloads (6 Weeks): 21,   Downloads (12 Months): 90,   Citation Count: 3
Additional Information:

abstract   references   cited by   index terms  

Tools and Actions: Request Permissions Request Permissions    Review this Article  
DOI Bookmark: Use this link to bookmark this Article: http://doi.acm.org/10.1145/1186810.1186816
What is a DOI?

ABSTRACT

We introduce a new paradigm for querying strings in external memory, suited to the execution of sequences of operations. Formally, given a dictionary of n strings S1, …, Sn, we aim at supporting a search sequence for m not necessarily distinct strings T1, T2, …, Tm, as well as inserting and deleting individual strings. The dictionary is stored on disk, where each access to a disk page fetches B items, the cost of an operation is the number of pages accessed (I/Os), and efficiency must be attained on entire sequences of string operations rather than on individual ones.

Our approach relies on a novel and conceptually simple self-adjusting data structure (SASL) based on skip lists, that is also interesting per se. The search for the whole sequence T1, T2, …, Tm can be done in an expected number of I/Os:

O(∑j=1m |Tj|/B + ∑i=1nn (ni logB m/ni)),

where each Tj may or may not be present in the dictionary, and ni is the number of times Si is queried (i.e., the number of Tjs equal to Si). Moreover, inserting or deleting a string Si takes an expected amortized number O(|Si|/B + logB n) of I/Os. The term ∑j=1m |Tj|/B in the search formula is a lower bound for reading the input, and the term ∑i=1n ni logB m/ni (entropy of the query sequence) is a standard information-theoretic lower bound. We regard this result as the static optimality theorem for external-memory string access, as compared to Sleator and Tarjan's classical theorem for numerical dictionaries [Sleator and Tarjan 1985]. Finally, we reformulate the search bound if a cache is available, taking advantage of common prefixes among the strings examined in the search.


REFERENCES

Note: OCR errors may be found in this Reference List extracted from the full text article. ACM has opted to expose the complete List rather than only correct and linked references.

 
1
Aboulnaga, A., Alameldeen, A. R., and Naughton, J. F. 2001. Estimating the selectivity of XML path expressions for Internet scale applications. In Proceedings of the 27th International Conference on Very Large Data Bases. 591--600.
 
2
Abramson, N. 1983. Information Theory and Coding. McGraw-Hill, New York.
 
3
Ailamaki, A., DeWitt, D. J., Hill, M. D., and Wood, D. A. 1999. DBMSs on a modern processor: Where does time go? In Proceedings of the 25th International Conference on Very Large Data Bases. 266--277.
 
4
Bedathur, S. J., and Haritsa, J. R. 2004. Engineering a fast online persistent suffix tree construction. In Proceedings of the 20th International Conference on Data Engineering. 720--731.
 
5
Blum, A., Chawla, S., and Kalai, A. 2003. Static optimality and dynamic search optimality in lists and trees. Algorithmica 36, 3, 249--260.
 
6
Callahan, P. B., Goodrich, M. T., and Ramaiyer, K. 1995. Topology B-trees and their applications. In Proceedings of the 4th Workshop on Algorithms and Data Structures. Lecture Notes in Computer Science, vol. 955, Springer Verlag. 381--392.
 
7
Chazelle, B., and Guibas, L. J. 1986. Fractional cascading: I. A data structuring technique. Algorithmica 1, 2, 133--162.
 
8
Cheung, C., Yu, J., and Lu, H. 2005. Constructing suffix tree for gigabyte sequences with megabyte memory. IEEE Trans. Knowl. Data Engi. 17, 1, 90--105.
 
9
Cooper, B., Sample, N., Franklin, M. J., Hjaltason, G. R., and Shadmon, M. 2001. A fast index for semistructured data. In Proceedings of the 27th International Conference on Very Large Data Bases. 341--350.
 
10
Ergun, F., Sahinalp, S. C., Sharp, J., and Sinha, R. K. 2001. Biased dictionaries with fast insert/deletes. In Proceedings of the 33rd ACM Symposium on the Theory of Computing. 483--491.
 
11
Ferguson, D. E. 1992. Bit-Tree: A data structure for fast file processing. Commun. ACM 35, 6, 114--120.
 
12
Ferragina, P. 2005. String search in external memory: Algorithms and data structures. In Handbook of Computational Molecular Biology, S. Aluru, ed., Chapman and Hall, London.
 
13
Ferragina, P., and Grossi, R. 1999. The string B-tree: A new data structure for string search in external memory and its applications. J. ACM 46, 2, 236--280.
 
14
Frigo, M., Leiserson, C., Prokop, H., and Ramachandran, S. 1999. Cache-Oblivious algorithms. In Proceedings of the 40th IEEE Symposium on the Foundations of Computer Science. 285--298.
 
15
Giegerich, R., Kurtz, S., and Stoye, J. 2003. Efficient implementation of lazy suffix trees. Softw. Pract. Exper. 33, 1035--1049.
 
16
Grossi, R., and Italiano, G. 1999. Efficient techniques for maintaining multidimensional keys in linked data structures. In Proceedings of the 26th International Colloquium on Automata, Languages and Programming. Lecture Notes in Computer Science, vol. 1644. Springer Verlag. 372--381.
 
17
Gusfield, D. 1997. Algorithms on Strings, Trees and Sequences: Computer Science and Computational Biology. Cambridge University Press, New York.
 
18
Hunt, E., Atkinson, M., and Irving, R. 2002. Database indexing for large DNA and protein sequence collections. Int. J. Very Large Data Bases 11, 3, 256--271.
 
19
Iacono, J. 2001. Alternatives to splay trees with o(log n) worst-case times. In Proceedings of the 12th ACM-SIAM Symposium on Discrete Algorithms. 516--522.
 
20
Kärkkäinen, J., and Rao, S. 2003. Full-Text indexes in external memory. In Algorithms for Memory Hierarchies, U. Meyer et al. eds. Lecture Notes in Computer Science, vol. 2625. Springer Verlag, 149--170.
 
21
Manber, U., and Myers, G. 1993. Suffix arrays: A new method for on-line string searches. SIAM J. Comput. 22, 5, 935--948.
 
22
Martel, C. 1991. Self-Adjusting multi-way search trees. Inf. Process. Lett. 38, 3, 135--141.
 
23
Mehlhorn, K. 1984. Data Structures and Algorithms 1: Searching and Sorting. EATCS Monographs on Theoretical Computer Science, vol. 1. Springer Verlag.
 
24
Mehlhorn, K., and Naher, S. 1992. Algorithm design and software libraries: Recent developments in the Leda project. In IFIP World Computer Congress, vol. 1, 493--505.
 
25
Mulmuley, K. 1994. Computational Geometry: An Introduction Through Randomized Algorithms. Prentice-Hall, Upper Saddle River, NJ.
 
26
Pugh, W. 1990. Skip lists: A probabilistic alternative to balanced trees. Commun. ACM 33, 6, 668--676.
 
27
Sherk, M. 1995. Self-Adjusting k-ary search trees. J. Algorithms 19, 1, 25--44.
 
28
Sleator, D., and Tarjan, R. 1985. Self-Adjusting binary search trees. J. ACM 32, 3, 652--686.
 
29
Tata, S., Hankins, R. A., and Patel, J. M. 2004. Practical suffix tree construction. In Proceedings of the 13th International Conference on Very Large Data Bases. 36--47.
 
30
Vitter, J. 2002. External memory algorithms and data structures: Dealing with MASSIVE DATA. ACM Comput. Surv. 33, 2, 209--271.
 
31
Vitter, J. S., and Krishnan, P. 1996. Optimal prefetching via data compression. J. ACM 43, 5, 771--793.
 
32
Xie, Y., and O'Hallaran, D. 2002. Locality in search engine queries and its implications for caching. In Proceedings of the 21st IEEE INFOCOM Conference. 1238--1247.