ACM Home Page
Please provide us with feedback. Feedback
Improved count suffix trees for natural language data
Full text PdfPdf (145 KB)
Source
ACM International Conference Proceeding Series; Vol. 299 archive
Proceedings of the 2008 international symposium on Database engineering & applications table of contents
Coimbra, Portugal
SESSION: QoS, query processing, optimization table of contents
Pages 231-236  
Year of Publication: 2008
ISBN:978-1-60558-188-0
Authors
Guido Sautter  Universität Karlsruhe (TH), Karlsruhe
Cristina Abba  Corso Francia, Torino
Klemens Böhm  Universität Karlsruhe (TH), Karlsruhe
Sponsor
ACM: Association for Computing Machinery
Publisher
ACM  New York, NY, USA
Bibliometrics
Downloads (6 Weeks): 5,   Downloads (12 Months): 30,   Citation Count: 0
Additional Information:

abstract   references   index terms   collaborative colleagues  

Tools and Actions: Review this Article  
DOI Bookmark: Use this link to bookmark this Article: http://doi.acm.org/10.1145/1451940.1451972
What is a DOI?

ABSTRACT

With more and more natural language text stored in databases, handling respective query predicates becomes very important. Optimizing queries with predicates includes (sub)string estimation, i.e., estimating the selectivity of query terms based on small summary statistics before query execution. Count Suffix Trees (CST) are commonly used to this end. While CST yield good estimates, they are expensive to build and require a large amount of memory to be stored. To fit in the data dictionary of database systems, they have to be severely pruned. Existing pruning techniques are based on suffix frequency or tree depth. In this paper, we propose new filtering and pruning techniques that reduce both the size of CST over natural-language texts and the cost of building them. The core idea is to exploit features of the natural language data, i.e., regarding only the suffixes that are useful in a linguistic sense. The most important innovations are (a) a new aggressive approximate syllabification technique to filter out suffixes, (b) a new affix and prefix stripping procedure that conflates more terms than conventional stemming techniques, (c) the deployment of state-of-the-art trigram techniques and a new syllable-based mechanism to filter out non-words (i.e., misspellings and other language anomalies like foreign words), which would cause an over-proportional growth of the CST otherwise. -- Our evaluation with large English text corpora shows that our new mechanisms in combination decrease the size of a CST by up to 80% and shorten the build phase significantly. From a different perspective, if storage space remains unchanged, the accuracy of selectivity estimates computed from the CST increases by up to 70%.


REFERENCES

Note: OCR errors may be found in this Reference List extracted from the full text article. ACM has opted to expose the complete List rather than only correct and linked references.

 
1
 
2
 
3
 
4
D. W. Cummings. American English Spelling: An Informal Description. Johns Hopkins University Press, 1988.
 
5
D. Graff. The aquaint corpus of english news text. Linguistic Data Consortium, Philadelphia, 2002.
 
6
7
8
9
 
10
M. Lennon, D. Pierce, B. Tarry, and P. Willett. An evaluation of some conflation algorithms for information retrieval. Journal of Information Science, 3(177--183), 1981.
 
11
D. D. Lewis. Reuters-21578. http://www.daviddlewis.com/resources/testcollections/reuters21578/.
 
12
 
13
J. B. Lovins. Development of a stemming algorithm. Mechanical Translation and Computational Linguistics, 11:22--31, 1968.
 
14
M. F. Porter. An algorithm for suffix stripping. Program, 14(3):130--137, 1980.
 
15
 
16
E. Ukkonen. On-line construction of suffix trees. Algorithmica, 14(3):249--260, 1995.
 
17
 
18
E. M. Zamora, J. Pollock, and A. Zamora. The use of trigram analysis for spelling error detection. Information of Processing and Management, 17:305--316, 1981.
19
 
20
R. Giegerich, S. Kurtz, J. Stoye, Efficient Implementation of Lazy Suffix Trees, Software: Practice and Experience, Vol. 33, No 11, 2003

Collaborative Colleagues:
Guido Sautter: colleagues
Cristina Abba: colleagues
Klemens Böhm: colleagues