|
ABSTRACT
With more and more natural language text stored in databases, handling respective query predicates becomes very important. Optimizing queries with predicates includes (sub)string estimation, i.e., estimating the selectivity of query terms based on small summary statistics before query execution. Count Suffix Trees (CST) are commonly used to this end. While CST yield good estimates, they are expensive to build and require a large amount of memory to be stored. To fit in the data dictionary of database systems, they have to be severely pruned. Existing pruning techniques are based on suffix frequency or tree depth. In this paper, we propose new filtering and pruning techniques that reduce both the size of CST over natural-language texts and the cost of building them. The core idea is to exploit features of the natural language data, i.e., regarding only the suffixes that are useful in a linguistic sense. The most important innovations are (a) a new aggressive approximate syllabification technique to filter out suffixes, (b) a new affix and prefix stripping procedure that conflates more terms than conventional stemming techniques, (c) the deployment of state-of-the-art trigram techniques and a new syllable-based mechanism to filter out non-words (i.e., misspellings and other language anomalies like foreign words), which would cause an over-proportional growth of the CST otherwise. -- Our evaluation with large English text corpora shows that our new mechanisms in combination decrease the size of a CST by up to 80% and shorten the build phase significantly. From a different perspective, if storage space remains unchanged, the accuracy of selectivity estimates computed from the CST increases by up to 70%.
REFERENCES
Note: OCR errors may be found in this Reference List extracted from the full text article. ACM has opted to expose the complete List rather than only correct and linked references.
| |
1
|
|
| |
2
|
|
| |
3
|
|
| |
4
|
D. W. Cummings. American English Spelling: An Informal Description. Johns Hopkins University Press, 1988.
|
| |
5
|
D. Graff. The aquaint corpus of english news text. Linguistic Data Consortium, Philadelphia, 2002.
|
| |
6
|
|
 |
7
|
P. Krishnan , Jeffrey Scott Vitter , Bala Iyer, Estimating alphanumeric selectivity in the presence of wildcards, Proceedings of the 1996 ACM SIGMOD international conference on Management of data, p.282-293, June 04-06, 1996, Montreal, Quebec, Canada
|
 |
8
|
|
 |
9
|
|
| |
10
|
M. Lennon, D. Pierce, B. Tarry, and P. Willett. An evaluation of some conflation algorithms for information retrieval. Journal of Information Science, 3(177--183), 1981.
|
| |
11
|
D. D. Lewis. Reuters-21578. http://www.daviddlewis.com/resources/testcollections/reuters21578/.
|
| |
12
|
|
| |
13
|
J. B. Lovins. Development of a stemming algorithm. Mechanical Translation and Computational Linguistics, 11:22--31, 1968.
|
| |
14
|
M. F. Porter. An algorithm for suffix stripping. Program, 14(3):130--137, 1980.
|
| |
15
|
|
| |
16
|
E. Ukkonen. On-line construction of suffix trees. Algorithmica, 14(3):249--260, 1995.
|
| |
17
|
|
| |
18
|
E. M. Zamora, J. Pollock, and A. Zamora. The use of trigram analysis for spelling error detection. Information of Processing and Management, 17:305--316, 1981.
|
 |
19
|
|
| |
20
|
R. Giegerich, S. Kurtz, J. Stoye, Efficient Implementation of Lazy Suffix Trees, Software: Practice and Experience, Vol. 33, No 11, 2003
|
|