ACM Home Page
Please provide us with feedback. Feedback
Finding frequent items in probabilistic data
Full text PdfPdf (372 KB)
Source
International Conference on Management of Data archive
Proceedings of the 2008 ACM SIGMOD international conference on Management of data table of contents
Vancouver, Canada
SESSION: Research Session 17: Probabilistic II table of contents
Pages 819-832  
Year of Publication: 2008
ISBN:978-1-60558-102-6
Authors
Qin Zhang  Hong Kong University of Science and Technology, Hong Kong, Hong Kong
Feifei Li  Florida State University, Tallahassee, USA
Ke Yi  Hong Kong University of Science and Technology, Hong Kong, Hong Kong
Sponsors
ACM: Association for Computing Machinery
SIGMOD: ACM Special Interest Group on Management of Data
Publisher
ACM  New York, NY, USA
Bibliometrics
Downloads (6 Weeks): 21,   Downloads (12 Months): 267,   Citation Count: 3
Additional Information:

abstract   references   cited by   index terms   collaborative colleagues  

Tools and Actions: Request Permissions Request Permissions    Review this Article  
DOI Bookmark: Use this link to bookmark this Article: http://doi.acm.org/10.1145/1376616.1376698
What is a DOI?

ABSTRACT

Computing statistical information on probabilistic data has attracted a lot of attention recently, as the data generated from a wide range of data sources are inherently fuzzy or uncertain. In this paper, we study an important statistical query on probabilistic data: finding the frequent items. One straightforward approach to identify the frequent items in a probabilistic data set is to simply compute the expected frequency of an item and decide if it exceeds a certain fraction of the expected size of the whole data set. However, this simple definition misses important information about the internal structure of the probabilistic data and the interplay among all the uncertain entities. Thus, we propose a new definition based on the possible world semantics that has been widely adopted for many query types in uncertain data management, trying to find all the items that are likely to be frequent in a randomly generated possible world. Our approach naturally leads to the study of ranking frequent items based on confidence as well.

Finding likely frequent items in probabilistic data turns out to be much more difficult. We first propose exact algorithms for offline data with either quadratic or cubic time. Next, we design novel sampling-based algorithms for streaming data to find all approximately likely frequent items with theoretically guaranteed high probability and accuracy. Our sampling schemes consume sublinear memory and exhibit excellent scalability. Finally, we verify the effectiveness and efficiency of our algorithms using both real and synthetic data sets with extensive experimental evaluations.


REFERENCES

Note: OCR errors may be found in this Reference List extracted from the full text article. ACM has opted to expose the complete List rather than only correct and linked references.

 
1
C. Aggarwal. On density based transforms for uncertain data mining. In ICDE, 2007.
 
2
 
3
 
4
 
5
L. Antova, C. Koch, and D. Olteanu. 10106 worlds and beyond: Efficient representation and processing of incomplete information. In ICDE, 2007.
6
 
7
8
9
10
11
12
 
13
 
14
 
15
 
16
17
 
18
19
20
 
21
22
 
23
24
25
 
26
B. Kanagal and A. Deshpande. Online filtering, smoothing and probabilistic modeling of streaming data. In ICDE, 2008.
27
28
 
29
V. Ljosa and A. Singh. APLA: Indexing arbitrary probability distributions. In ICDE, 2007.
 
30
 
31
32
 
33
 
34
C. Re, N. Dalvi, and D. Suciu. Efficient top-k query evaluation on probalistic databases. In ICDE, 2007.
 
35
C. Re and D. Suciu. Materialized views in probabilistic databases for information exchange and query optimization. In VLDB, 2007.
 
36
 
37
P. Sen and A. Deshpande. Representing and querying correlated tuples in probabilistic databases. In ICDE, 2007.
 
38
S. Singh, C. Mayfield, S. Prabhakar, R. Shah, and S. Hambrusch. Indexing uncertain categorical data. In ICDE, 2007.
 
39
M. A. Soliman, I. F. Ilyas, and K. C. Chang. Top-k query processing in uncertain databases. In ICDE, 2007.
 
40