ACM Home Page
Please provide us with feedback. Feedback
Finding duplicates in a data stream
Full text PdfPdf (508 KB)
Source Symposium on Discrete Algorithms archive
Proceedings of the Nineteenth Annual ACM -SIAM Symposium on Discrete Algorithms table of contents
New York, New York
Pages 402-411  
Year of Publication: 2009
Authors
Parikshit Gopalan  University of Washington & Microsoft Research SVC
Jaikumar Radhakrishnan  TIFR, Mumbai
Publisher
Society for Industrial and Applied Mathematics  Philadelphia, PA, USA
Bibliometrics
Downloads (6 Weeks): 15,   Downloads (12 Months): 131,   Citation Count: 0
Additional Information:

abstract   references   index terms   collaborative colleagues  

Tools and Actions: Review this Article  

ABSTRACT

Given a data stream of length n over an alphabet [m] where n > m, we consider the problem of finding a duplicate in a single pass. We give a randomized algorithm for this problem that uses O((log m)3) space. This answers a question of Muthukrishnan [Mut05] and Tarui [Tar07], who asked if this problem could be solved using sub-linear space and one pass over the input. Our algorithm solves the more general problem of finding a positive frequency element in a stream given by frequency updates where the sum of all frequencies is positive. Our main tool is an Isolation Lemma that reduces this problem to the task of detecting and identifying a Dictatorial variable in a Boolean halfspace. We present various relaxations of the condition n > m, under which one can find duplicates efficiently.


REFERENCES

Note: OCR errors may be found in this Reference List extracted from the full text article. ACM has opted to expose the complete List rather than only correct and linked references.

 
1
2
3
 
4
 
5
{Cha08} M. Charikar. Top-k frequent item maintenance over streams. Survey article, 2008.
 
6
{CKS03} A. Chakrabarti, S. Khot, and X. Sun. Near-optimal lower bounds on the multiparty communication complexity of set-disjointness. In Proc. 18th IEEE Conference on Computational Complexity, pages 107--117, 2003.
7
 
8
{For05} L. Fortnow. Finding duplicates: Post on the Computational Complexity weblog, March 4th 2005.
 
9
10
 
11
 
12
13
 
14
{Mut05} S. Muthukrishnan. Data Streams: Algorithms and Applications. Now Publishers Inc., 2005.
 
15
{Nis92} N. Nisan. Pseudorandom generators for space-bounded computation. Combinatorica, 12(4):449--461, 1992.
 
16
 
17
{Tar07} J. Tarui. Finding a duplicate and a missing item in a stream. In TAMC, pages 128--135, 2007.

Collaborative Colleagues:
Parikshit Gopalan: colleagues
Jaikumar Radhakrishnan: colleagues