|
ABSTRACT
We consider the problem of finding duplicates in data streams. Duplicate detection in data streams is utilized in various applications including fraud detection. We develop a solution based on Bloom Filters [9], and discuss the space and time requirements for running the proposed algorithm in both the contexts of sliding, and landmark stream windows. We run a comprehensive set of experiments, using both real and synthetic click streams, to evaluate the performance of the proposed solution. The results demonstrate that the proposed solution yields extremely low error rates.
REFERENCES
Note: OCR errors may be found in this Reference List extracted from the full text article. ACM has opted to expose the complete List rather than only correct and linked references.
 |
1
|
Noga Alon , Yossi Matias , Mario Szegedy, The space complexity of approximating the frequency moments, Proceedings of the twenty-eighth annual ACM symposium on Theory of computing, p.20-29, May 22-24, 1996, Philadelphia, Pennsylvania, United States
[doi> 10.1145/237814.237823]
|
| |
2
|
R. Ananthakrishna, S. Chaudhuri, and V. Ganti. Eliminating Fuzzy Duplicates in Data Warehouses. In Proceedings of the 28th ACM VLDB International Conference on Very Large Databases, pages 586--597, 2002.
|
| |
3
|
Vinod Anupam , Alain Mayer , Kobbi Nissim , Benny Pinkas , Michael K. Reiter, On the security of pay-per-click and other Web advertising schemes, Proceeding of the eighth international conference on World Wide Web, p.1091-1100, May 1999, Toronto, Canada
|
| |
4
|
A. Arasu, S. Babu, and J. Widom. CQL: A Language for Continuous Queries over Streams and Relations. In Proceedings of the 9th DBPL International Conference on Data Base and Programming Languages, pages 1--11, 2003.
|
| |
5
|
A. Arasu, S. Babu, and J. Widom. The CQL Continuous Query Language: Semantic Foundations and Query Execution. Technical Report 2002-67, Stanford University, 2003.
|
 |
6
|
|
 |
7
|
|
 |
8
|
|
 |
9
|
|
| |
10
|
|
| |
11
|
P. Bose, E. Kranakis, P. Morin, and Y. Tang. Bounds for Frequency Estimation of Packet Streams. In Proceedings of the 10th SIROCCO International Colloquium on Structural Information and Communication Complexity, pages 33--42, 2003.
|
| |
12
|
|
 |
13
|
Jianjun Chen , David J. DeWitt , Feng Tian , Yuan Wang, NiagaraCQ: a scalable continuous query system for Internet databases, Proceedings of the 2000 ACM SIGMOD international conference on Management of data, p.379-390, May 15-18, 2000, Dallas, Texas, United States
|
 |
14
|
|
 |
15
|
|
| |
16
|
G. Cormode, F. Korn, S. Muthukrishnan, and D. Srivastava. Finding Hierarchical Heavy Hitters in Data Streams. In Proceedings of the 29th ACM VLDB International Conference on Very Large Data Bases, pages 464--475, 2003.
|
 |
17
|
|
 |
18
|
|
 |
19
|
Corinna Cortes , Kathleen Fisher , Daryl Pregibon , Anne Rogers, Hancock: a language for extracting signatures from data streams, Proceedings of the sixth ACM SIGKDD international conference on Knowledge discovery and data mining, p.9-17, August 20-23, 2000, Boston, Massachusetts, United States
[doi> 10.1145/347090.347094]
|
| |
20
|
Mayur Datar , Aristides Gionis , Piotr Indyk , Rajeev Motwani, Maintaining stream statistics over sliding windows: (extended abstract), Proceedings of the thirteenth annual ACM-SIAM symposium on Discrete algorithms, p.635-644, January 06-08, 2002, San Francisco, California
|
| |
21
|
|
 |
22
|
|
| |
23
|
|
| |
24
|
|
| |
25
|
|
| |
26
|
|
| |
27
|
|
 |
28
|
Johannes Gehrke , Flip Korn , Divesh Srivastava, On computing correlated aggregates over continual data streams, Proceedings of the 2001 ACM SIGMOD international conference on Management of data, p.13-24, May 21-24, 2001, Santa Barbara, California, United States
|
 |
29
|
|
| |
30
|
|
 |
31
|
Lukasz Golab , David DeHaan , Erik D. Demaine , Alejandro Lopez-Ortiz , J. Ian Munro, Identifying frequent items in sliding windows over on-line packet streams, Proceedings of the 3rd ACM SIGCOMM conference on Internet measurement, October 27-29, 2003, Miami Beach, FL, USA
[doi> 10.1145/948205.948227]
|
| |
32
|
L. Golab, S. Garg, and M. Ozsu. On Indexing Sliding Windows over On-Line Data Streams. In Proceedings of the 9th EDBT International Conference on Extending Database Technology, pages 712--729, 2004.
|
 |
33
|
|
 |
34
|
|
| |
35
|
|
 |
36
|
|
 |
37
|
|
 |
38
|
Pankaj Gupta , Nick McKeown, Packet classification on multiple fields, Proceedings of the conference on Applications, technologies, architectures, and protocols for computer communication, p.147-160, August 30-September 03, 1999, Cambridge, Massachusetts, United States
|
| |
39
|
|
| |
40
|
|
 |
41
|
Cheqing Jin , Weining Qian , Chaofeng Sha , Jeffrey X. Yu , Aoying Zhou, Dynamically maintaining frequent items over a data stream, Proceedings of the twelfth international conference on Information and knowledge management, November 03-08, 2003, New Orleans, LA, USA
[doi> 10.1145/956863.956918]
|
 |
42
|
|
| |
43
|
|
| |
44
|
|
| |
45
|
|
| |
46
|
|
| |
47
|
|
| |
48
|
G. Manku and R. Motwani. Approximate Frequency Counts over Data Streams. In Proceedings of the 28th ACM VLDB International Conference on Very Large Data Bases, pages 346--357, 2002.
|
 |
49
|
Gurmeet Singh Manku , Sridhar Rajagopalan , Bruce G. Lindsay, Random sampling techniques for space efficient online computation of order statistics of large datasets, Proceedings of the 1999 ACM SIGMOD international conference on Management of data, p.251-262, May 31-June 03, 1999, Philadelphia, Pennsylvania, United States
|
| |
50
|
|
| |
51
|
A. Metwally, D. Agrawal, and A. El Abbadi. Efficient Computation of Frequent and Top-k Elements in Data Streams. In Proceedings of the 10th ICDT International Conference on Database Theory, pages 398--412, 2005.
|
| |
52
|
A. Monge. Matching Algorithms within a Duplicate Detection System. IEEE Data Engineering Bulletin, 23(4):14--20, 2000.
|
| |
53
|
A. Monge and C. Elkan. An Efficient Domain-Independent Algorithm for Detecting Approximately Duplicate Database Records. In Proceedings of ACM SIGMOD Workshop on Research Issues on Data Mining and Knowledge Discovery, pages 23--29, 1997.
|
| |
54
|
M. Reiter, V. Anupam, and A. Mayer. Detecting Hit-Shaving in Click-Through Payment Schemes. In Proceedings of the 3rd USENIX Workshop on Electronic Commerce, pages 155--166, 1998.
|
| |
55
|
Z. Tian, H. Lu, W. Ji, A. Zhou, and Z. Tian. An N-gram-Based Approach for Detecting Approximately Duplicate Database Records. International Journal on Digital Library, 3(4):325--331, 2002.
|
 |
56
|
|
| |
57
|
S. Ye, R. Song, J. Wen, and W. Ma. A Query-Dependent Duplicate Detection Approach for Large Scale Search Engines. In Proceedings of the 6th Asia-Pacific Web Conference, pages 48--58, 2004.
|
| |
58
|
Y. Zhu and D. Shasha. StatStream: Statistical Monitoring of Thousands of Data Streams in Real Time. In Proceedings of the 28th ACM VLDB International Conference on Very Large Data Bases, pages 358--369, 2002.
|
|