Duplicate detection in click streams
Source: International World Wide Web Conference
Proceedings of the 14th International Conference on World Wide Web
Chiba, Japan
SESSION: Usage analysis
Pages: 12 - 21  
Year of Publication: 2005
ISBN:1-59593-046-9
Authors
Ahmed Metwally  University of California at Santa Barbara, Santa Barbara, CA
Divyakant Agrawal  University of California at Santa Barbara, Santa Barbara, CA
Amr El Abbadi  University of California at Santa Barbara, Santa Barbara, CA
Sponsor
ACM: Association for Computing Machinery
Publisher
ACM  New York, NY, USA
DOI: http://doi.acm.org/10.1145/1060745.1060753

ABSTRACT

We consider the problem of finding duplicates in data streams. Duplicate detection in data streams is used in various applications, including fraud detection. We develop a solution based on Bloom Filters [9], and discuss the space and time requirements for running the proposed algorithm in the contexts of both sliding and landmark stream windows. We run a comprehensive set of experiments, using both real and synthetic click streams, to evaluate the performance of the proposed solution. The results demonstrate that the proposed solution yields extremely low error rates.
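
To make the idea concrete, the following is a minimal sketch of duplicate detection with a classical Bloom filter over a single landmark window. It is not the algorithm proposed in the paper: the names `BloomFilter`, `seen_then_add`, and `flag_duplicates`, the SHA-256 double-hashing scheme, and the default sizes are all assumptions chosen for illustration.

```python
import hashlib


class BloomFilter:
    """Classical Bloom filter with num_bits bits and num_hashes hash functions."""

    def __init__(self, num_bits, num_hashes):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item):
        # Derive the bit positions from two 64-bit halves of one SHA-256
        # digest (double hashing). Illustrative choice, not from the paper.
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        return [(h1 + i * h2) % self.num_bits for i in range(self.num_hashes)]

    def seen_then_add(self, item):
        """Return True if item was probably seen before, then record it."""
        positions = self._positions(item)
        seen = all(self.bits[p >> 3] & (1 << (p & 7)) for p in positions)
        for p in positions:
            self.bits[p >> 3] |= 1 << (p & 7)
        return seen


def flag_duplicates(clicks, num_bits=1 << 16, num_hashes=4):
    """Flag probable duplicates among click IDs in one landmark window."""
    # A fresh filter per call models resetting state at each landmark.
    bloom = BloomFilter(num_bits, num_hashes)
    return [bloom.seen_then_add(click) for click in clicks]


flags = flag_duplicates(["ad1:userA", "ad1:userB", "ad1:userA"])
```

In this sketch a genuine repeat is always flagged, while a click seen for the first time may occasionally be flagged as well (a false positive), with a probability that depends on the filter size and the number of distinct clicks. A plain Bloom filter cannot forget expired elements, so a sliding window would need additional machinery beyond what is shown here.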


REFERENCES


[1]
[2] R. Ananthakrishna, S. Chaudhuri, and V. Ganti. Eliminating Fuzzy Duplicates in Data Warehouses. In Proceedings of the 28th ACM VLDB International Conference on Very Large Data Bases, pages 586--597, 2002.
[3]
[4] A. Arasu, S. Babu, and J. Widom. CQL: A Language for Continuous Queries over Streams and Relations. In Proceedings of the 9th DBPL International Conference on Data Base and Programming Languages, pages 1--11, 2003.
[5] A. Arasu, S. Babu, and J. Widom. The CQL Continuous Query Language: Semantic Foundations and Query Execution. Technical Report 2002-67, Stanford University, 2003.
[6]
[7]
[8]
[9]
[10]
[11] P. Bose, E. Kranakis, P. Morin, and Y. Tang. Bounds for Frequency Estimation of Packet Streams. In Proceedings of the 10th SIROCCO International Colloquium on Structural Information and Communication Complexity, pages 33--42, 2003.
[12]
[13]
[14]
[15]
[16] G. Cormode, F. Korn, S. Muthukrishnan, and D. Srivastava. Finding Hierarchical Heavy Hitters in Data Streams. In Proceedings of the 29th ACM VLDB International Conference on Very Large Data Bases, pages 464--475, 2003.
[17]
[18]
[19]
[20]
[21]
[22]
[23]
[24]
[25]
[26]
[27]
[28]
[29]
[30]
[31]
[32] L. Golab, S. Garg, and M. Ozsu. On Indexing Sliding Windows over On-Line Data Streams. In Proceedings of the 9th EDBT International Conference on Extending Database Technology, pages 712--729, 2004.
[33]
[34]
[35]
[36]
[37]
[38]
[39]
[40]
[41]
[42]
[43]
[44]
[45]
[46]
[47]
[48] G. Manku and R. Motwani. Approximate Frequency Counts over Data Streams. In Proceedings of the 28th ACM VLDB International Conference on Very Large Data Bases, pages 346--357, 2002.
[49]
[50]
[51] A. Metwally, D. Agrawal, and A. El Abbadi. Efficient Computation of Frequent and Top-k Elements in Data Streams. In Proceedings of the 10th ICDT International Conference on Database Theory, pages 398--412, 2005.
[52] A. Monge. Matching Algorithms within a Duplicate Detection System. IEEE Data Engineering Bulletin, 23(4):14--20, 2000.
[53] A. Monge and C. Elkan. An Efficient Domain-Independent Algorithm for Detecting Approximately Duplicate Database Records. In Proceedings of ACM SIGMOD Workshop on Research Issues on Data Mining and Knowledge Discovery, pages 23--29, 1997.
[54] M. Reiter, V. Anupam, and A. Mayer. Detecting Hit-Shaving in Click-Through Payment Schemes. In Proceedings of the 3rd USENIX Workshop on Electronic Commerce, pages 155--166, 1998.
[55] Z. Tian, H. Lu, W. Ji, A. Zhou, and Z. Tian. An N-gram-Based Approach for Detecting Approximately Duplicate Database Records. International Journal on Digital Libraries, 3(4):325--331, 2002.
[56]
[57] S. Ye, R. Song, J. Wen, and W. Ma. A Query-Dependent Duplicate Detection Approach for Large Scale Search Engines. In Proceedings of the 6th Asia-Pacific Web Conference, pages 48--58, 2004.
[58] Y. Zhu and D. Shasha. StatStream: Statistical Monitoring of Thousands of Data Streams in Real Time. In Proceedings of the 28th ACM VLDB International Conference on Very Large Data Bases, pages 358--369, 2002.
