SpotSigs: robust and efficient near duplicate detection in large web collections
Source
Annual ACM Conference on Research and Development in Information Retrieval
Proceedings of the 31st annual international ACM SIGIR conference on Research and development in information retrieval
Singapore, Singapore
SESSION: Content analysis
Pages 563-570
Year of Publication: 2008
ISBN: 978-1-60558-164-4
Authors
Martin Theobald, Stanford University, Stanford, CA, USA
Jonathan Siddharth, Stanford University, Stanford, CA, USA
Andreas Paepcke, Stanford University, Stanford, CA, USA
Sponsors
ACM: Association for Computing Machinery
SIGIR: ACM Special Interest Group on Information Retrieval
Publisher
ACM, New York, NY, USA
DOI: http://doi.acm.org/10.1145/1390334.1390431

ABSTRACT

Motivated by our work with political scientists who need to manually analyze large Web archives of news sites, we present SpotSigs, a new algorithm for extracting and matching signatures for near duplicate detection in large Web crawls. Our spot signatures are designed to favor natural-language portions of Web pages over advertisements and navigational bars.

The contributions of SpotSigs are twofold: 1) by combining stopword antecedents with short chains of adjacent content terms, we create robust document signatures with a natural ability to filter out noisy components of Web pages that would otherwise distract pure n-gram-based approaches such as Shingling; 2) we provide an exact and efficient, self-tuning matching algorithm that exploits a novel combination of collection partitioning and inverted index pruning for high-dimensional similarity search. Experiments confirm an increase in combined precision and recall of more than 24 percent over state-of-the-art approaches such as Shingling or I-Match and up to a factor of 3 faster execution times than Locality Sensitive Hashing (LSH), over a demonstrative "Gold Set" of manually assessed near-duplicate news articles as well as the TREC WT10g Web collection.
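
The first contribution can be illustrated with a minimal sketch. It is not the authors' implementation: the antecedent list, the stopword list, the chain length of two, the signature encoding, and the use of plain set-based Jaccard similarity as the matching measure are all assumptions made for illustration, since the abstract does not specify them. Names such as spot_signatures and jaccard are hypothetical.

```python
import re

# Assumed antecedent stopwords; the abstract does not say which ones are used.
ANTECEDENTS = {"a", "an", "the", "is", "was", "there", "do", "can", "did"}
# Assumed general stopword list; these words are skipped when building a chain.
STOPWORDS = ANTECEDENTS | {"of", "to", "in", "and", "on", "for", "at", "by", "that"}


def spot_signatures(text, chain_length=2):
    """Return the set of signatures 'antecedent:term1:...:termN' found in text.

    Each signature starts at an antecedent stopword and continues with the
    next chain_length non-stopword terms (chain_length is an assumed parameter).
    """
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    signatures = set()
    for i, token in enumerate(tokens):
        if token not in ANTECEDENTS:
            continue
        chain = []
        for following in tokens[i + 1:]:
            if following in STOPWORDS:
                continue  # only content terms enter the chain
            chain.append(following)
            if len(chain) == chain_length:
                break
        if len(chain) == chain_length:
            signatures.add(token + ":" + ":".join(chain))
    return signatures


def jaccard(a, b):
    """Set-based Jaccard similarity (an assumed matching measure)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


article = "The mayor said there is a new plan for the city budget."
navigation = "Home | News | Sports | Login | Subscribe"
page = navigation + " " + article + " " + navigation

# Navigation text contains no antecedent stopwords, so it yields no signatures,
# and the full page has the same signature set as the bare article.
print(spot_signatures(navigation))
print(jaccard(spot_signatures(article), spot_signatures(page)))
```

In this toy setting, the navigation bar contributes nothing to the signature set because it contains no stopwords to anchor on, which is the filtering effect the abstract attributes to stopword antecedents. A word n-gram approach would instead produce shingles from the navigation text as well.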

