Detecting phrase-level duplication on the world wide web
Source: Annual ACM Conference on Research and Development in Information Retrieval
Proceedings of the 28th annual international ACM SIGIR conference on Research and development in information retrieval
Salvador, Brazil
SESSION: Web search 1
Pages: 170-177
Year of Publication: 2005
ISBN: 1-59593-034-5
Authors
Dennis Fetterly, Microsoft Research, Mountain View, CA
Mark Manasse, Microsoft Research, Mountain View, CA
Marc Najork, Microsoft Research, Mountain View, CA
Sponsor
SIGIR: ACM Special Interest Group on Information Retrieval
Publisher
ACM, New York, NY, USA
Bibliometrics
Downloads (6 Weeks): 10,   Downloads (12 Months): 68,   Citation Count: 21
DOI: http://doi.acm.org/10.1145/1076034.1076066

ABSTRACT

Two years ago, we conducted a study on the evolution of web pages over time. In the course of that study, we discovered a large number of machine-generated "spam" web pages emanating from a handful of web servers in Germany. These spam web pages were dynamically assembled by stitching together grammatically well-formed German sentences drawn from a large collection of sentences. This discovery motivated us to develop techniques for finding other instances of such "slice and dice" generation of web pages, where pages are automatically generated by stitching together phrases drawn from a limited corpus. We applied these techniques to two data sets: a set of 151 million web pages collected in December 2002 and a set of 96 million web pages collected in June 2004. We found a number of other instances of large-scale phrase-level replication within the two data sets. This paper describes the algorithms we used to discover this type of replication and highlights the results of our data mining.
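
The abstract does not spell out the algorithms, so the following is only an illustrative sketch of what phrase-level duplication means in practice, not the authors' method. It assumes that a "phrase" is a run of k consecutive words (k = 5 is an arbitrary choice here), uses a truncated SHA-1 hash as a stand-in for a proper fingerprinting function, and reports, for each page, the fraction of its phrases that also appear on some other page. The function names `phrase_fingerprints` and `shared_phrase_fraction` are hypothetical.

```python
import hashlib
from collections import defaultdict


def phrase_fingerprints(text, k=5):
    """Return the set of 64-bit fingerprints of all k-word phrases in text.

    Assumption: phrases are overlapping windows of k whitespace-separated,
    lowercased words; SHA-1 truncated to 8 bytes stands in for a real
    fingerprinting scheme.
    """
    words = text.lower().split()
    prints = set()
    for i in range(len(words) - k + 1):
        phrase = " ".join(words[i:i + k])
        digest = hashlib.sha1(phrase.encode("utf-8")).digest()
        prints.add(int.from_bytes(digest[:8], "big"))
    return prints


def shared_phrase_fraction(pages, k=5):
    """Map each page name to the fraction of its distinct k-word phrases
    that also occur on at least one other page."""
    owners = defaultdict(set)   # fingerprint -> names of pages containing it
    per_page = {}
    for name, text in pages.items():
        per_page[name] = phrase_fingerprints(text, k)
        for fp in per_page[name]:
            owners[fp].add(name)

    result = {}
    for name, prints in per_page.items():
        if not prints:
            result[name] = 0.0  # page shorter than k words
            continue
        shared = sum(1 for fp in prints if len(owners[fp]) > 1)
        result[name] = shared / len(prints)
    return result


if __name__ == "__main__":
    pages = {
        "a": "the quick brown fox jumps over the lazy dog today",
        "b": "the quick brown fox jumps while the cat sleeps nearby",
        "c": "completely different words appear in this short example page here",
    }
    for name, fraction in sorted(shared_phrase_fraction(pages).items()):
        print(name, round(fraction, 3))
```

A page stitched together from a limited corpus of sentences would score close to 1.0 under this measure, while an original page would score close to 0.0. At the scale mentioned in the abstract (on the order of a hundred million pages), keeping every fingerprint in memory as done here would not be practical; some form of sampling would be needed, which this sketch leaves out.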


