De-duping URLs via rewrite rules
Source
International Conference on Knowledge Discovery and Data Mining
Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining
Las Vegas, Nevada, USA
Session: Research papers
Pages 186-194
Year of Publication: 2008
ISBN: 978-1-60558-193-4
Authors
Anirban Dasgupta  Yahoo!, Sunnyvale, CA, USA
Ravi Kumar  Yahoo!, Sunnyvale, CA, USA
Amit Sasturkar  Yahoo!, Sunnyvale, CA, USA
Sponsors
ACM: Association for Computing Machinery
SIGKDD: ACM Special Interest Group on Knowledge Discovery in Data
SIGMOD: ACM Special Interest Group on Management of Data
Publisher
ACM  New York, NY, USA
DOI: http://doi.acm.org/10.1145/1401890.1401917

ABSTRACT

A large fraction of the URLs on the web contain duplicate (or near-duplicate) content. De-duping URLs is an extremely important problem for search engines, since all the principal functions of a search engine, including crawling, indexing, ranking, and presentation, are adversely impacted by the presence of duplicate URLs. Traditionally, the de-duping problem has been addressed by fetching and examining the content of the URL; our approach here is different. Given a set of URLs partitioned into equivalence classes based on the content (URLs in the same equivalence class have similar content), we address the problem of mining this set and learning URL rewrite rules that transform all URLs of an equivalence class to the same canonical form. These rewrite rules can then be applied to eliminate duplicates among URLs that are encountered for the first time during crawling, even without fetching their content.
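
To make the final step concrete, here is a minimal sketch of how already-learned rewrite rules might be applied to newly crawled URLs so that duplicates collapse to one canonical form without any content being fetched. The rules used here (ignoring the query parameters `sid` and `ref`, lowercasing the host, trimming a trailing slash) and the example URLs are hypothetical illustrations, not rules taken from the paper.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

# Hypothetical learned rule: these query parameters are assumed not to
# affect page content on the site in question.
IRRELEVANT_PARAMS = {"sid", "ref"}

def canonicalize(url: str) -> str:
    """Rewrite a URL to a canonical form using the assumed rules above."""
    parts = urlsplit(url)
    # Drop parameters assumed irrelevant, then sort the rest so that
    # parameter order does not create spurious differences.
    kept = sorted(
        (k, v)
        for k, v in parse_qsl(parts.query, keep_blank_values=True)
        if k not in IRRELEVANT_PARAMS
    )
    # Treat "/a/" and "/a" as the same path (another assumed rule).
    path = parts.path.rstrip("/") or "/"
    # The fragment is discarded because it is never sent to the server.
    return urlunsplit(
        (parts.scheme.lower(), parts.netloc.lower(), path, urlencode(kept), "")
    )

def dedupe(urls):
    """Keep the first URL seen for each canonical form."""
    first_seen = {}
    for url in urls:
        first_seen.setdefault(canonicalize(url), url)
    return list(first_seen.values())

crawled = [
    "http://Example.com/a/?sid=123&id=7",
    "http://example.com/a?id=7&ref=home",
    "http://example.com/b?id=7",
]
print(dedupe(crawled))
```

Under these assumed rules the first two URLs map to the same canonical string, so only one of them would need to be fetched.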

In order to express such transformation rules, we propose a simple framework that is general enough to capture the most common URL rewrite patterns occurring on the web; in particular, it encapsulates the DUST (Different URLs with similar text) framework [5]. We provide an efficient algorithm for mining and learning URL rewrite rules and show that under mild assumptions, it is complete, i.e., our algorithm learns every URL rewrite rule that is correct, for an appropriate notion of correctness. We demonstrate the expressiveness of our framework and the effectiveness of our algorithm by performing a variety of extensive large-scale experiments.
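
The abstract does not spell out the rule language or the mining algorithm, so the following sketch only illustrates the general shape of the task, not the authors' method. It assumes a deliberately narrow, hypothetical rule family ("ignore query parameter p") and a simple notion of correctness: a rule is kept only if applying it never maps URLs from different equivalence classes to the same string, and it merges at least two URLs. The function names and the sample data are invented for illustration.

```python
from collections import defaultdict
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

def drop_param(url: str, param: str) -> str:
    """Apply the candidate rule 'ignore query parameter `param`'."""
    parts = urlsplit(url)
    kept = [
        (k, v)
        for k, v in parse_qsl(parts.query, keep_blank_values=True)
        if k != param
    ]
    return urlunsplit(
        (parts.scheme, parts.netloc, parts.path, urlencode(kept), parts.fragment)
    )

def learn_droppable_params(labeled_urls):
    """labeled_urls maps each URL to its content equivalence class id.

    Returns the parameters whose removal is safe (never merges URLs from
    different classes) and useful (merges at least two distinct URLs).
    """
    params = {
        k
        for url in labeled_urls
        for k, _ in parse_qsl(urlsplit(url).query, keep_blank_values=True)
    }
    learned = set()
    for param in sorted(params):
        classes_by_target = defaultdict(set)
        sources_by_target = defaultdict(set)
        for url, class_id in labeled_urls.items():
            target = drop_param(url, param)
            classes_by_target[target].add(class_id)
            sources_by_target[target].add(url)
        safe = all(len(c) == 1 for c in classes_by_target.values())
        useful = any(len(s) > 1 for s in sources_by_target.values())
        if safe and useful:
            learned.add(param)
    return learned

# Hypothetical training data: URLs with the same class id have similar content.
training = {
    "http://x.com/p?id=1&sid=a": 0,
    "http://x.com/p?id=1&sid=b": 0,
    "http://x.com/p?id=2&sid=a": 1,
    "http://x.com/p?id=2&sid=c": 1,
}
print(learn_droppable_params(training))
```

In this toy data, dropping `sid` merges URLs only within a class, while dropping `id` would merge URLs from different classes, so only `sid` is retained.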


REFERENCES

Note: OCR errors may be found in this Reference List extracted from the full text article. ACM has opted to expose the complete List rather than only correct and linked references.

 
1
2
3
4
5
6  M. Bognar. A survey of abstract rewriting, 1995. www.di.ubi.pt/~desousa/1998-1999/logica/mb.ps.
7
8
9
10
11
12
13
14
15
16  M. Najork. Systems and methods for inferring uniform resource locator (URL) normalization rules, 2006. US Patent Application Publication 2006/0218143.
17
