|
||||||||||||||||||||||
|
||||||||||||||||||||||
ABSTRACT
Recently, the move from paper to electronic public comments makes it much easier for individuals to customize form letters while harder for agencies to identify substantive information since there are many near-duplicate comments that express the same viewpoint in slightly different language. The identification of exact- and near-duplicate texts, and recognition of unique text within near-duplicate documents, is an important component of data cleaning and integration processes for eRulemaking.This brief paper describes a demonstration of a near-duplicate detection system, DURIAN (DUplicate Removal In lArge collectioN), that identifies and organizes the near-duplicates for eRulemaking applications. REFERENCES
Note: OCR errors may be found in this Reference List extracted from the full text article. ACM has opted to expose the complete List rather than only correct and linked references.
INDEX TERMS
Primary Classification:
General Terms:
Keywords:
|
||||||||||||||||||||||