ABSTRACT
Large-volume public comment campaigns and web portals that encourage the public to customize form letters produce many near-duplicate documents. This increases processing and storage costs, but is rarely a serious problem. A more serious concern is that form letter customizations can include substantive issues that agencies are likely to overlook. The identification of exact- and near-duplicate texts, and the recognition of unique text within near-duplicate documents, are important components of data cleaning and integration processes for eRulemaking. This paper presents DURIAN (DUplicate Removal In lArge collectioN), a refinement of a prior near-duplicate detection algorithm. DURIAN uses a traditional bag-of-words document representation, document attributes ("metadata"), and document content structure to identify form letters and their edited copies in public comment collections. Experimental results demonstrate that DURIAN is about as effective as human assessors. The paper concludes by discussing challenges to moving near-duplicate detection into operational rulemaking environments.