Next steps in near-duplicate detection for eRulemaking
Source: dg.o; Vol. 151
Proceedings of the 2006 international conference on Digital government research
San Diego, California
SESSION: e-Rulemaking 2
Pages: 239 - 248
Year of Publication: 2006
Authors
Hui Yang  Carnegie Mellon University
Jamie Callan  Carnegie Mellon University
Stuart Shulman  University of Pittsburgh
Sponsor
NSF : National Science Foundation
Publisher
ACM  New York, NY, USA
DOI: http://doi.acm.org/10.1145/1146598.1146663

ABSTRACT

Large-volume public comment campaigns and web portals that encourage the public to customize form letters produce many near-duplicate documents. This increases processing and storage costs, but is rarely a serious problem. A more serious concern is that form letter customizations can include substantive issues that agencies are likely to overlook. The identification of exact- and near-duplicate texts, and the recognition of unique text within near-duplicate documents, are important components of data cleaning and integration processes for eRulemaking. This paper presents DURIAN (DUplicate Removal In lArge collectioN), a refinement of a prior near-duplicate detection algorithm. DURIAN uses a traditional bag-of-words document representation, document attributes ("metadata"), and document content structure to identify form letters and their edited copies in public comment collections. Experimental results demonstrate that DURIAN is about as effective as human assessors. The paper concludes by discussing challenges to moving near-duplicate detection into operational rulemaking environments.
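
The abstract gives no algorithmic detail, so the following is only a minimal sketch of the general idea, not DURIAN itself. It assumes a plain bag-of-words vector per document, cosine similarity against a known form letter, and a hypothetical threshold of 0.7; the function names (bag_of_words, classify, added_sentences) are invented for illustration, and the metadata and content-structure signals mentioned in the abstract are left out.

```python
import math
import re
from collections import Counter


def bag_of_words(text):
    """Lowercase the text and count word tokens (a plain bag-of-words)."""
    return Counter(re.findall(r"[a-z0-9']+", text.lower()))


def cosine_similarity(a, b):
    """Cosine similarity between two term-count vectors."""
    dot = sum(count * b[term] for term, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def normalize(text):
    """Collapse whitespace and case so trivial differences are ignored."""
    return " ".join(text.lower().split())


def classify(form_letter, comment, threshold=0.7):
    """Label a comment relative to a form letter.

    The threshold is an assumed value for this sketch, not one
    reported by the paper.
    """
    if normalize(form_letter) == normalize(comment):
        return "exact"
    score = cosine_similarity(bag_of_words(form_letter), bag_of_words(comment))
    return "near-duplicate" if score >= threshold else "distinct"


def split_sentences(text):
    """Naive sentence splitter: break after ., ! or ? followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def added_sentences(form_letter, comment):
    """Return sentences in the comment that do not appear in the form letter."""
    known = {normalize(s) for s in split_sentences(form_letter)}
    return [s for s in split_sentences(comment) if normalize(s) not in known]


if __name__ == "__main__":
    form = ("Please protect our wetlands. The proposed rule weakens "
            "existing safeguards. I urge you to withdraw it.")
    edited = form + " My family farm in Ohio depends on clean groundwater."
    print(classify(form, edited))
    print(added_sentences(form, edited))
```

In this sketch, added_sentences is what surfaces the customized text a reviewer would otherwise have to find by reading each near-duplicate in full, which is the concern the abstract raises about overlooked substantive issues.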



