Removing manually generated boilerplate from electronic texts: experiments with project Gutenberg e-books
Source IBM Centre for Advanced Studies Conference archive
Proceedings of the 2007 conference of the center for advanced studies on Collaborative research
Richmond Hill, Ontario, Canada
SESSION: Databases, documents and files
Pages: 272 - 275  
Year of Publication: 2007
ISSN:1705-7361
Authors
Owen Kaser  University of New Brunswick
Daniel Lemire  Université du Québec à Montréal
Sponsors
IBM Toronto Software Lab
IBM Centers for Advanced Studies (CAS)
Publisher
ACM  New York, NY, USA
DOI: http://doi.acm.org/10.1145/1321211.1321246

ABSTRACT

Collaborative work on unstructured or semi-structured documents, such as in literature corpora or source code, often involves agreed-upon templates containing metadata. These templates are not consistent across users and over time. Rule-based parsing of these templates is expensive to maintain and tends to fail as new documents are added. Statistical techniques based on frequent occurrences have the potential to automatically identify a large fraction of the templates, thus reducing the burden on programmers. We investigate the case of the Project Gutenberg™ corpus, where most documents are in ASCII format with preambles and epilogues that are often copied and pasted or manually typed. We show that a statistical approach can solve most cases, though some documents require knowledge of English. We also survey various technical solutions that make our approach applicable to large data sets.
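
To make the idea of "statistical techniques based on frequent occurrences" concrete, here is a minimal Python sketch. It is not the authors' implementation; it only assumes that a line recurring in many documents is likely boilerplate, and that preambles and epilogues sit at the start and end of a file. The function names, the `min_docs` threshold, and the three toy texts are hypothetical and chosen for illustration only.

```python
from collections import Counter


def normalize(line):
    # Collapse case and whitespace so hand-typed variants compare equal.
    return " ".join(line.lower().split())


def frequent_lines(docs, min_docs):
    """Return normalized lines that occur in at least min_docs documents."""
    counts = Counter()
    for doc in docs:
        # Count each distinct line once per document, ignoring blank lines.
        seen = {normalize(line) for line in doc.splitlines()}
        seen.discard("")
        counts.update(seen)
    return {line for line, n in counts.items() if n >= min_docs}


def strip_boilerplate(doc, frequent):
    """Drop leading and trailing lines that are blank or frequent."""
    lines = doc.splitlines()

    def is_boilerplate(line):
        key = normalize(line)
        return key == "" or key in frequent

    start, end = 0, len(lines)
    while start < end and is_boilerplate(lines[start]):
        start += 1
    while end > start and is_boilerplate(lines[end - 1]):
        end -= 1
    return "\n".join(lines[start:end])


# Toy corpus (invented text, not real Project Gutenberg preambles).
docs = [
    "The Project Gutenberg EBook\n\nCall me Ishmael.\nSome years ago.\n\n"
    "End of Project Gutenberg EBook",
    "The Project Gutenberg EBook\n\nIt was the best of times.\n\n"
    "END OF PROJECT GUTENBERG EBOOK",
    "the project gutenberg  ebook\nAlice was beginning to get very tired.\n"
    "End of Project Gutenberg EBook",
]

freq = frequent_lines(docs, min_docs=2)
body = strip_boilerplate(docs[0], freq)
```

On this toy corpus, `freq` holds the two shared header and footer lines, and `body` keeps only the two lines of story text. A sketch like this holds every distinct line in memory and would miss boilerplate that varies from file to file, which is consistent with the abstract's remark that large data sets need further technical solutions and that some documents require knowledge of English.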

