Removing manually generated boilerplate from electronic texts: experiments with Project Gutenberg e-books
Source
IBM Centre for Advanced Studies Conference
Proceedings of the 2007 conference of the center for advanced studies on Collaborative research
Richmond Hill, Ontario, Canada
SESSION: Databases, documents and files
Pages: 272 - 275
Year of Publication: 2007
ISSN: 1705-7361
ABSTRACT
Collaborative work on unstructured or semi-structured documents, such as literature corpora or source code, often involves agreed-upon templates containing metadata. These templates are not consistent across users or over time. Rule-based parsing of these templates is expensive to maintain and tends to fail as new documents are added. Statistical techniques based on frequent occurrences have the potential to identify a large fraction of the templates automatically, thus reducing the burden on programmers. We investigate the case of the Project Gutenberg™ corpus, where most documents are in ASCII format with preambles and epilogues that are often copied and pasted or manually typed. We show that a statistical approach can solve most cases, though some documents require knowledge of English. We also survey various technical solutions that make our approach applicable to large data sets.
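As a rough illustration of what "statistical techniques based on frequent occurrences" could look like, the sketch below flags any line that appears in a large share of documents and then trims such lines from the start and end of a text. It is only a minimal sketch under stated assumptions, not the method used in the paper: the function names `frequent_lines` and `strip_boilerplate`, the `min_fraction` threshold, and the toy documents are all hypothetical, and exact in-memory counting is used here for simplicity rather than any technique suited to large data sets.

```python
from collections import Counter


def frequent_lines(documents, min_fraction=0.5):
    """Return the lines that occur in at least min_fraction of the documents.

    Hypothetical helper: min_fraction is an illustrative threshold,
    not a value taken from the paper.
    """
    doc_freq = Counter()
    for text in documents:
        # Count each distinct non-blank line once per document.
        unique = {line.strip() for line in text.splitlines() if line.strip()}
        doc_freq.update(unique)
    threshold = min_fraction * len(documents)
    return {line for line, count in doc_freq.items() if count >= threshold}


def strip_boilerplate(text, boilerplate):
    """Drop leading and trailing lines that are blank or flagged as boilerplate."""
    lines = text.splitlines()
    start, end = 0, len(lines)

    def removable(line):
        stripped = line.strip()
        return not stripped or stripped in boilerplate

    # Trim the preamble, then the epilogue; the body in between is untouched.
    while start < end and removable(lines[start]):
        start += 1
    while end > start and removable(lines[end - 1]):
        end -= 1
    return "\n".join(lines[start:end])


# Toy corpus (made up for illustration, not real e-book text).
docs = [
    "HEADER LINE\n\nBody one.\nFOOTER LINE",
    "HEADER LINE\nBody two.\n\nFOOTER LINE",
    "HEADER LINE\nBody three.\nFOOTER LINE",
]
flagged = frequent_lines(docs)
cleaned = [strip_boilerplate(d, flagged) for d in docs]
```

Exact matching of this kind would miss preambles that were retyped with small variations, which is consistent with the abstract's remark that some documents need more than a purely statistical treatment.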