CEBBIP: a parser of bibliographic information in Chinese electronic books
Source
International Conference on Digital Libraries
Proceedings of the 9th ACM/IEEE-CS joint conference on Digital libraries
Austin, TX, USA
Session: 2
Pages: 73-76
Year of Publication: 2009
ISBN: 978-1-60558-322-8
Authors
Liangcai Gao  Institute of Computer Science and Technology of Peking University, Beijing, China
Zhi Tang  Institute of Computer Science and Technology of Peking University, Beijing, China
Xiaofan Lin  Vobile Inc., Santa Clara, USA
Sponsors
SIGIR: ACM Special Interest Group on Information Retrieval
SIGWEB: ACM Special Interest Group on Hypertext, Hypermedia, and Web
ACM: Association for Computing Machinery
Publisher
ACM  New York, NY, USA
DOI: http://doi.acm.org/10.1145/1555400.1555412

ABSTRACT

Bibliographic information is essential for many digital library applications, such as citation analysis, academic search and topic discovery, and bibliographic data extraction has attracted a great deal of attention in recent years. In this paper, we address the problem of automatically extracting bibliographic data from Chinese electronic books and propose a tool called CEBBIP for the task. The tool includes three main systems: data preprocessing, data parsing and data postprocessing. In the data preprocessing system, the tool adopts a rule-based method to locate citation data in a book and to segment that data into citation strings, one for each referenced work. In the data parsing system, a learning-based approach, Conditional Random Fields (CRF), is employed to parse the citation strings. Finally, the tool takes advantage of the local format consistency intrinsic to a document to improve citation data segmentation and parsing through clustering techniques. CEBBIP has been used in a commercial e-book production system. Experimental results show that CEBBIP achieves very high precision. In particular, exploiting the document's intrinsic local format consistency clearly improves the accuracy of citation data segmentation and parsing.
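
To make the segmentation idea concrete, the following is a minimal sketch, not the CEBBIP implementation. It assumes that citation strings in a reference section begin with a numeric marker, and that most entries in one book use the same marker style. The marker patterns and the names `MARKER_STYLES`, `marker_style` and `segment_citations` are hypothetical; the abstract does not state the actual rules. A simple majority vote stands in for the clustering step: lines whose marker style disagrees with the dominant style are treated as continuations rather than new citations.

```python
import re
from collections import Counter

# Hypothetical marker styles (assumption: the real rules are not in the abstract).
MARKER_STYLES = {
    "bracket": re.compile(r"^\[(\d+)\]\s*"),  # [1] Author ...
    "dot": re.compile(r"^(\d+)\.\s+"),        # 1. Author ...
    "paren": re.compile(r"^\((\d+)\)\s*"),    # (1) Author ...
}


def marker_style(line):
    """Return the name of the marker style that starts this line, or None."""
    for name, pattern in MARKER_STYLES.items():
        if pattern.match(line):
            return name
    return None


def segment_citations(lines):
    """Split the lines of a reference section into individual citation strings.

    The dominant marker style is chosen by majority vote, a simple stand-in
    for exploiting local format consistency within one document.
    """
    stripped = [line.strip() for line in lines]
    styles = [marker_style(text) for text in stripped]
    counts = Counter(style for style in styles if style is not None)
    dominant = counts.most_common(1)[0][0] if counts else None

    citations = []
    for text, style in zip(stripped, styles):
        if not text:
            continue  # skip blank lines
        if dominant is not None and style == dominant:
            # A line in the dominant style starts a new citation string.
            citations.append(MARKER_STYLES[dominant].sub("", text, count=1))
        elif citations:
            # Anything else is treated as a continuation of the previous one.
            citations[-1] += " " + text
        else:
            citations.append(text)
    return citations


if __name__ == "__main__":
    sample = [
        "[1] Peng, F., and McCallum, A. 2004. Accurate information extraction",
        "from research papers using conditional random fields.",
        "[2] Seymore, K., McCallum, A., and Rosenfeld, R.",
        "1999. Learning hidden Markov model structure for information extraction.",
    ]
    for citation in segment_citations(sample):
        print(citation)
```

In the sample, the fourth line starts with "1999. " and so looks like a "dot"-style marker, but the bracket style dominates, so the line is merged into the preceding citation instead of starting a new one. Each resulting citation string would then be passed to a sequence labeller such as a CRF for field-level parsing, which is outside the scope of this sketch.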


