Combining DOM tree and geometric layout analysis for online medical journal article segmentation
Source: International Conference on Digital Libraries
Proceedings of the 6th ACM/IEEE-CS Joint Conference on Digital Libraries
Chapel Hill, NC, USA
SESSION: Document analysis
Pages: 119 - 128  
Year of Publication: 2006
ISBN:1-59593-354-9
Authors
Jie Zou, National Library of Medicine, Bethesda, MD
Daniel Le, National Library of Medicine, Bethesda, MD
George R. Thoma, National Library of Medicine, Bethesda, MD
Sponsors
ACM: Association for Computing Machinery
SIGIR: ACM Special Interest Group on Information Retrieval
SIGWEB: ACM Special Interest Group on Hypertext, Hypermedia, and Web
Publisher
ACM, New York, NY, USA
Bibliometrics
Downloads (6 Weeks): 4,   Downloads (12 Months): 43,   Citation Count: 2
DOI: http://doi.acm.org/10.1145/1141753.1141777

ABSTRACT

We describe an HTML web page segmentation algorithm and apply it to online medical journal articles (regular HTML and PDF-converted HTML files). The web page content is modeled by a zone tree structure based primarily on the geometric layout of the page. For a given journal article, a zone tree is generated by combining DOM tree analysis with the recursive X-Y cut algorithm. Using additional visual cues, such as background color, font size, and font color, the page is then segmented into homogeneous regions. The evaluation uses 104 articles from 11 journals: of 9726 ground-truth zones, 9376 are correctly segmented, for an accuracy of 96.40%. Segmenting the entire web page into zones can significantly speed up the subsequent information retrieval steps and increase their accuracy.
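
The recursive X-Y cut step named in the abstract can be illustrated with a minimal sketch. This is a generic textbook-style X-Y cut over element bounding boxes, not the authors' implementation: their method also uses DOM tree analysis and visual cues, which are omitted here. The names `Zone`, `xy_cut`, `split_on_gaps`, `leaves`, and the `min_gap` threshold are assumptions made for illustration only.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

# Hypothetical box format: (left, top, right, bottom) in page coordinates.
Box = Tuple[float, float, float, float]


@dataclass
class Zone:
    """A node of an illustrative zone tree: the boxes it covers and its sub-zones."""
    boxes: List[Box]
    children: List["Zone"] = field(default_factory=list)

    @property
    def bounds(self) -> Box:
        return (min(b[0] for b in self.boxes), min(b[1] for b in self.boxes),
                max(b[2] for b in self.boxes), max(b[3] for b in self.boxes))


def split_on_gaps(boxes: List[Box], lo: int, hi: int, min_gap: float) -> List[List[Box]]:
    """Group boxes along one axis, starting a new group wherever the
    projection profile has an empty gap of at least min_gap.
    lo/hi are the tuple indices of the start/end coordinate on that axis."""
    ordered = sorted(boxes, key=lambda b: b[lo])
    groups = [[ordered[0]]]
    reach = ordered[0][hi]  # farthest extent covered so far on this axis
    for b in ordered[1:]:
        if b[lo] - reach >= min_gap:
            groups.append([b])
        else:
            groups[-1].append(b)
        reach = max(reach, b[hi])
    return groups


def xy_cut(boxes: List[Box], min_gap: float = 10.0) -> Zone:
    """Recursively cut the page: try horizontal cuts (gaps along y) first,
    then vertical cuts (gaps along x); stop when neither axis has a gap."""
    zone = Zone(list(boxes))
    if len(boxes) < 2:
        return zone
    for lo, hi in ((1, 3), (0, 2)):  # (top, bottom) then (left, right)
        groups = split_on_gaps(boxes, lo, hi, min_gap)
        if len(groups) > 1:
            zone.children = [xy_cut(g, min_gap) for g in groups]
            break
    return zone


def leaves(zone: Zone) -> List[Zone]:
    """Collect the leaf zones, i.e. the final segmented regions."""
    if not zone.children:
        return [zone]
    return [leaf for child in zone.children for leaf in leaves(child)]


if __name__ == "__main__":
    # A made-up page: a title bar above a two-column body.
    page = [(0, 0, 200, 20), (0, 40, 90, 60), (0, 62, 90, 80), (110, 40, 200, 80)]
    for z in leaves(xy_cut(page)):
        print(z.bounds)
```

On the made-up page in the usage example, the first horizontal cut separates the title from the body, and a vertical cut then splits the body into two columns. The two left-column boxes stay in one zone because the 2-unit gap between them is below `min_gap`.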



