Combining DOM tree and geometric layout analysis for online medical journal article segmentation
Source
International Conference on Digital Libraries
Proceedings of the 6th ACM/IEEE-CS joint conference on Digital libraries
Chapel Hill, NC, USA
SESSION: Document analysis
Pages: 119 - 128
Year of Publication: 2006
ISBN: 1-59593-354-9
Authors
Jie Zou, National Library of Medicine, Bethesda, MD
Daniel Le, National Library of Medicine, Bethesda, MD
George R. Thoma, National Library of Medicine, Bethesda, MD
Bibliometrics
Downloads (6 Weeks): 4, Downloads (12 Months): 43, Citation Count: 2
ABSTRACT
We describe an HTML web page segmentation algorithm and apply it to online medical journal articles (both regular HTML and PDF-converted HTML files). The web page content is modeled by a zone tree structure based primarily on the geometric layout of the page. For a given journal article, a zone tree is generated by combining DOM tree analysis with the recursive X-Y cut algorithm. Using additional visual cues, such as background color, font size, and font color, the page is then segmented into homogeneous regions. The evaluation covers 104 articles from 11 journals: of 9726 ground-truth zones, 9376 are correctly segmented, for an accuracy of 96.40%. Segmenting the entire web page into zones can significantly speed up the subsequent information retrieval steps and make them more accurate.
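The abstract names the recursive X-Y cut algorithm but gives no details of how the authors apply it. The following is a minimal sketch of the generic technique only, not the paper's implementation: it assumes each rendered element is already available as a bounding box `(left, top, right, bottom)`, and the names `split_along`, `xy_cut`, and the `min_gap` threshold are hypothetical. It does not use the DOM tree or visual cues such as color and font, which the paper combines with the geometric analysis.

```python
from typing import List, Tuple

# Assumed representation: (left, top, right, bottom) in page coordinates.
Box = Tuple[float, float, float, float]


def split_along(boxes: List[Box], axis: str, min_gap: float) -> List[List[Box]]:
    """Group boxes separated by empty gaps of at least min_gap along one axis."""
    lo, hi = (0, 2) if axis == "x" else (1, 3)
    ordered = sorted(boxes, key=lambda b: b[lo])
    groups = [[ordered[0]]]
    reach = ordered[0][hi]  # furthest extent covered by the current group
    for box in ordered[1:]:
        if box[lo] - reach >= min_gap:
            groups.append([box])  # a wide enough gap: start a new group
        else:
            groups[-1].append(box)
        reach = max(reach, box[hi])
    return groups


def xy_cut(boxes: List[Box], min_gap: float = 10.0) -> List[List[Box]]:
    """Recursively cut the page into zones; each zone is a list of boxes."""
    if len(boxes) <= 1:
        return [boxes] if boxes else []
    # Try horizontal cuts (gaps along y) first, then vertical cuts (gaps along x).
    for axis in ("y", "x"):
        groups = split_along(boxes, axis, min_gap)
        if len(groups) > 1:
            zones: List[List[Box]] = []
            for group in groups:
                zones.extend(xy_cut(group, min_gap))
            return zones
    return [boxes]  # no cut possible: this is a leaf zone


if __name__ == "__main__":
    page = [
        (0, 0, 200, 20),     # full-width header
        (0, 40, 90, 60),     # left column, first block
        (0, 62, 90, 80),     # left column, second block
        (110, 40, 200, 60),  # right column, first block
        (110, 62, 200, 80),  # right column, second block
    ]
    for zone in xy_cut(page):
        print(zone)
```

This sketch returns only the leaf zones. Recording each cut as an internal node instead would yield a tree comparable in spirit to the zone tree the abstract describes.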
CITED BY 2
Emmanuel Bruno, Nicolas Faessel, Hervé Glotin, Jacques Le Maitre, Michel Scholl, Indexing by permeability in block structured web pages, Proceedings of the 9th ACM symposium on Document engineering, September 16-18, 2009, Munich, Germany