Indexing by permeability in block structured web pages
Source
Document Engineering
Proceedings of the 9th ACM Symposium on Document Engineering
Munich, Germany
Session: Document analysis (I)
Pages 70-73
Year of Publication: 2009
ISBN: 978-1-60558-575-8
Authors
Emmanuel Bruno  Université du Sud Toulon-Var, La Garde, France
Nicolas Faessel  Université Paul Cézanne, Marseille, France
Hervé Glotin  Université du Sud Toulon-Var, La Garde, France
Jacques Le Maitre  Université du Sud Toulon-Var, La Garde, France
Michel Scholl  CNAM, Paris, France
Sponsors
SIGDOC: ACM Special Interest Group for Design of Communications
SIGWEB: ACM Special Interest Group on Hypertext, Hypermedia, and Web
ACM: Association for Computing Machinery
Publisher
ACM  New York, NY, USA
DOI: http://doi.acm.org/10.1145/1600193.1600209

ABSTRACT

We present in this paper a model that we have developed for indexing and querying web pages based on their visual rendering. In this model, pages are split into a set of visual blocks. The indexing of a block takes into account its content, its visual importance and, by permeability, the indexing of neighboring blocks. A page is modeled as a directed acyclic graph. Each node is associated with a block and labeled with the coefficient of importance of this block. Each edge is labeled with the coefficient of permeability of the target node content to the source node content. Importance and permeability coefficients cannot be manually quantified. In the second part of this paper, we present an experiment that consists of learning optimal permeability coefficients by gradient descent for indexing the images of a web page from the text blocks of this page. The dataset is drawn from real web pages of the train and test sets of the ImagEval task2 corpus. Results demonstrate an improvement in indexing when non-uniform block permeabilities are used.
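
The graph model described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not give the exact indexing formula, so the sketch assumes that a block's final index is its own term weights scaled by its importance, plus the final index of each source block scaled by the permeability on the connecting edge. The function name `propagate_index`, the block names, and all coefficient values are hypothetical.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def propagate_index(local, importance, permeability):
    """Compute a final term-weight vector for every block of a page.

    local:        block -> {term: weight}, the block's own content index
    importance:   block -> coefficient of importance (node label)
    permeability: (source, target) -> coefficient of permeability (edge label)

    Assumption (not from the paper): content flows from source to target,
    and importance scales only the block's own content.
    """
    # Map each block to the sources of its incoming edges.
    preds = {block: set() for block in local}
    for source, target in permeability:
        preds[target].add(source)

    final = {}
    # static_order() yields every block after all of its sources,
    # and raises CycleError if the graph is not acyclic.
    for block in TopologicalSorter(preds).static_order():
        vec = {t: importance[block] * w for t, w in local[block].items()}
        for source in preds[block]:
            coeff = permeability[(source, block)]
            for term, weight in final[source].items():
                vec[term] = vec.get(term, 0.0) + coeff * weight
        final[block] = vec
    return final


# Hypothetical page: an image with no text of its own, a caption, and a menu.
local = {
    "caption": {"eiffel": 2.0, "tower": 1.0},
    "menu": {"home": 1.0},
    "image": {},
}
importance = {"caption": 1.0, "menu": 0.5, "image": 1.0}
# Non-uniform permeabilities: the image is far more permeable to the caption.
permeability = {("caption", "image"): 0.8, ("menu", "image"): 0.1}

final = propagate_index(local, importance, permeability)
print(final["image"])
```

With these hand-picked coefficients, the image block is indexed mostly by the caption terms (about 1.6 for "eiffel" and 0.8 for "tower") and only weakly by the menu term (about 0.05 for "home"). In the paper the permeability coefficients are learned by gradient descent rather than set by hand.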

