Combining classifiers to identify online databases
Source
International World Wide Web Conference archive
Proceedings of the 16th International Conference on World Wide Web
Banff, Alberta, Canada
Session: Crawlers
Pages: 431-440
Year of Publication: 2007
ISBN: 978-1-59593-654-7
Authors
Luciano Barbosa  University of Utah, Salt Lake City, UT
Juliana Freire  University of Utah, Salt Lake City, UT
Sponsor
ACM: Association for Computing Machinery
Publisher
ACM  New York, NY, USA
Bibliometrics
Downloads (6 Weeks): 12,   Downloads (12 Months): 84,   Citation Count: 4
DOI: http://doi.acm.org/10.1145/1242572.1242631

ABSTRACT

We address the problem of identifying the domain of online databases. More precisely, given a set F of Web forms automatically gathered by a focused crawler and an online database domain D, our goal is to select from F only the forms that are entry points to databases in D. Having a set of Web forms that serve as entry points to similar online databases is a requirement for many applications and techniques that aim to extract and integrate hidden-Web information, such as meta-searchers, online database directories, hidden-Web crawlers, and form-schema matching and merging.

We propose a new strategy that automatically and accurately classifies online databases based on features that can be easily extracted from Web forms. By judiciously partitioning the space of form features, this strategy allows the use of simpler classifiers that can be constructed using learning techniques that are better suited for the features of each partition. Experiments using real Web data in a representative set of domains show that the use of different classifiers leads to high accuracy, precision and recall. This indicates that our modular classifier composition provides an effective and scalable solution for classifying online databases.
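The idea of partitioning form features and composing one simple classifier per partition can be illustrated with a minimal sketch. The abstract does not say which features or learners are used, so everything below is an assumption made for illustration: the `Form` fields, the hand-written structural rule in `is_searchable`, and the small naive Bayes text model in `KeywordDomainClassifier` are hypothetical stand-ins, not the authors' method.

```python
from collections import Counter
from dataclasses import dataclass
import math


@dataclass
class Form:
    # Hypothetical representation of a Web form; all field names are assumptions.
    num_text_inputs: int
    num_selects: int
    has_password: bool
    text: str


def is_searchable(form: Form) -> bool:
    """Structural partition: a fixed rule standing in for a learned classifier."""
    if form.has_password:  # treat login/registration forms as non-searchable
        return False
    return form.num_text_inputs + form.num_selects >= 1


class KeywordDomainClassifier:
    """Textual partition: multinomial naive Bayes with add-one smoothing."""

    def fit(self, texts, labels):
        self.counts = {True: Counter(), False: Counter()}
        self.docs = Counter(labels)
        for text, label in zip(texts, labels):
            self.counts[label].update(text.lower().split())
        self.vocab = set(self.counts[True]) | set(self.counts[False])
        return self

    def _log_score(self, words, label):
        total = sum(self.counts[label].values())
        score = math.log(self.docs[label] / sum(self.docs.values()))
        for w in words:
            score += math.log((self.counts[label][w] + 1) / (total + len(self.vocab)))
        return score

    def predict(self, text):
        # Words never seen in training are ignored.
        words = [w for w in text.lower().split() if w in self.vocab]
        return self._log_score(words, True) > self._log_score(words, False)


def select_forms(forms, domain_clf):
    """Keep only forms accepted by both the structural and the textual classifier."""
    return [f for f in forms if is_searchable(f) and domain_clf.predict(f.text)]
```

In this sketch each classifier sees only its own partition of the features, and a form in F is kept only when both accept it. Other compositions (for example voting or weighting) would be equally consistent with the abstract's wording.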



