|
ABSTRACT
We address the problem of identifying the domain of onlinedatabases. More precisely, given a set F of Web forms automaticallygathered by a focused crawler and an online databasedomain D, our goal is to select from F only the formsthat are entry points to databases in D. Having a set ofWebforms that serve as entry points to similar online databasesis a requirement for many applications and techniques thataim to extract and integrate hidden-Web information, suchas meta-searchers, online database directories, hidden-Webcrawlers, and form-schema matching and merging.We propose a new strategy that automatically and accuratelyclassifies online databases based on features that canbe easily extracted from Web forms. By judiciously partitioningthe space of form features, this strategy allows theuse of simpler classifiers that can be constructed using learningtechniques that are better suited for the features of eachpartition. Experiments using real Web data in a representativeset of domains show that the use of different classifiersleads to high accuracy, precision and recall. This indicatesthat our modular classifier composition provides an effectiveand scalable solution for classifying online databases.
REFERENCES
Note: OCR errors may be found in this Reference List extracted from the full text article. ACM has opted to expose the complete List rather than only correct and linked references.
| |
1
|
|
| |
2
|
L. Barbosa and J. Freire. Siphoning Hidden-Web Data through Keyword-Based Interfaces. In Proc. of SBBD, pages 309--321, 2004.
|
| |
3
|
L. Barbosa and J. Freire. Searching for Hidden-Web Databases. In Proceedings of WebDB, pages 1--6, 2005.
|
| |
4
|
L. Barbosa and J. Freire. Organizing hidden-web databases by clustering visible web documents. In Proceedings of ICDE, 2007. To appear.
|
 |
5
|
|
| |
6
|
|
| |
7
|
|
| |
8
|
Brightplanet's searchable databases directory. http://www.completeplanet.com.
|
 |
9
|
|
| |
10
|
|
| |
11
|
K. C.-C. Chang, B. He, and Z. Zhang. Toward Large-Scale Integration: Building a MetaQuerier over Databases on the Web. In Proc. of CIDR, pages 44--55, 2005.
|
| |
12
|
|
| |
13
|
|
| |
14
|
Y. Even-Zohar and D. Roth. A sequential model for multi-class classification. In Empirical Methods in Natural Language Processing, 2001.
|
| |
15
|
M. Galperin. The molecular biology database collection: 2005 update. Nucleic Acids Res, 33, 2005.
|
| |
16
|
|
 |
17
|
|
 |
18
|
|
 |
19
|
|
| |
20
|
H. He, W. Meng, C. Yu, and Z. Wu. Wise-integrator: An automatic integrator of web search interfaces for e-commerce. In Proceedings of VLDB, pages 357--368, 2003.
|
| |
21
|
B. Heisele, T. Serreb, S. Prenticeb, and T. Poggiob. Hierarchical Classification and Feature Reduction for Fast face Detection with Support Vector Machines. Pattern Recognition, 36(9), 2003.
|
| |
22
|
A. Hess and N. Kushmerick. Automatically attaching semantic metadata to web services. In Proceedings of IIWeb, pages 111--116, 2003.
|
 |
23
|
|
| |
24
|
|
| |
25
|
|
| |
26
|
|
| |
27
|
Y. Ru and E. Horowitz. Indexing the invisible Web: a survey. Online Information Review, 29(3):249--265, 2005.
|
| |
28
|
E. H. Simpson. Measurement of Diversity. Nature, 163:688, 1949.
|
| |
29
|
S. Sizov, M. Biwer, J. Graupmann, S. Siersdorfer, M. Theobald, G. Weikum, and P. Zimmer. The BINGO! System for Information Portal Generation and Expert Web Search. In Proc. of CIDR, 2003.
|
| |
30
|
The UIUC Web integration repository. http://metaquerier.cs.uiuc.edu/repository.
|
| |
31
|
|
| |
32
|
|
 |
33
|
|
CITED BY 4
|
|
|
|
|
|
|
|
Marie Dominique Devignes , Philippe Franiatte , Nizar Messai , Amedeo Napoli , Malika Smail-Tabbone, BioRegistry: automatic extraction of metadata for biological database retrieval and discovery, Proceedings of the 10th International Conference on Information Integration and Web-based Applications & Services, November 24-26, 2008, Linz, Austria
|
|
|
|
|