ABSTRACT
The contents of many valuable Web-accessible databases are only available through search interfaces and are hence invisible to traditional Web "crawlers." Recently, commercial Web sites have started to manually organize Web-accessible databases into Yahoo!-like hierarchical classification schemes. Here we introduce QProber, a modular system that automates this classification process by using a small number of query probes, generated by document classifiers. QProber can use a variety of types of classifiers to generate the probes. To classify a database, QProber does not retrieve or inspect any documents or pages from the database, but rather just exploits the number of matches that each query probe generates at the database in question. We have conducted an extensive experimental evaluation of QProber over collections of real documents, experimenting with different types of document classifiers and retrieval models. We have also tested our system with over one hundred Web-accessible databases. Our experiments show that our system has low overhead and achieves high classification accuracy across a variety of databases.
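The count-based idea described above can be illustrated with a minimal sketch. It is not QProber's actual algorithm: the function name, the probe strings, the threshold values, and the decision rule (a category is kept when its probes account for a large enough share of all matches) are assumptions made purely for illustration. The only thing taken from the abstract is that the classifier sees match counts, never documents.

```python
from typing import Callable, Dict, List


def classify_by_probe_counts(
    count_matches: Callable[[str], int],
    probes_by_category: Dict[str, List[str]],
    min_share: float = 0.4,   # assumed threshold, not from the paper
    min_matches: int = 10,    # assumed threshold, not from the paper
) -> List[str]:
    """Pick categories using only the number of matches per query probe.

    count_matches stands in for a database search interface that reports
    how many documents match a query; no documents are ever retrieved.
    """
    # Total number of matches reported for each category's probes.
    totals = {
        category: sum(count_matches(query) for query in queries)
        for category, queries in probes_by_category.items()
    }
    grand_total = sum(totals.values())
    if grand_total == 0:
        return []
    # Keep categories with enough matches, both absolutely and relatively.
    return [
        category
        for category, n in totals.items()
        if n >= min_matches and n / grand_total >= min_share
    ]


# Hypothetical probes and match counts for a made-up medical database.
probes = {
    "Health": ["heart AND surgery", "cancer AND therapy"],
    "Sports": ["baseball AND pitcher"],
    "Finance": ["stock AND dividend"],
}
fake_counts = {
    "heart AND surgery": 120,
    "cancer AND therapy": 200,
    "baseball AND pitcher": 3,
    "stock AND dividend": 5,
}
result = classify_by_probe_counts(lambda q: fake_counts.get(q, 0), probes)
print(result)
```

In this toy setup the simulated database reports almost all of its matches for the "Health" probes, so that is the only category returned. A real deployment would replace the lambda with a call to the database's search interface that parses the reported match count.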