ABSTRACT
The contents of many valuable web-accessible databases are only accessible through search interfaces and are hence invisible to traditional web “crawlers.” Recent studies have estimated the size of this “hidden web” to be 500 billion pages, while the size of the “crawlable” web is only an estimated two billion pages. Recently, commercial web sites have started to manually organize web-accessible databases into Yahoo!-like hierarchical classification schemes. In this paper, we introduce a method for automating this classification process by using a small number of query probes. To classify a database, our algorithm does not retrieve or inspect any documents or pages from the database, but rather just exploits the number of matches that each query probe generates at the database in question. We have conducted an extensive experimental evaluation of our technique over collections of real documents, including over one hundred web-accessible databases. Our experiments show that our system has low overhead and achieves high classification accuracy across a variety of databases.
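The probe-and-count idea described above can be illustrated with a minimal sketch. It is not the paper's actual algorithm: the function names, the probe lists, and the simple "share of total matches" scoring rule are all assumptions made for illustration. The only property taken from the abstract is that classification uses the number of matches each query probe reports, never the documents themselves.

```python
from typing import Callable, Dict, List


def category_shares(
    count_matches: Callable[[str], int],
    probes_by_category: Dict[str, List[str]],
) -> Dict[str, float]:
    """Return each category's share of all probe matches.

    count_matches is a hypothetical callable that sends one query probe to
    the database's search interface and returns only the reported number
    of matches; no documents are retrieved or inspected.
    """
    # Sum the match counts of the probes associated with each category.
    totals = {
        category: sum(count_matches(probe) for probe in probes)
        for category, probes in probes_by_category.items()
    }
    grand_total = sum(totals.values())
    if grand_total == 0:
        # No probe matched anything: no evidence for any category.
        return {category: 0.0 for category in totals}
    return {category: n / grand_total for category, n in totals.items()}


def best_category(shares: Dict[str, float]) -> str:
    """Pick the category with the largest share (an assumed decision rule)."""
    return max(shares, key=lambda category: shares[category])


# Stand-in for a real search interface: canned match counts per probe.
fake_counts = {"cancer": 120, "diabetes": 60, "keyboard": 5, "compiler": 15}
probes = {
    "Health": ["cancer", "diabetes"],
    "Computers": ["keyboard", "compiler"],
}
shares = category_shares(lambda q: fake_counts.get(q, 0), probes)
label = best_category(shares)
```

In this toy setup the stand-in database is assigned to whichever category accounts for most of the probe matches. A real system would need a principled way to derive the probes and to decide when a share is high enough to assign a category.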