Panorama: extending digital libraries with topical crawlers
Source: International Conference on Digital Libraries
Proceedings of the 4th ACM/IEEE-CS joint conference on Digital libraries
Tucson, AZ, USA
SESSION: Crawling the web
Pages: 142 - 150
Year of Publication: 2004
ISBN:1-58113-832-6
ABSTRACT
A large number of research, technical and professional documents are available today in digital formats. Digital libraries are created to facilitate search and retrieval of the information these documents contain. These libraries may span an entire area of interest (e.g., computer science) or be limited to documents within a small organization. While tools that index, classify, rank and retrieve documents from such libraries are important, it would be worthwhile to complement these tools with information available on the Web. We propose one such technique that uses a topical crawler driven by the information extracted from a research document. The goal of the crawler is to harvest a collection of Web pages that are focused on the topical subspaces associated with the given document. The collection created through Web crawling is further processed using lexical and linkage analysis. The entire process is automated and uses machine learning techniques both to guide the crawler and to analyze the collection it fetches. At the end, a report is generated that provides visual cues and information to the researcher.
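To make the idea of a topical crawler concrete, here is a minimal sketch of a best-first crawl. It is not the authors' implementation: the abstract says the crawler is guided by machine learning, whereas this example substitutes a plain cosine similarity between term-count vectors. The names `topical_crawl`, `term_vector`, `cosine`, and the in-memory `TOY_WEB` graph are all hypothetical and stand in for real page fetching.

```python
import heapq
import math
import re
from collections import Counter


def term_vector(text):
    """Lowercased bag-of-words term counts."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def topical_crawl(seed_urls, topic_text, fetch, max_pages=10):
    """Best-first crawl: visit links found on on-topic pages first.

    `fetch(url)` is an assumed callback returning (text, links) or None.
    """
    topic = term_vector(topic_text)
    # heapq is a min-heap, so priorities are stored negated.
    frontier = [(-1.0, url) for url in seed_urls]
    heapq.heapify(frontier)
    seen = set(seed_urls)
    harvested = []
    while frontier and len(harvested) < max_pages:
        _, url = heapq.heappop(frontier)
        page = fetch(url)
        if page is None:
            continue
        text, links = page
        score = cosine(topic, term_vector(text))
        harvested.append((url, score))
        # An unvisited link inherits the score of the page that cites it.
        for link in links:
            if link not in seen:
                seen.add(link)
                heapq.heappush(frontier, (-score, link))
    return harvested


# Hypothetical toy "Web": url -> (page text, outgoing links).
TOY_WEB = {
    "a": ("focused web crawler for digital libraries", ["b", "c"]),
    "b": ("topical crawler digital libraries research", ["d"]),
    "c": ("cooking recipes and garden tips", ["e"]),
    "d": ("digital libraries crawler evaluation", []),
    "e": ("more recipes", []),
}

result = topical_crawl(
    ["a"], "topical crawler digital libraries", TOY_WEB.get, max_pages=4
)
for url, score in result:
    print(url, round(score, 3))
```

In this toy run the topic text plays the role of the information extracted from a research document. Pages linked from on-topic pages are visited before the off-topic branch, which is the behaviour a focused crawl aims for.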