ABSTRACT
The organization of HTML into a tag tree, which browsers render as roughly rectangular regions with embedded text and HREF links, greatly helps surfers locate and click on the links that best satisfy their information need. Can an automatic program emulate this human behavior and thereby learn to predict the relevance of an unseen HREF target page with respect to an information need, using only information available on the HREF source page? Such a capability would be of great interest in focused crawling and resource discovery, because it could fine-tune the priority of unvisited URLs in the crawl frontier and reduce the number of irrelevant pages that are fetched and discarded.
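To make the idea of prioritizing unvisited URLs concrete, here is a minimal Python sketch of a crawl frontier kept as a priority queue. It is an illustration only, not the method of the paper: the class name `CrawlFrontier`, its methods, and the example URLs are hypothetical, and the relevance score is assumed to come from some predictor that looks only at the HREF source page.

```python
import heapq
import itertools


class CrawlFrontier:
    """Unvisited URLs ordered by predicted relevance (highest first).

    Hypothetical helper for illustration; the score is assumed to be
    produced elsewhere from the HREF source page.
    """

    def __init__(self):
        self._heap = []                    # entries: (-score, tie_breaker, url)
        self._best = {}                    # url -> best score seen so far
        self._visited = set()              # URLs already handed out by pop()
        self._counter = itertools.count()  # preserves insertion order on ties

    def add(self, url, score):
        """Queue a URL, or raise its priority if a better score arrives."""
        if url in self._visited:
            return
        if score > self._best.get(url, float("-inf")):
            self._best[url] = score
            heapq.heappush(self._heap, (-score, next(self._counter), url))

    def pop(self):
        """Return the most promising unvisited URL, or None if none remain."""
        while self._heap:
            neg_score, _, url = heapq.heappop(self._heap)
            # Skip stale heap entries superseded by a later, higher score.
            if self._best.get(url) == -neg_score:
                del self._best[url]
                self._visited.add(url)
                return url
        return None


frontier = CrawlFrontier()
frontier.add("http://example.org/a", 0.20)
frontier.add("http://example.org/b", 0.90)
frontier.add("http://example.org/a", 0.95)  # a second, more promising link to a
first = frontier.pop()
second = frontier.pop()
third = frontier.pop()
```

A URL reached through several links keeps only its highest score, so one strongly relevant source page is enough to move a target toward the front of the queue; the outdated heap entries are discarded lazily when they surface.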