ABSTRACT
Subject-specific search facilities on health sites are usually built using manual inclusion and exclusion rules. These can be expensive to maintain and often provide incomplete coverage of Web resources. On the other hand, health information obtained through whole-of-Web search may not be scientifically based and can be harmful.

To address problems of cost, coverage and quality, we built a focused crawler for the mental health topic of depression, which was able to selectively fetch higher quality relevant information. We found that the relevance of unfetched pages can be predicted based on link anchor context, but the quality cannot. We therefore estimated the quality of the entire linking page, using a learned IR-style query of weighted single words and word pairs, and used this to predict the quality of its links. The overall crawler priority was determined by the product of link relevance and source quality.

We evaluated our crawler against baseline crawls using both relevance judgments and objective site quality scores obtained using an evidence-based rating scale. Both a relevance focused crawler and the quality focused crawler retrieved twice as many relevant pages as a breadth-first control. The quality focused crawler was quite effective in reducing the amount of low quality material fetched while crawling more high quality content, relative to the relevance focused crawler.

Analysis suggests that quality of content might be improved by post-filtering a very large breadth-first crawl, at the cost of substantially increased network traffic.
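
The priority rule stated in the abstract (link relevance multiplied by the quality of the linking page) can be illustrated with a small crawl frontier. This is only a minimal sketch, not the authors' implementation: the class and parameter names (`Frontier`, `link_relevance`, `source_quality`) are hypothetical, both scores are assumed to lie in [0, 1], and the example URLs and score values are made up for illustration.

```python
import heapq
from dataclasses import dataclass, field


@dataclass(order=True)
class FrontierEntry:
    # heapq is a min-heap, so the negated priority is stored
    # to make the highest-priority link pop first.
    neg_priority: float
    url: str = field(compare=False)


class Frontier:
    """Hypothetical crawl frontier ordered by relevance * quality."""

    def __init__(self):
        self._heap = []
        self._seen = set()

    def add(self, url, link_relevance, source_quality):
        # Skip URLs that have already been queued once.
        if url in self._seen:
            return
        self._seen.add(url)
        # Priority is the product of the link's predicted relevance
        # (assumed to come from anchor context) and the estimated
        # quality of the page that contains the link.
        priority = link_relevance * source_quality
        heapq.heappush(self._heap, FrontierEntry(-priority, url))

    def pop(self):
        # Return the URL with the highest priority and that priority.
        entry = heapq.heappop(self._heap)
        return entry.url, -entry.neg_priority

    def __len__(self):
        return len(self._heap)


# Illustrative scores only; these are not values from the paper.
frontier = Frontier()
frontier.add("http://example.org/a", link_relevance=0.9, source_quality=0.2)
frontier.add("http://example.org/b", link_relevance=0.6, source_quality=0.8)
frontier.add("http://example.org/c", link_relevance=0.3, source_quality=0.9)

crawl_order = []
while len(frontier) > 0:
    url, _ = frontier.pop()
    crawl_order.append(url)
```

With a product, a link that looks highly relevant but sits on a low quality page (`/a` above) is fetched after a moderately relevant link on a high quality page (`/b`), which is the kind of trade-off the abstract describes.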