Automated gathering of Web information: An in-depth examination of agents interacting with search engines
Source: ACM Transactions on Internet Technology (TOIT)
Volume 6, Issue 4 (November 2006)
Pages: 442-464
Year of Publication: 2006
ISSN: 1533-5399
Authors
Bernard J. Jansen  The Pennsylvania State University
Tracy Mullen  The Pennsylvania State University
Amanda Spink  The University of Pittsburgh
Jan Pedersen  Overture Services, Inc.
Publisher
ACM  New York, NY, USA
DOI: http://doi.acm.org/10.1145/1183463.1183468

ABSTRACT

The Web has become a worldwide repository of information that individuals, companies, and organizations use to solve or address various information problems. Many of these Web users rely on automated agents to gather this information for them. Some assume that this approach represents a more sophisticated method of searching. However, there is little research investigating how Web agents search for online information. In this research, we first provide a classification of information agents using stages of information gathering, gathering approaches, and agent architecture. We then examine an implementation of one of the resulting classifications in detail, investigating how agents search for information on Web search engines, including the session, query, term, duration, and frequency of interactions. For this temporal study, we analyzed three data sets of queries and page views from agents interacting with the Excite and AltaVista search engines from 1997 to 2002, examining approximately 900,000 queries submitted by over 3,000 agents. Findings include: (1) agent sessions are extremely interactive, with sometimes hundreds of interactions per second, (2) agent queries are comparable to those of human searchers, with little use of query operators, (3) Web agents are searching for a relatively limited variety of information, wherein only 18% of the terms used are unique, and (4) the duration of agent-Web search engine interaction typically spans several hours. We discuss the implications for Web information agents and search engines.
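
As a rough illustration of the session-, query-, and term-level measures the abstract mentions, the following minimal sketch computes a few of them from a toy query log. The record layout (agent id, timestamp in seconds, query text), the sample data, and the function name `summarize` are all assumptions made for this example; they are not taken from the article, and "distinct term ratio" here is only one possible reading of the share of unique terms.

```python
from collections import defaultdict

# Hypothetical log records: (agent_id, timestamp_seconds, query_text).
# The layout and values are invented for illustration only.
LOG = [
    ("a1", 0, "cheap flights"),
    ("a1", 1, "cheap hotels"),
    ("a1", 7200, "cheap flights boston"),
    ("a2", 10, "stock quotes"),
    ("a2", 11, "stock quotes"),
]


def summarize(log):
    """Return simple session, query, and term statistics for a query log."""
    if not log:
        raise ValueError("log must contain at least one record")

    # Group events by agent; each agent is treated as one session here.
    sessions = defaultdict(list)
    for agent_id, ts, query in log:
        sessions[agent_id].append((ts, query))

    # Term level: split queries on whitespace, case-insensitively.
    terms = [t for _, _, q in log for t in q.lower().split()]

    # Session level: query count and elapsed time per agent.
    per_agent = {}
    for agent_id, events in sessions.items():
        times = [ts for ts, _ in events]
        per_agent[agent_id] = {
            "queries": len(events),
            "duration_s": max(times) - min(times),
        }

    return {
        "agents": len(sessions),
        "queries": len(log),
        "mean_terms_per_query": len(terms) / len(log),
        "distinct_term_ratio": len(set(terms)) / len(terms),
        "per_agent": per_agent,
    }


if __name__ == "__main__":
    print(summarize(LOG))
```

Treating each agent as a single session keeps the sketch short; an analysis of real logs would likely need an explicit rule for splitting sessions and for identifying agents in the first place.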



