ABSTRACT
Many online data sources are updated autonomously and independently. In this article, we make the case for estimating the change frequency of data in order to improve Web crawlers and Web caches and to help data mining. We first identify various scenarios in which different applications have different requirements on the accuracy of the estimated frequency. We then develop several "frequency estimators" for the identified scenarios and show, analytically and experimentally, how precise they are. In many cases, our proposed estimators predict change frequencies much more accurately and improve the effectiveness of applications. For example, a Web crawler could achieve a 35% improvement in "freshness" simply by adopting our proposed estimator.