ABSTRACT
During the last few years, several studies on the characterization of the public Web space of various national domains have been published. The pages of a country are an interesting set for studying the characteristics of the Web because they are at once diverse (as they are written by several authors) and yet rather similar (as they share a common geographical, historical and cultural context). This article discusses the methodologies used for presenting the results of Web characterization studies, including the granularity at which different aspects are presented, and a separation of concerns between contents, links, and technologies. Based on this, we present a side-by-side comparison of the results of 12 Web characterization studies, comprising over 120 million pages from 24 countries. The comparison unveils similarities and differences between the collections and sheds light on how certain results of a single Web characterization study on a sample may be valid in the context of the full Web.