Characterization of national Web domains
Source
ACM Transactions on Internet Technology (TOIT)
Volume 7, Issue 2 (May 2007)
Article No. 9  
Year of Publication: 2007
ISSN:1533-5399
Authors
Ricardo Baeza-Yates, Yahoo! Research
Carlos Castillo, Cátedra Telefónica, Universitat Pompeu Fabra
Efthimis N. Efthimiadis, University of Washington
Publisher
ACM, New York, NY, USA
Bibliometrics
Downloads (6 Weeks): 11, Downloads (12 Months): 178, Citation Count: 6
DOI: http://doi.acm.org/10.1145/1239971.1239973

ABSTRACT

During the last few years, several studies characterizing the public Web space of various national domains have been published. The pages of a country are an interesting set for studying the characteristics of the Web because they are at once diverse (they are written by many different authors) and yet rather similar (they share a common geographical, historical, and cultural context).

This article discusses the methodologies used for presenting the results of Web characterization studies, including the granularity at which different aspects are presented, and a separation of concerns between contents, links, and technologies. Based on this, we present a side-by-side comparison of the results of 12 Web characterization studies, comprising over 120 million pages from 24 countries. The comparison unveils similarities and differences between the collections and sheds light on how certain results of a single Web characterization study on a sample may be valid in the context of the full Web.


REFERENCES

Note: OCR errors may be found in this Reference List extracted from the full text article. ACM has opted to expose the complete List rather than only correct and linked references.

 
1
Alonso, J. L., Figuerola, C. G., and Zazo, Á. F. 2003. Cibermetría: Nuevas Técnicas de Estudio Aplicables al Web. Ediciones TREA, Spain.
2
 
3
Baeza-Yates, R. and Castillo, C. 2000. Caracterizando la Web chilena. In Encuentro Chileno de Ciencias de la Computación. Sociedad Chilena de Ciencias de la Computación, Punta Arenas, Chile.
 
4
Baeza-Yates, R. and Castillo, C. 2001. Relating Web characteristics with link-based Web page ranking. In Proceedings of String Processing and Information Retrieval (SPIRE). IEEE Computer Society Press, 21--32.
 
5
Baeza-Yates, R. and Castillo, C. 2002. Balancing volume, quality and freshness in Web crawling. In Soft Computing Systems---Design, Management and Applications. IOS Press Amsterdam, 565--572.
 
6
Baeza-Yates, R. and Castillo, C. 2004. Crawling the infinite Web: Five levels are enough. In Proceedings of the 3rd Workshop on Web Graphs (WAW). Lecture Notes in Computer Science, vol. 3243. Springer, 156--167.
 
7
Baeza-Yates, R. and Castillo, C. 2005. Características de la Web chilena 2004. Tech. rep., Center for Web Research, University of Chile.
 
8
Baeza-Yates, R., Castillo, C., and Lopez, V. 2006. Características de la Web de España. El Profesional de la Información 15, 1 (Jan.).
 
9
Baeza-Yates, R. and Lalanne, F. 2004. Characteristics of the Korean Web. Tech. rep., Korea--Chile IT Cooperation Center (ITCC).
 
10
Baeza-Yates, R. and Navarro, G. 2004. Modeling text collections and its application to the Web. In Applied Probability: Recent Advances, Kluwer Academic Publishing.
 
11
 
12
Baeza-Yates, R., Poblete, B., and Saint-Jean, F. 2003. Evolución de la Web Chilena 2001--2002. Tech. rep., Center for Web Research, University of Chile.
 
13
Barr, D. 1996. RFC 1912: Common DNS operational and configuration errors. http://www.ietf.org/rfc/rfc1912.txt.
 
14
 
15
 
16
Boldi, P., Codenotti, B., Santini, M., and Vigna, S. 2002. Structural properties of the African Web. In Proceedings of the 11th International Conference on World Wide Web. ACM Press.
 
17
 
18
 
19
Brin, S., Motwani, R., Page, L., and Winograd, T. 1998. What can you do with a Web in your pocket? IEEE Data Engin. Bull. 21, 2, 37--47.
 
20
 
21
Cavnar, W. B. and Trenkle, J. M. 1994. N-gram-based text categorization. In Proceedings of 3rd Annual Symposium on Document Analysis and Information Retrieval (SDAIR' 94). 161--175.
 
22
23
 
24
 
25
Efthimiadis, E. and Castillo, C. 2004. Charting the Greek Web. In Proceedings of the Conference of the American Society for Information Science and Technology (ASIST). American Society for Information Science and Technology.
26
27
28
 
29
Grefenstette, G. and Nioche, J. 2000. Estimation of English and non-English language use on the WWW. In Proceedings of Content-Based Multimedia Information Access (RIAO). 237--246.
 
30
Gyöngyi, Z. and Garcia-Molina, H. 2005. Web spam taxonomy. In 1st International Workshop on Adversarial Information Retrieval on the Web.
 
31
 
32
Huberman, B. A. and Adamic, L. A. 1999. Growth dynamics of the World-Wide Web. Nature 399.
 
33
Jaimes, A., Ruiz, Verschae, R., Baeza-Yates, R., Castillo, C., Yaksic, D., and Davis, E. 2004. On the image content of a Web segment: Chile as a case study. J. Web Engin. 3, 2, 153--168.
34
 
35
Kleinberg, J. M., Kumar, R., Raghavan, P., Rajagopalan, S., and Tomkins, A. S. 1999. The Web as a graph: Measurements, models and methods. In Proceedings of the 5th Annual International Computing and Combinatorics Conference (COCOON). Lecture Notes in Computer Science, vol. 1627. Springer, 1--18.
 
36
Mitzenmacher, M. 2003. Dynamic models for file sizes and double Pareto distributions. Internet Math. 1, 3, 305--333.
 
37
Modesto, M., Pereira, Á., Ziviani, N., Castillo, C., and Baeza-Yates, R. 2005. Um novo retrato da Web Brasileira. In Proceedings of 32nd SEMISH. São Leopoldo, Brazil, 2005--2017.
 
38
Page, L., Brin, S., Motwani, R., and Winograd, T. 1998. The PageRank citation ranking: Bringing order to the Web. Tech. rep., Stanford Digital Library Technologies Project.
 
39
 
40
 
41
Rauber, A., Aschenbrenner, A., Witvoet, O., Bruckner, R. M., and Kaiser, M. 2002. Uncovering information hidden in Web archives. D-Lib Magazine 8, 12.
 
42
Sanguanpong, S., Nga, P. P., Keretho, S., Poovarawan, Y., and Warangrit, S. 2000. Measuring and analysis of the Thai World Wide Web. In Proceeding of the Asia Pacific Advance Network Conference. Beijing, China, 225--230.
 
43
Sanguanpong, S. and Warangrit, S. 1998. Nontrisearch: Search engine for campus network. In National Computer Science and Engineering Conference. Bangkok, Thailand.
 
44
 
45
Veloso, E. A., de Moura, E., Golgher, P., da Silva, A., Almeida, R., Laender, A., Neto, R. B., and Ziviani, N. 2000. Um retrato da Web Brasileira. In Proceedings of Simposio Brasileiro de Computacao. Curitiba, Brasil.
46
 
47
Zipf, G. K. 1949. Human Behavior and the Principle of Least Effort: An Introduction to Human Ecology. Addison-Wesley, Cambridge, MA.

