Characterizing a national community web
Source: ACM Transactions on Internet Technology (TOIT)
Volume 5, Issue 3 (August 2005)
Pages: 508-531
Year of Publication: 2005
ISSN:1533-5399
Authors
Daniel Gomes, University of Lisbon, Lisboa, Portugal
Mário J. Silva, University of Lisbon, Lisboa, Portugal
Publisher
ACM, New York, NY, USA
DOI: http://doi.acm.org/10.1145/1084772.1084775

ABSTRACT

This article presents a characterization of the community Web of the people of Portugal. We defined criteria for delimiting this Web based on our past experience of crawling pages related to Portugal and collected over 3.2 million documents from 46,000 sites satisfying those criteria. Our characterization was derived from this crawl. We describe the rules that we established for defining the boundaries of this community Web and the methodology used to gather statistics. Statistics cover the number and domain distribution of sites; the number, type and size distribution of text documents; and the linkage structure of this Web. We also show how crawling constraints and abnormal situations on the Web can influence the statistics.


