ABSTRACT
This article presents a characterization of the community Web of the people of Portugal. We defined criteria for delimiting this Web based on our past experience of crawling pages related to Portugal and collected over 3.2 million documents from 46,000 sites satisfying those criteria. Our characterization was derived from this crawl. We describe the rules that we established for defining the boundaries of this community Web and the methodology used to gather statistics. Statistics cover the number and domain distribution of sites; the number, type and size distribution of text documents; and the linkage structure of this Web. We also show how crawling constraints and abnormal situations on the Web can influence the statistics.