ABSTRACT
Most text analysis is designed to deal with the concept of a "document", namely a cohesive presentation of thought on a unifying subject. By contrast, individual nodes on the World Wide Web tend to have a much smaller granularity than text documents. We claim that the notions of "document" and "web node" are not synonymous, and that authors often deploy documents as collections of URLs, which we call "compound documents". In this paper we present new techniques for identifying and working with such compound documents, and the results of some large-scale studies on such web documents. The primary motivation for this work stems from the fact that information retrieval techniques are better suited to working on documents than on individual hypertext nodes.