Untangling compound documents on the web
Source: Conference on Hypertext and Hypermedia
Proceedings of the fourteenth ACM conference on Hypertext and hypermedia
Nottingham, UK
SESSION: Link aggregation
Pages: 85 - 94  
Year of Publication: 2003
ISBN:1-58113-704-4
Authors
Nadav Eiron  IBM Almaden Research Center, San Jose, CA
Kevin S. McCurley  IBM Almaden Research Center, San Jose, CA
Sponsors
SIGWEB: ACM Special Interest Group on Hypertext, Hypermedia, and Web
ACM: Association for Computing Machinery
Publisher
ACM  New York, NY, USA
DOI: http://doi.acm.org/10.1145/900051.900070

ABSTRACT

Most text analysis is designed to deal with the concept of a "document", namely a cohesive presentation of thought on a unifying subject. By contrast, individual nodes on the World Wide Web tend to have a much smaller granularity than text documents. We claim that the notions of "document" and "web node" are not synonymous, and that authors often tend to deploy documents as collections of URLs, which we call "compound documents". In this paper we present new techniques for identifying and working with such compound documents, and the results of some large-scale studies on such web documents. The primary motivation for this work stems from the fact that information retrieval techniques are better suited to working on documents than individual hypertext nodes.


