ABSTRACT
We formulate and propose the template detection problem, and suggest a practical solution for it based on counting frequent item sets. We show that the use of templates is pervasive on the web. We describe three principles, which characterize the assumptions made by hypertext information retrieval (IR) and data mining (DM) systems, and show that templates are a major source of violation of these principles. As a consequence, basic "pure" implementations of simple search algorithms coupled with template detection and elimination show surprising increases in precision at all levels of recall.
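As a rough illustration of what "counting frequent item sets" can look like in this setting, the sketch below counts how many pages share each small set of outgoing links and keeps the sets that recur on many pages as template candidates. This is a minimal sketch under stated assumptions, not the authors' algorithm: the function name `frequent_link_sets`, the choice of outgoing links as the items, the `min_support` threshold, and the sample pages are all hypothetical.

```python
from collections import Counter
from itertools import combinations


def frequent_link_sets(pages, min_support, set_size=2):
    """Return link sets of size `set_size` found on at least `min_support` pages.

    `pages` maps a page name to the list of links on that page.
    Treating links as items is an assumption made for illustration only.
    """
    counts = Counter()
    for links in pages.values():
        # Deduplicate and sort so each set is counted once per page
        # and always appears in the same canonical order.
        for combo in combinations(sorted(set(links)), set_size):
            counts[combo] += 1
    return {combo: n for combo, n in counts.items() if n >= min_support}


# Hypothetical site: three pages share navigation links, each has one unique link.
pages = {
    "a.html": ["/home", "/about", "/contact", "/story-1"],
    "b.html": ["/home", "/about", "/contact", "/story-2"],
    "c.html": ["/home", "/about", "/story-3"],
}

template_candidates = frequent_link_sets(pages, min_support=3)
```

Link sets that survive a high support threshold are likely to be navigation or boilerplate rather than page-specific content, which is the intuition behind removing them before running a search or mining algorithm.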