ABSTRACT
With over 800 million pages covering most areas of human endeavor, the World-wide Web is fertile ground for data mining research that can improve the effectiveness of information search. Today, Web surfers access the Web through two dominant interfaces: clicking on hyperlinks and searching via keyword queries. This process is often tentative and unsatisfactory. Better support is needed for expressing one's information need and for dealing with a search result in more structured ways than are available now. Data mining and machine learning have significant roles to play towards this end.

In this paper we survey recent advances in learning and mining problems related to hypertext in general and the Web in particular. We review the continuum of supervised to semi-supervised to unsupervised learning problems, highlight the specific challenges that distinguish data mining in the hypertext domain from data mining in the context of data warehouses, and summarize the key areas of recent and ongoing research.