ABSTRACT
The World Wide Web contains rich textual content that is interconnected via complex hyperlinks. This huge database violates the assumption, held by most conventional statistical methods, that each web page is an independent and identically distributed sample. It is thus difficult to apply traditional mining or learning methods to web mining problems, e.g., web page classification, in a way that exploits both the content and the link structure. Research in this direction has recently received considerable attention but is still at an early stage. Though a few methods exploit both the link structure and the content information, some of them combine only the authority information with the content information, while the others first decompose the link structure into hub and authority features and then apply them as additional document features. This paper aims to design an algorithm, practically attractive for its great simplicity, that exploits both the content and the linkage information by carrying out a joint factorization on both the linkage adjacency matrix and the document-term matrix. The algorithm derives a new representation for web pages in a low-dimensional factor space, without explicitly separating them as content, hub, or authority factors. Further analysis can then be performed on this compact representation of web pages. In the experiments, the proposed method is compared with state-of-the-art methods and demonstrates excellent accuracy in hypertext classification on the WebKB and Cora benchmarks.
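The joint factorization described in the abstract can be illustrated with a small sketch. The abstract does not state the objective function, so everything specific here is an assumption made for illustration: the link matrix is approximated as Z U Z^T and the document-term matrix as Z V^T with a shared page factor Z, the loss is squared error with a small L2 penalty, and the optimizer is plain gradient descent. The function name `joint_factorize`, its parameters (`k`, `alpha`, `reg`, `lr`, `steps`), and the toy data are all hypothetical.

```python
import numpy as np

def joint_factorize(A, C, k=2, alpha=1.0, reg=0.01, lr=0.005, steps=2000, seed=0):
    """Jointly factorize a link matrix A (n x n) and a document-term matrix C (n x m).

    Assumed model (not taken from the abstract): A ~ Z U Z^T and C ~ Z V^T,
    where the rows of Z (n x k) are the shared low-dimensional page factors.
    Returns Z, U, V and the loss recorded at each step.
    """
    rng = np.random.default_rng(seed)
    n, m = C.shape
    Z = 0.1 * rng.standard_normal((n, k))
    U = 0.1 * rng.standard_normal((k, k))
    V = 0.1 * rng.standard_normal((m, k))
    history = []
    for _ in range(steps):
        Ra = Z @ U @ Z.T - A   # residual of the link reconstruction
        Rc = Z @ V.T - C       # residual of the content reconstruction
        loss = (Ra ** 2).sum() + alpha * (Rc ** 2).sum() \
            + reg * ((Z ** 2).sum() + (U ** 2).sum() + (V ** 2).sum())
        history.append(loss)
        # Gradients of the squared-error loss with respect to each factor.
        gZ = 2 * (Ra @ Z @ U.T + Ra.T @ Z @ U) + 2 * alpha * (Rc @ V) + 2 * reg * Z
        gU = 2 * (Z.T @ Ra @ Z) + 2 * reg * U
        gV = 2 * alpha * (Rc.T @ Z) + 2 * reg * V
        Z -= lr * gZ
        U -= lr * gU
        V -= lr * gV
    return Z, U, V, history

# Toy data (hypothetical): 4 pages, 5 terms, two loosely separated topics.
A = np.array([[0, 1, 0, 0],
              [1, 0, 0, 0],
              [0, 0, 0, 1],
              [0, 0, 1, 0]], dtype=float)
C = np.array([[1, 1, 0, 0, 0],
              [1, 1, 1, 0, 0],
              [0, 0, 0, 1, 1],
              [0, 0, 1, 1, 1]], dtype=float)

Z, U, V, history = joint_factorize(A, C, k=2)
print("page factors:\n", Z)
print("loss: first step", history[0], "-> last step", history[-1])
```

Each row of Z is a compact representation of one page that reflects both its terms and its links; these rows could then be passed to any standard classifier. The weight `alpha` is an assumed knob that trades off content fit against link fit.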