Combining content and link for classification using matrix factorization
Source
Annual ACM Conference on Research and Development in Information Retrieval
Proceedings of the 30th annual international ACM SIGIR conference on Research and development in information retrieval
Amsterdam, The Netherlands
SESSION: Link analysis
Pages: 487 - 494  
Year of Publication: 2007
ISBN:978-1-59593-597-7
Authors
Shenghuo Zhu  NEC Laboratories America, Inc., Cupertino, CA
Kai Yu  NEC Laboratories America, Inc., Cupertino, CA
Yun Chi  NEC Laboratories America, Inc., Cupertino, CA
Yihong Gong  NEC Laboratories America, Inc., Cupertino, CA
Sponsors
ACM: Association for Computing Machinery
SIGIR: ACM Special Interest Group on Information Retrieval
Publisher
ACM  New York, NY, USA
DOI: http://doi.acm.org/10.1145/1277741.1277825

ABSTRACT

The world wide web contains rich textual content that is interconnected via complex hyperlinks. This huge database violates the assumption held by most conventional statistical methods that each web page is an independent and identically distributed sample. It is thus difficult to apply traditional mining or learning methods to web mining problems, e.g., web page classification, in a way that exploits both the content and the link structure. Research in this direction has recently received considerable attention but is still at an early stage. Although a few methods exploit both the link structure and the content information, some of them combine only the authority information with the content information, and the others first decompose the link structure into hub and authority features, then apply them as additional document features. This paper aims to design an algorithm, practically attractive for its great simplicity, that exploits both the content and linkage information by carrying out a joint factorization of the linkage adjacency matrix and the document-term matrix. The algorithm derives a new representation for web pages in a low-dimensional factor space, without explicitly separating the factors into content, hub, or authority factors. Further analysis can be performed on this compact representation of web pages. In the experiments, the proposed method is compared with state-of-the-art methods and demonstrates excellent accuracy in hypertext classification on the WebKB and Cora benchmarks.
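
The joint factorization described in the abstract can be illustrated with a small sketch. The abstract does not state the objective function, so everything specific here is an assumption made for illustration: the model form A ≈ Z U Zᵀ for the link matrix and C ≈ Z Vᵀ for the document-term matrix, the weights `alpha` and `reg`, the use of plain gradient descent, and the toy data. The shared factor matrix `Z` plays the role of the low-dimensional page representation.

```python
import numpy as np

def joint_factorize(A, C, k=2, alpha=1.0, reg=0.01, lr=0.01, steps=1000, seed=0):
    """Jointly factorize a link matrix A (n x n) and a document-term matrix C (n x m).

    Assumed objective (not taken from the abstract):
        ||A - Z U Z^T||_F^2 + alpha * ||C - Z V^T||_F^2
        + reg * (||Z||_F^2 + ||U||_F^2 + ||V||_F^2)
    minimized by plain gradient descent. Z (n x k) is shared by both terms.
    """
    rng = np.random.default_rng(seed)
    n, m = C.shape
    Z = 0.1 * rng.standard_normal((n, k))  # shared page factors
    U = 0.1 * rng.standard_normal((k, k))  # link-side mixing matrix
    V = 0.1 * rng.standard_normal((m, k))  # term factors
    history = []
    for _ in range(steps):
        Ra = Z @ U @ Z.T - A  # residual of the link reconstruction
        Rc = Z @ V.T - C      # residual of the content reconstruction
        loss = (Ra ** 2).sum() + alpha * (Rc ** 2).sum() \
            + reg * ((Z ** 2).sum() + (U ** 2).sum() + (V ** 2).sum())
        history.append(float(loss))
        # Gradients of the assumed objective with respect to each factor.
        gZ = 2 * (Ra @ Z @ U.T + Ra.T @ Z @ U) + 2 * alpha * (Rc @ V) + 2 * reg * Z
        gU = 2 * (Z.T @ Ra @ Z) + 2 * reg * U
        gV = 2 * alpha * (Rc.T @ Z) + 2 * reg * V
        Z -= lr * gZ
        U -= lr * gU
        V -= lr * gV
    return Z, U, V, history

# Hypothetical toy data: six pages in two groups of three.
# Pages link to the other pages in their own group.
block = np.ones((3, 3)) - np.eye(3)
A = np.block([[block, np.zeros((3, 3))],
              [np.zeros((3, 3)), block]])
# Group 1 uses terms 0 and 1; group 2 uses terms 2 and 3.
C = np.block([[np.ones((3, 2)), np.zeros((3, 2))],
              [np.zeros((3, 2)), np.ones((3, 2))]])

Z, U, V, history = joint_factorize(A, C, k=2)
```

Each row of `Z` is a compact representation of one page and could be passed to any standard classifier. This mirrors the idea that further analysis is performed on the compact representation, though the classifier and training setup used in the paper are not specified in the abstract.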


