An empirical study on selective sampling in active learning for splog detection
Source: ACM International Conference Proceeding Series
Proceedings of the 5th International Workshop on Adversarial Information Retrieval on the Web
Madrid, Spain
SESSION: Content analysis
Pages 29-36  
Year of Publication: 2009
ISBN:978-1-60558-438-6
Authors
Taichi Katayama  University of Tsukuba, Tsukuba, Japan
Takehito Utsuro  University of Tsukuba, Tsukuba, Japan
Yuuki Sato  University of Tsukuba, Tsukuba, Japan
Takayuki Yoshinaka  Tokyo Denki University, Tokyo, Japan
Yasuhide Kawada  Navix Co., Ltd., Tokyo, Japan
Tomohiro Fukuhara  University of Tokyo, Kashiwa, Japan
Publisher
ACM  New York, NY, USA
DOI: http://doi.acm.org/10.1145/1531914.1531921

ABSTRACT

This paper studies how to reduce the amount of human supervision needed to distinguish splogs from authentic blogs when splog data sets are updated continuously, year by year. Following previous work on active learning, it empirically examines several selective sampling strategies for active learning with Support Vector Machines (SVMs) on the task of splog / authentic blog detection. As the confidence measure for SVM learning, we employ the distance from the separating hyperplane to each test instance, a measure that has been well studied in active learning for text classification. Unlike the results reported for text classification tasks, in the splog / authentic blog detection task of this paper, adding the least confident samples does not perform best.
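
The selective sampling idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes a linear hyperplane w.x + b = 0 that has already been learned, and the names signed_distance, select_samples, and the strategy labels are hypothetical. It only shows how the distance from the hyperplane can rank unlabeled instances so that either the least confident or the most confident ones are sent for human labeling.

```python
import math


def signed_distance(w, b, x):
    """Signed distance from instance x to the hyperplane w.x + b = 0."""
    norm = math.sqrt(sum(wi * wi for wi in w))
    return (sum(wi * xi for wi, xi in zip(w, x)) + b) / norm


def select_samples(w, b, pool, k, strategy="least_confident"):
    """Return indices of k pool instances to hand to a human annotator.

    Confidence is the absolute distance from the hyperplane:
    a small distance means the classifier is unsure about the instance.
    The strategy names are illustrative assumptions, not the paper's terms.
    """
    if strategy not in ("least_confident", "most_confident"):
        raise ValueError("unknown strategy: " + strategy)
    confidence = [abs(signed_distance(w, b, x)) for x in pool]
    # Ascending order puts the least confident instances first;
    # descending order puts the most confident instances first.
    order = sorted(
        range(len(pool)),
        key=lambda i: confidence[i],
        reverse=(strategy == "most_confident"),
    )
    return order[:k]


if __name__ == "__main__":
    # Toy hyperplane and unlabeled pool (made-up values for illustration).
    w, b = [3.0, 4.0], -5.0
    pool = [[1.0, 1.0], [3.0, 4.0], [0.0, 0.0], [1.0, 0.5]]
    print(select_samples(w, b, pool, 2, "least_confident"))
    print(select_samples(w, b, pool, 2, "most_confident"))
```

In an active learning loop, the selected instances would be labeled, added to the training set, and the SVM retrained before the next round of selection. Which strategy works better is an empirical question; the abstract reports that least confident sampling was not the best choice for splog detection.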


