An empirical study on selective sampling in active learning for splog detection
Source
ACM International Conference Proceeding Series
Proceedings of the 5th International Workshop on Adversarial Information Retrieval on the Web
Madrid, Spain
SESSION: Content analysis
Pages 29-36
Year of Publication: 2009
ISBN: 978-1-60558-438-6
Authors
Taichi Katayama, University of Tsukuba, Tsukuba, Japan
Takehito Utsuro, University of Tsukuba, Tsukuba, Japan
Yuuki Sato, University of Tsukuba, Tsukuba, Japan
Takayuki Yoshinaka, Tokyo Denki University, Tokyo, Japan
Yasuhide Kawada, Navix Co., Ltd., Tokyo, Japan
Tomohiro Fukuhara, University of Tokyo, Kashiwa, Japan
ABSTRACT
This paper studies how to reduce the amount of human supervision needed to identify splogs and authentic blogs when splog data sets are continuously updated year by year. Following previous work on active learning, this paper empirically examines several strategies for selective sampling in active learning with Support Vector Machines (SVMs), applied to the task of splog / authentic blog detection. As a confidence measure for SVM learning, we employ the distance from the separating hyperplane to each test instance, which has been well studied in active learning for text classification. Unlike the results of applying active learning to text classification tasks, in the splog / authentic blog detection task of this paper, adding the least confident samples does not perform best.
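The selection step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes a linear decision function with a hypothetical weight vector `w` and bias `b`, and the strategy names (`least_confident`, `most_confident`, `random`) are illustrative labels rather than the paper's own terminology. The confidence of each unlabeled instance is taken as the absolute distance from the separating hyperplane, and the chosen strategy decides which instances are passed to a human annotator.

```python
import math
import random
from typing import List, Sequence


def signed_distance(w: Sequence[float], b: float, x: Sequence[float]) -> float:
    """Signed distance from point x to the hyperplane w.x + b = 0."""
    norm = math.sqrt(sum(wi * wi for wi in w))
    return (sum(wi * xi for wi, xi in zip(w, x)) + b) / norm


def select_samples(distances: Sequence[float], k: int,
                   strategy: str = "least_confident", seed: int = 0) -> List[int]:
    """Return the indices of k unlabeled instances to hand to an annotator.

    The strategy names are assumptions made for this sketch.
    """
    indices = list(range(len(distances)))
    if strategy == "least_confident":
        # Smallest |distance| first: instances closest to the hyperplane.
        indices.sort(key=lambda i: abs(distances[i]))
    elif strategy == "most_confident":
        # Largest |distance| first: instances farthest from the hyperplane.
        indices.sort(key=lambda i: -abs(distances[i]))
    elif strategy == "random":
        # Baseline: ignore the confidence measure entirely.
        random.Random(seed).shuffle(indices)
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    return indices[:k]


# Toy data (hypothetical): a 2-D linear model and four unlabeled instances.
w, b = [3.0, 4.0], -5.0
pool = [[1.0, 1.0], [3.0, 4.0], [0.0, 0.0], [1.0, 0.5]]
distances = [signed_distance(w, b, x) for x in pool]

least = select_samples(distances, 2, "least_confident")
most = select_samples(distances, 2, "most_confident")
print(least, most)
```

In a full active learning loop, the selected instances would be labeled, added to the training set, and the SVM retrained before the next round of selection. Comparing the strategies on held-out data is what makes it possible to check whether the least confident samples are in fact the most useful ones for a given task.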