ABSTRACT
Bag-of-features (BoF) representations derived from local keypoints have recently shown promise for object and scene classification. Whether BoF can meet the reliability and scalability challenges of visual classification nevertheless remains uncertain, because its performance depends on many implementation choices. In this paper, we evaluate the factors that govern the performance of BoF: the choice of keypoint detector, kernel, vocabulary size, and weighting scheme. We offer practical insights into how to optimize performance by choosing a good keypoint detector and kernel. For the weighting scheme, we propose a novel soft-weighting method to assess the significance of a visual word to an image. We show experimentally that the proposed soft-weighting scheme consistently outperforms other popular weighting methods. On both the PASCAL-2005 and TRECVID-2006 datasets, our BoF setting achieves performance competitive with state-of-the-art techniques. We also show that BoF is highly complementary to global features: combining BoF with color and texture features yields a reported improvement of 50% on the TRECVID-2006 dataset.
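The abstract does not spell out how the soft-weighting works, so the following is only a rough sketch of the general idea of weighting visual words softly rather than by hard counts. It assumes, purely for illustration, that each keypoint votes for its few nearest visual words, that the vote is halved at each successive rank, and that similarity is 1 / (1 + Euclidean distance). The function names, the `top_n` parameter, and the similarity formula are hypothetical and are not taken from the paper.

```python
import math


def hard_count_histogram(descriptors, vocabulary):
    """Baseline: each keypoint increments only its single nearest visual word."""
    hist = [0.0] * len(vocabulary)
    for d in descriptors:
        nearest = min(range(len(vocabulary)),
                      key=lambda k: math.dist(d, vocabulary[k]))
        hist[nearest] += 1.0
    return hist


def soft_weight_histogram(descriptors, vocabulary, top_n=4):
    """Illustrative soft weighting (assumed scheme, not the paper's definition).

    Each keypoint votes for its top_n nearest visual words. The word at
    rank i (0-based) receives similarity / 2**i, where similarity is
    assumed to be 1 / (1 + Euclidean distance).
    """
    hist = [0.0] * len(vocabulary)
    for d in descriptors:
        # Distance from this keypoint descriptor to every visual word.
        ranked = sorted((math.dist(d, w), k) for k, w in enumerate(vocabulary))
        for rank, (dist, k) in enumerate(ranked[:top_n]):
            hist[k] += (1.0 / (1.0 + dist)) / (2 ** rank)
    return hist


if __name__ == "__main__":
    # Toy 2-D "descriptors" and a three-word vocabulary (made-up values).
    vocab = [(0.0, 0.0), (1.0, 0.0), (0.0, 3.0)]
    keypoints = [(0.0, 0.0)]
    print(hard_count_histogram(keypoints, vocab))
    print(soft_weight_histogram(keypoints, vocab, top_n=2))
```

In this toy setting the hard-count histogram credits only the closest word, while the soft-weighted one also gives partial credit to the second-closest word, which is the kind of distinction a weighting-scheme comparison would probe.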
CITED BY 13
Jun Yang, Yu-Gang Jiang, Alexander G. Hauptmann, Chong-Wah Ngo, Evaluating bag-of-visual-words representations in scene classification, Proceedings of the international workshop on Workshop on multimedia information retrieval, September 24-29, 2007, Augsburg, Bavaria, Germany

Guo-Jun Qi, Xian-Sheng Hua, Yong Rui, Jinhui Tang, Tao Mei, Meng Wang, Hong-Jiang Zhang, Correlative multilabel video annotation with temporal kernels, ACM Transactions on Multimedia Computing, Communications, and Applications (TOMCCAP), v.5 n.1, p.1-27, October 2008

Xiao-Yong Wei, Chong-Wah Ngo, Fusing semantics, observability, reliability and diversity of concept detectors for video search, Proceeding of the 16th ACM international conference on Multimedia, October 26-31, 2008, Vancouver, British Columbia, Canada

Hung-Khoon Tan, Xiao Wu, Chong-Wah Ngo, Wan-Lei Zhao, Accelerating near-duplicate video matching by combining visual similarity and alignment distortion, Proceeding of the 16th ACM international conference on Multimedia, October 26-31, 2008, Vancouver, British Columbia, Canada

Meng Wang, Xian-Sheng Hua, Richang Hong, Jinhui Tang, Guo-Jun Qi, Yan Song, Unified video annotation via multigraph learning, IEEE Transactions on Circuits and Systems for Video Technology, v.19 n.5, p.733-746, May 2009