ABSTRACT
This paper proposes a formal framework for image and video retrieval using discrete Markov random fields (MRFs). The training dataset consists of images annotated with keywords; individual regions are not labeled. The model is built on a discrete vocabulary of vector-quantized region or point features generated from the training images. Because performance depends on the size of the vocabulary, a large vocabulary of a couple of million visterms is used. Vocabularies of this size cannot be generated by conventional clustering algorithms, so hierarchical k-means is used instead. Unlike many previous techniques, our MRF-based model does not require an explicit annotation step for retrieval: it directly ranks all test images according to the posterior probability of an image given a query. Most models are traditionally trained by maximizing likelihood; this model is instead trained by maximizing average precision. Image and video retrieval experiments are performed on two standard datasets (a Corel dataset and a TRECVID3 dataset), which consist of 4,500 images and about 44,100 keyframes respectively. The results show that, with a large visual vocabulary, the model runs extremely fast even on very large datasets while achieving retrieval performance comparable to the best-performing (continuous-feature) models.
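For readers unfamiliar with the training objective, the following is a minimal sketch of how average precision is commonly computed for a single ranked list. It uses the standard non-interpolated definition and assumes that every relevant image appears somewhere in the ranking, which holds when all test images are ranked. The function names `average_precision` and `mean_average_precision` are illustrative, and the abstract does not state the exact evaluation protocol used in the paper.

```python
from typing import Sequence


def average_precision(ranked_relevance: Sequence[bool]) -> float:
    """Non-interpolated average precision for one ranked list.

    ranked_relevance[i] is True when the item at rank i + 1 is relevant
    to the query. Assumes all relevant items appear in the list.
    """
    hits = 0
    precision_sum = 0.0
    for rank, is_relevant in enumerate(ranked_relevance, start=1):
        if is_relevant:
            hits += 1
            # Precision measured at the rank of each relevant item.
            precision_sum += hits / rank
    # A query with no relevant items is scored 0 by convention here.
    return precision_sum / hits if hits else 0.0


def mean_average_precision(runs: Sequence[Sequence[bool]]) -> float:
    """Mean of per-query average precision over several queries."""
    if not runs:
        return 0.0
    return sum(average_precision(r) for r in runs) / len(runs)
```

For example, a ranking whose first and third items are relevant gives precision 1/1 at rank 1 and 2/3 at rank 3, so its average precision is 5/6. Because this quantity depends on the ranking only through the order of items, it is not a smooth function of model parameters, which is one reason it is usually optimized with search-based methods.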