Descriptive visual words and visual phrases for image applications
Source
International Multimedia Conference archive
Proceedings of the 17th ACM international conference on Multimedia
Beijing, China
SESSION: Content track C1: image retrieval
Pages 75-84  
Year of Publication: 2009
ISBN: 978-1-60558-608-3
Authors
Shiliang Zhang  Key Lab of Intelligent Information Processing, Institute of Computing Technology, CAS, Beijing, China
Qi Tian  Microsoft Research Asia, Beijing, China
Gang Hua  Microsoft Live Labs Research, Redmond, USA
Qingming Huang  Graduate University of Chinese Academy of Sciences, Beijing, China
Shipeng Li  Microsoft Research Asia, Beijing, China
Sponsor
SIGMULTIMEDIA: ACM Special Interest Group on Multimedia
Publisher
ACM  New York, NY, USA
DOI: http://doi.acm.org/10.1145/1631272.1631285

ABSTRACT

The Bag-of-visual Words (BoW) image representation has been applied to various problems in multimedia and computer vision. The basic idea is to represent images as visual documents composed of repeatable and distinctive visual elements, which are comparable to the words in text. However, extensive experiments show that the commonly used visual words are not as expressive as text words, which hinders their effectiveness in various applications. In this paper, Descriptive Visual Words (DVWs) and Descriptive Visual Phrases (DVPs) are proposed as the visual counterparts of text words and phrases, where visual phrases refer to frequently co-occurring visual word pairs. Since images are the carriers of visual objects and scenes, a novel descriptive visual element set can be composed of the visual words and their combinations that are effective in representing certain visual objects or scenes. Based on this idea, a general framework is proposed for generating DVWs and DVPs from classic visual words for various applications. In a large-scale image database containing 1506 object and scene categories, the visual words and visual word pairs descriptive of certain scenes or objects are identified as the DVWs and DVPs. Experiments show that the DVWs and DVPs are compact and descriptive, and are thus more comparable to text words than the classic visual words are. We apply the identified DVWs and DVPs in several applications, including image retrieval, image re-ranking, and object recognition. The DVW and DVP combination outperforms the classic visual words by 19.5% and 80% in image retrieval and object recognition tasks, respectively. The DVW- and DVP-based image re-ranking algorithm, DWPRank, outperforms the state-of-the-art VisualRank by 12.4% in accuracy and is about 11 times faster.
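As a rough illustration of the representation the abstract builds on, the Python sketch below quantizes local descriptors into classic visual words, builds a BoW histogram, and counts co-occurring visual word pairs as candidate visual phrases. It is not the authors' DVW/DVP selection framework. The function names (`quantize`, `bow_histogram`, `cooccurring_pairs`), the toy 2-D descriptors, and the rule that two words "co-occur" when their keypoints lie within `radius` pixels are all assumptions made for this example.

```python
from collections import Counter
from itertools import combinations
import math


def quantize(descriptor, codebook):
    """Return the index of the nearest codebook centre (the visual word)."""
    return min(range(len(codebook)),
               key=lambda k: math.dist(descriptor, codebook[k]))


def bow_histogram(words):
    """Count how often each visual word occurs in one image."""
    return Counter(words)


def cooccurring_pairs(keypoints, words, radius):
    """Count unordered visual word pairs whose keypoints are within `radius`.

    Assumption: spatial proximity is used as the co-occurrence criterion.
    """
    pairs = Counter()
    for (i, p), (j, q) in combinations(enumerate(keypoints), 2):
        if math.dist(p, q) <= radius:
            pairs[tuple(sorted((words[i], words[j])))] += 1
    return pairs


# Toy data: a 3-word codebook and four local features from one image.
codebook = [(0.0, 0.0), (1.0, 1.0), (5.0, 5.0)]
descriptors = [(0.1, 0.0), (0.9, 1.2), (5.1, 4.8), (0.0, 0.2)]
keypoints = [(10, 10), (12, 11), (80, 80), (11, 9)]

words = [quantize(d, codebook) for d in descriptors]
histogram = bow_histogram(words)
pairs = cooccurring_pairs(keypoints, words, radius=5.0)

print(words)      # visual word index per feature
print(histogram)  # BoW histogram for the image
print(pairs)      # candidate visual phrases (word pairs) with counts
```

Under these assumptions, pair counts accumulated over many images of one category would be the raw material from which frequently co-occurring pairs could be picked; how the paper scores descriptiveness is not specified in the abstract and is not modelled here.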

