Large-scale content-based audio retrieval from text queries
Source
International Multimedia Conference archive
Proceedings of the 1st ACM international conference on Multimedia information retrieval
Vancouver, British Columbia, Canada
SESSION: Audio retrieval
Pages 105-112  
Year of Publication: 2008
ISBN:978-1-60558-312-9
Authors
Gal Chechik  Google, Mountain View, CA, USA
Eugene Ie  Google, Mountain View, CA, USA
Martin Rehn  Google, Mountain View, CA, USA
Samy Bengio  Google, Mountain View, CA, USA
Dick Lyon  Google, Mountain View, CA, USA
Sponsors
SIGMULTIMEDIA: ACM Special Interest Group on Multimedia
ACM: Association for Computing Machinery
Publisher
ACM  New York, NY, USA
DOI: http://doi.acm.org/10.1145/1460096.1460115

ABSTRACT

In content-based audio retrieval, the goal is to find sound recordings (audio documents) based on their acoustic features. This content-based approach differs from retrieval approaches that index media files using metadata such as file names and user tags. In this paper, we propose a machine learning approach for retrieving sounds that is novel in that it (1) uses free-form text queries rather than sound-sample-based queries, (2) searches by audio content rather than via textual metadata, and (3) can scale to very large numbers of audio documents and a very rich query vocabulary. We handle generic sounds, including a wide variety of sound effects, animal vocalizations and natural scenes. We test a scalable approach based on a passive-aggressive model for image retrieval (PAMIR), and compare it to two state-of-the-art approaches: Gaussian mixture models (GMM) and support vector machines (SVM).
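The abstract does not spell out the PAMIR update, so the following is only a minimal sketch of a generic passive-aggressive ranking step under stated assumptions: a bilinear score q^T W d between a text-query vector q and an audio-feature vector d, a margin of 1, and an aggressiveness cap C. The function names `score` and `pa_rank_update`, the toy vectors, and the value of C are all illustrative and are not taken from the paper.

```python
def score(W, q, d):
    """Assumed bilinear relevance score q^T W d (query vector q, document vector d)."""
    return sum(q[i] * W[i][j] * d[j]
               for i in range(len(q)) for j in range(len(d)))


def pa_rank_update(W, q, d_pos, d_neg, C=1.0):
    """One passive-aggressive step on a (query, relevant doc, irrelevant doc) triplet.

    Returns the (possibly updated) weight matrix and the hinge loss before the update.
    """
    # Ranking hinge loss: the relevant document should outscore the
    # irrelevant one by a margin of at least 1.
    loss = max(0.0, 1.0 - score(W, q, d_pos) + score(W, q, d_neg))
    if loss == 0.0:
        return W, loss  # passive: the constraint is already satisfied

    diff = [p - n for p, n in zip(d_pos, d_neg)]
    # The update direction is the outer product q * diff^T; its squared
    # Frobenius norm factorises as ||q||^2 * ||diff||^2.
    norm_sq = sum(x * x for x in q) * sum(x * x for x in diff)
    if norm_sq == 0.0:
        return W, loss  # degenerate triplet, nothing to learn from

    tau = min(C, loss / norm_sq)  # aggressive step, capped by C
    W_new = [[W[i][j] + tau * q[i] * diff[j] for j in range(len(diff))]
             for i in range(len(q))]
    return W_new, loss


# Toy example: a 2-term vocabulary and 3 acoustic features (made-up values).
W0 = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
query = [1.0, 0.0]
W1, loss_before = pa_rank_update(W0, query, [1.0, 0.0, 0.0], [0.0, 1.0, 0.0])
```

Each step touches a single triplet and costs time proportional to the size of W, which is the kind of property that lets an online method of this family scale to large collections.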

We test our approach on two large real-world datasets: a collection of short sound effects, and a noisier and larger collection of user-contributed, user-labeled recordings (25K files, 2,000-term vocabulary). We find that all three methods achieve very good retrieval performance. For instance, a positive document is retrieved in the first position of the ranking more than half the time, and on average there are more than 4 positive documents in the first 10 retrieved, for both datasets. PAMIR completed both training and retrieval of all data in less than 6 hours for both datasets, on a single machine. It was one to three orders of magnitude faster than the competing approaches. This approach should therefore scale to much larger datasets in the future.
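The two figures quoted above correspond to standard ranking measures: the fraction of queries whose top-ranked document is relevant, and the precision within the first 10 results. As a rough sketch of how such numbers are computed (the helper names and the toy relevance lists are hypothetical, not the paper's evaluation code or data):

```python
def precision_at_k(ranked_relevance, k):
    """Fraction of the first k ranked documents that are relevant.

    ranked_relevance is a list of 0/1 flags in ranked order, best first.
    """
    return sum(ranked_relevance[:k]) / k


def top1_hit_rate(rankings):
    """Fraction of queries whose first-ranked document is relevant."""
    return sum(1 for r in rankings if r and r[0]) / len(rankings)


# Two made-up queries, each with four ranked documents.
rankings = [
    [1, 0, 1, 0],  # relevant document in first position
    [0, 1, 0, 0],  # first relevant document in second position
]
mean_p_at_2 = sum(precision_at_k(r, 2) for r in rankings) / len(rankings)
hit_rate = top1_hit_rate(rankings)
```

Under these definitions, "more than 4 positive documents in the first 10" corresponds to a mean precision at 10 above 0.4, and "first position more than half the time" to a top-1 hit rate above 0.5.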



