ABSTRACT
Image-based videorealistic speech animation achieves significant visual realism, but at the cost of collecting a large 5- to 10-minute video corpus from the specific person to be animated. This requirement hinders its use in broad applications, since a large video corpus for a specific person under a controlled recording setup may not be easily obtained. In this paper, we propose a model transfer and adaptation algorithm that allows a novel person to be animated using only a small video corpus. The algorithm starts with a multidimensional morphable model (MMM) previously trained on a different speaker with a large corpus, and transfers it to the novel speaker using a much smaller corpus. The algorithm consists of 1) a novel matching-by-synthesis algorithm, which semi-automatically selects new MMM prototype images from the new video corpus, and 2) a novel gradient descent linear regression algorithm, which adapts the MMM phoneme models to the data in the novel video corpus. Encouraging experimental results are presented in which a morphable model trained on a performer with a 10-minute corpus is transferred to a novel person using a 15-second movie clip of that person as the adaptation video corpus.
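The abstract names gradient descent linear regression as the adaptation step but does not give its formulation. As a rough illustration of the general idea only, the sketch below fits a one-dimensional affine map from source-speaker values to target-speaker observations by batch gradient descent on mean squared error. The function name `fit_affine_gd`, the toy data, the learning rate, and the scalar (rather than per-phoneme, multidimensional) setting are all assumptions made for this example and are not taken from the paper.

```python
def fit_affine_gd(xs, ys, lr=0.05, steps=5000):
    """Fit y ~ a * x + b by batch gradient descent on mean squared error.

    Hypothetical helper for illustration; not the paper's algorithm.
    """
    a, b = 1.0, 0.0  # start from the identity map, i.e. the unadapted model
    n = len(xs)
    for _ in range(steps):
        grad_a = 0.0
        grad_b = 0.0
        for x, y in zip(xs, ys):
            err = a * x + b - y          # residual of the current fit
            grad_a += 2.0 * err * x / n  # d(MSE)/da
            grad_b += 2.0 * err / n      # d(MSE)/db
        a -= lr * grad_a
        b -= lr * grad_b
    return a, b


# Toy data (assumed): source-speaker parameter values and the values
# observed for the new speaker, generated here from a known affine map.
source_values = [-1.0, -0.5, 0.0, 0.5, 1.0]
target_values = [0.8 * x + 0.3 for x in source_values]

a, b = fit_affine_gd(source_values, target_values)
print(f"a = {a:.4f}, b = {b:.4f}")
```

Starting from the identity map reflects the transfer setting described above: the previously trained model is the initial guess, and the small adaptation corpus only nudges it toward the new speaker.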
CITED BY 6
Kevin Wampler, Daichi Sasaki, Li Zhang, Zoran Popović, Dynamic, expressive speech animation from a single mesh, Proceedings of the 2007 ACM SIGGRAPH/Eurographics symposium on Computer animation, August 02-04, 2007, San Diego, California
Barry-John Theobald, Iain A. Matthews, Jeffrey F. Cohn, Steven M. Boker, Real-time expression cloning using appearance models, Proceedings of the 9th international conference on Multimodal interfaces, November 12-15, 2007, Nagoya, Aichi, Japan
Javier Melenchón, Elisa Martínez, Fernando De La Torre, José A. Montero, Emphatic visual speech synthesis, IEEE Transactions on Audio, Speech, and Language Processing, v.17 n.3, p.459-468, March 2009