JANUARY 2019 | VOL. 62 | NO. 1 | COMMUNICATIONS OF THE ACM 103
consume every day, especially if there is no proof of origin. The presented system also demonstrates the need
for sophisticated fraud detection and watermarking
algorithms. We believe that the field of digital forensics
will receive a lot of attention in the future.
The presented approach is the first real-time facial reenactment system that requires only monocular RGB input. Our live setup enables the animation of legacy video footage, for example from YouTube, in real time. Overall, we believe our system will pave the way for many new and exciting applications in the fields of VR/AR, teleconferencing, and on-the-fly dubbing of videos with translated audio.
One direction for future work is to provide full control over the target head. A properly rigged mouth and tongue model reconstructed from monocular input data will provide control over the mouth cavity, a wrinkle formation model will add fine-scale surface detail for more realistic results, and eye-tracking will enable control over the target's gaze.
Acknowledgments
We would like to thank Chen Cao and Kun Zhou for the
blendshape models and comparison data, as well as Volker
Blanz, Thomas Vetter, and Oleg Alexander for the provided
face data. The facial landmark tracker was kindly provided
by TrueVisionSolution. We thank Angela Dai for the video voice-over and Daniel Ritchie for video reenactment. This
research is funded by the German Research Foundation
(DFG), grant GRK-1773 Heterogeneous Image Systems, the
ERC Starting Grant 335545 CapReal, and the Max Planck
Center for Visual Computing and Communications
(MPC-VCC). We also gratefully acknowledge the support
from NVIDIA Corporation for hardware donations.
Figure 6. Comparison of our RGB tracking to Cao et al. [5] and to RGB-D tracking by Thies et al. [19].
Table 1. Avg. run times for the three sequences of Figure 5, from top to bottom.a

            CPU                   GPU                    FPS
  SparseFT  MouthRT    DenseFT   DefTF    Synth         (Hz)
  5.97 ms   1.90 ms    22.06 ms  3.98 ms  10.19 ms      27.6
  4.85 ms   1.50 ms    21.27 ms  4.01 ms  10.31 ms      28.1
  5.57 ms   1.78 ms    20.97 ms  3.95 ms  10.32 ms      28.4
Figure 7. Dubbing: Comparison to Garrido et al. [8]. (Panels, left to right: input, Garrido et al. 2015, ours.)
Figure 8. Comparison of the proposed RGB reenactment to the RGB-D reenactment of Thies et al. [19]. (Panels, left to right: input, Thies et al. 2015, ours.)
experts. Our approach is a game changer, since it enables editing of videos in real time on a commodity PC, which makes this technology accessible to non-experts. We hope that the numerous demonstrations of our reenactment systems will teach people to think more critically about the video content they
a Standard deviations w.r.t. the final frame rate are 0.51, 0.56, and 0.59 fps, respectively. Note that CPU and GPU stages run in parallel.
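The footnote's parallelism remark explains how the FPS column follows from the stage timings: with the CPU and GPU stages overlapped, steady-state throughput is bounded by the slower of the two per-frame sums, not by their total. A minimal sketch of that arithmetic, assuming this simple pipelined model (the `effective_fps` helper is illustrative, not part of the authors' code):

```python
# Estimate the effective frame rate of a pipelined tracker whose CPU and GPU
# stages run in parallel: throughput is bounded by the slower of the two sides.
# Timings are the per-sequence averages from Table 1, in milliseconds, with
# SparseFT and MouthRT on the CPU and DenseFT, DefTF, and Synth on the GPU.

def effective_fps(cpu_ms, gpu_ms):
    """With CPU and GPU stages overlapped, one frame leaves the pipeline
    every max(sum of CPU stages, sum of GPU stages) milliseconds."""
    bottleneck_ms = max(sum(cpu_ms), sum(gpu_ms))
    return 1000.0 / bottleneck_ms

# First sequence of Table 1: the GPU side (36.23 ms) is the bottleneck.
fps = effective_fps(cpu_ms=[5.97, 1.90], gpu_ms=[22.06, 3.98, 10.19])
print(round(fps, 1))  # 27.6, matching the table's FPS column
```

The same check reproduces the other two rows (28.1 Hz and 28.4 Hz), confirming that the reported frame rates are set by the GPU stage sums.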
References
1. Blanz, V., Vetter, T. A morphable model for the synthesis of 3D faces. Proc. SIGGRAPH (1999), ACM Press/Addison-Wesley Publishing Co.,
2. Bouaziz, S., Wang, Y., Pauly, M. Online modeling for realtime facial animation. ACM TOG 32, 4 (2013), 40.
3. Bregler, C., Covell, M., Slaney, M. Video rewrite: Driving visual speech with audio. Proc. SIGGRAPH (1997),