annotations, as an “outer emotion”, as
perceived by others, as learning target.
This can differ considerably
from the “inner emotion” of an individual. To assess the latter, one will first need
a ground-truth measurement method,
for example, deeper insight into
the cognitive processes as measured
by EEG or other suitable means. Then,
one will also have to develop models
that are robust against differences
between the expressed emotion and the
experienced one, potentially by deriving further information from the
voice that is usually not accessible
to humans, such as heart rate, skin
conductance, current facial expression, body posture, or eye contact [32],
and many further bio-signals.
One can think of many
further interesting challenges, such
as emotion recognition “from a chips
bag” by high-speed camera capture of
the vibrations induced by the acoustic
waves [9], in space, under water, and, of
course, in animal vocalizations.
In this article, I elaborated on making
machines hear our emotions from
end to end: from the early studies on
acoustic correlates of emotion [8, 16, 21, 34]
to the first patent [41] in 1978, the first
seminal paper in the field [10], and the first
end-to-end learning system [37]. We are
still learning. Based on this evolution,
an abstracted summary is shown in
Figure 2 presenting the main features
of a modern engine. Hopefully, current dead-ends, such as the lack of rich
amounts of spontaneous data that allow for coping with speaker variation,
can be overcome. More than 20
years into automatic recognition of
emotion in the speech signal, we are
currently witnessing exciting times of
change: data-learned features, synthesized training material, holistic architectures, and learning in an increasingly autonomous way. All of these
can be expected to soon lead to
broad day-to-day usage in many
health, retrieval, security, and further
beneficial use-cases alongside, after
years of waiting [36], the advent of emotionally intelligent speech interfaces.
The research leading to these results
has received funding from the European
Union’s HORIZON 2020 Framework
Programme under the Grant Agreement
No. 645378.
1. Abdelwahab, M. and Busso, C. Supervised domain
adaptation for emotion recognition from speech. In
Proceedings of ICASSP. (Brisbane, Australia, 2015).
2. Anagnostopoulos, C.-N., Iliou, T. and Giannoukos,
I. Features and classifiers for emotion recognition
from speech: a survey from 2000 to 2011. Artificial
Intelligence Review 43, 2 (2015), 155–177.
3. Bhaykar, M., Yadav, J. and Rao, K. S. Speaker
dependent, speaker independent and cross language
emotion recognition from speech using GMM and
HMM. In Proceedings of the National Conference on
Communications. (Delhi, India, 2013). IEEE, 1–5.
4. Blanton, S. The voice and the emotions. Q. Journal of
Speech 1, 2 (1915), 154–172.
5. Chang, J. and Scherer, S. Learning Representations of
Emotional Speech with Deep Convolutional Generative
Adversarial Networks. arxiv.org, (arXiv:1705.02394),
6. Chen, L., Mao, X., Xue, Y. and Cheng, L.L. Speech
emotion recognition: Features and classification
models. Digital Signal Processing 22, 6 (2012),
7. Cibau, N. E., Albornoz, E. M., and Rufiner, H. L. Speech
emotion recognition using a deep autoencoder. San
Carlos de Bariloche, Argentina, 2013, 934–939.
8. Darwin, C. The Expression of Emotion in Man and
Animals. Watts, 1948.
9. Davis, A., Rubinstein, M., Wadhwa, N., Mysore, G. J.,
Durand, F. and Freeman, W. T. The visual microphone:
Passive recovery of sound from video. ACM Trans.
Graphics 33, 4 (2014), 1–10.
10. Dellaert, F., Polzin, T. and Waibel, A. Recognizing
emotion in speech. In Proceedings of ICSLP 3,
(Philadelphia, PA, 1996). IEEE, 1970–1973.
11. Deng, J. Feature Transfer Learning for Speech
Emotion Recognition. PhD thesis,
Technische Universität München, Germany, 2016.
12. Deng, J., Xu, X., Zhang, Z., Frühholz, S., and Schuller,
B. Semisupervised Autoencoders for Speech Emotion
Recognition. IEEE/ACM Transactions on Audio,
Speech, and Language Processing 26, 1 (2018), 31–43.
13. Devillers, L., Vidrascu, L. and Lamel, L. Challenges
in real-life emotion annotation and machine learning
based detection. Neural Networks 18, 4 (2005),
14. Dhall, A., Goecke, R., Joshi, J., Sikka, K. and Gedeon,
T. Emotion recognition in the wild challenge 2014:
Baseline, data and protocol. In Proceedings of ICMI
(Istanbul, Turkey, 2014). ACM, 461–466.
15. El Ayadi, M., Kamel, M.S., and Karray, F. Survey on
speech emotion recognition: Features, classification
schemes, and databases. Pattern Recognition 44, 3
16. Fairbanks, G. and Pronovost, W. Vocal pitch during
simulated emotion. Science 88, 2286 (1938), 382–383.
17. Gunes, H. and Schuller, B. Categorical and
dimensional affect analysis in continuous input:
Current trends and future directions. Image and
Vision Computing 31, 2 (2013), 120–136.
18. Joachims, T. Learning to classify text using support
vector machines: Methods, theory and algorithms.
Kluwer Academic Publishers, 2002.
19. Kim, Y., Lee, H. and Provost, E. M. Deep learning for
robust feature generation in audiovisual emotion
recognition. In Proceedings of ICASSP, (Vancouver,
Canada, 2013). IEEE, 3687–3691.
20. Koolagudi, S.G. and Rao, K.S. Emotion recognition
from speech: A review. Intern. J. of Speech
Technology 15, 2 (2012), 99–117.
21. Kramer, E. Elimination of verbal cues in judgments
of emotion from voice. The J. Abnormal and Social
Psychology 68, 4 (1964), 390.
22. Kraus, M. W. Voice-only communication enhances
empathic accuracy. American Psychologist 72, 7
23. Lee, C.M., Narayanan, S.S., and Pieraccini, R.
Combining acoustic and language information
for emotion recognition. In Proceedings of
INTERSPEECH, (Denver, CO, 2002). ISCA, 873–876.
24. Leng, Y., Xu, X., and Qi, G. Combining active learning and
semi-supervised learning to construct SVM classifier.
Knowledge-Based Systems 44 (2013), 121–131.
25. Liu, J., Chen, C., Bu, J., You, M. and Tao, J. Speech
emotion recognition using an enhanced co-training
algorithm. In Proceedings of ICME. (Beijing, P.R. China,
2007). IEEE, 999–1002.
26. Lotfian, R. and Busso, C. Emotion recognition using
synthetic speech as neutral reference. In Proceedings
of ICASSP. (Brisbane, Australia, 2015). IEEE,
27. Mao, Q., Dong, M., Huang, Z. and Zhan, Y. Learning
salient features for speech emotion recognition
using convolutional neural networks. IEEE Trans.
Multimedia 16, 8 (2014), 2203–2213.
28. Marsella, S. and Gratch, J. Computationally modeling
human emotion. Commun. ACM 57, 12 (Dec. 2014), 56–67.
29. Picard, R. W. and Picard, R. Affective Computing, vol.
252. MIT Press Cambridge, MA, 1997.
30. Ram, C.S. and Ponnusamy, R. Assessment on speech
emotion recognition for autism spectrum disorder
children using support vector machine. World Applied
Sciences J. 34, 1 (2016), 94–102.
31. Schmitt, M., Ringeval, F. and Schuller, B. At the border
of acoustics and linguistics: Bag-of-audio-words for
the recognition of emotions in speech. In Proceedings
of INTERSPEECH. (San Francisco, CA, 2016). ISCA,
32. Schuller, B. and Batliner, A. Computational
Paralinguistics: Emotion, Affect and Personality in
Speech and Language Processing. Wiley, 2013.
33. Schuller, B., Mousa, A. E.-D., and Vasileios, V. Sentiment
analysis and opinion mining: On optimal parameters
and performances. WIREs Data Mining and
Knowledge Discovery (2015), 5:255–5:263.
34. Soskin, W.F. and Kauffman, P.E. Judgment of emotion in
word-free voice samples. J. of Commun. 11, 2 (1961), 73–80.
35. Stuhlsatz, A., Meyer, C., Eyben, F., Zielke, T., Meier, G.
and Schuller, B. Deep neural networks for acoustic
emotion recognition: Raising the benchmarks. In
Proceedings of ICASSP. (Prague, Czech Republic,
36. Tosa, N. and Nakatsu, R. Life-like communication
agent-emotion sensing character ‘MIC’ and feeling
session character ‘MUSE.’ In Proceedings of the 3rd
International Conference on Multimedia Computing
and Systems. (Hiroshima, Japan, 1996). IEEE, 12–19.
37. Trigeorgis, G., Ringeval, F., Brückner, R., Marchi, E.,
Nicolaou, M., Schuller, B. and Zafeiriou, S. Adieu
features? End-to-end speech emotion recognition
using a deep convolutional recurrent network. In
Proceedings of ICASSP. (Shanghai, P.R. China, 2016).
38. Ververidis, D. and Kotropoulos, C. Emotional speech
recognition: Resources, features, and methods. Speech
Commun. 48, 9 (2006), 1162–1181.
39. Watson, D., Clark, L. A., and Tellegen, A. Development
and validation of brief measures of positive and
negative affect: the PANAS scales. J. of Personality
and Social Psychology 54, 6 (1988), 1063.
40. Weninger, F., Eyben, F., Schuller, B. W., Mortillaro, M.,
and Scherer, K.R. On the acoustics of emotion in audio:
What speech, music and sound have in common.
Frontiers in Psychology 4, Article ID 292 (2013), 1–12.
41. Williamson, J. Speech analyzer for analyzing pitch or
frequency perturbations in individual speech pattern
to determine the emotional state of the person. U.S.
Patent 4,093,821, 1978.
42. Wöllmer, M., Eyben, F., Reiter, S., Schuller, B., Cox, C.,
Douglas-Cowie, E. and Cowie, R. Abandoning emotion
classes—Towards continuous emotion recognition
with modeling of long-range dependencies. In
Proceedings of INTERSPEECH. (Brisbane, Australia,
2008). ISCA, 597–600.
43. Zeng, Z., Pantic, M., Roisman, G.I., and Huang, T.S. A
survey of affect recognition methods: Audio, visual,
and spontaneous expressions. IEEE Trans. Pattern
Analysis and Machine Intelligence 31, 1 (2009), 39–58.
Björn W. Schuller (firstname.lastname@example.org) is a professor
and head of the ZD.B Chair of Embedded Intelligence
for Health Care and Wellbeing at the University of
© 2018 ACM 0001-0782/18/5 $15.00