side code to make experience and/or
algorithmic modifications, that usage
can be easily aggregated and analyzed
to improve assistance offered, and that
assistants can run on third-party hardware, enabling scaling via Amazon [c]
and Google [d] skills kits. Skill availability
is only part of the challenge for digital
assistants; discoverability of skills remains an issue on devices that lack displays [12]. Screen devices can surface skill
recommendations and support recall
of prior skills on headless devices.
The current implementation of Ask
Chef relies on schema.org microdata
for the Web page being accessed. This
markup is used to extract ingredients
and preparation instructions. Important extensions include generalizing
to content that is not enriched with
such data and integrating additional
content to augment the recipe (for
example, refinements from user comments [3]). Another is recommending assistance
for the current step in the task (including instructional content such as videos and Web
pages), while also considering the previous steps, assistance
already offered, and future steps. Determining how to utilize wait times between steps in recipe preparation (for
example, “bake for 20 minutes”) can
be challenging, and users may elect to
spend that time in different ways (from
food preparation to unrelated tasks
[such as email, social media, and other
activities]) [6]. Beyond task guidance, digital assistants could also provide pre-preparation support (for example, adding
items to a shared grocery list) and post-preparation support (for example, helping users
revisit favorite recipes).
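As a rough illustration of the schema.org-based extraction and the wait-time question discussed above, the following Python sketch pulls ingredients and preparation steps out of microdata markup and flags steps that mention a duration. It is not the Ask Chef implementation: the class `RecipeMicrodataParser`, the function `wait_minutes`, and the regular expression are hypothetical names and heuristics chosen for this example. The only things taken from schema.org are the `recipeIngredient` and `recipeInstructions` property names of its Recipe type. The sketch assumes well-formed HTML with explicit closing tags and uses only the Python standard library.

```python
import re
from html.parser import HTMLParser

# HTML elements that never have a closing tag.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class RecipeMicrodataParser(HTMLParser):
    """Hypothetical helper: gathers the text of elements whose itemprop is
    recipeIngredient or recipeInstructions (schema.org Recipe properties)."""

    FIELDS = {"recipeIngredient": "ingredients",
              "recipeInstructions": "instructions"}

    def __init__(self):
        super().__init__()
        self.ingredients = []
        self.instructions = []
        self._target = None  # list being filled while inside a tagged element
        self._depth = 0      # open non-void tags inside that element
        self._chunks = []    # text fragments seen so far

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            if self._target is not None:
                self._chunks.append(" ")  # treat <br> and similar as a space
            return
        if self._target is not None:
            self._depth += 1
            return
        props = (dict(attrs).get("itemprop") or "").split()
        for prop in props:
            if prop in self.FIELDS:
                self._target = getattr(self, self.FIELDS[prop])
                self._depth = 1
                self._chunks = []
                break

    def handle_endtag(self, tag):
        if tag in VOID_TAGS or self._target is None:
            return
        self._depth -= 1
        if self._depth == 0:
            # Collapse whitespace and store the finished text.
            text = " ".join("".join(self._chunks).split())
            if text:
                self._target.append(text)
            self._target = None

    def handle_data(self, data):
        if self._target is not None:
            self._chunks.append(data)


# Assumed heuristic: a phrase like "for 20 minutes" signals a timed step.
WAIT_RE = re.compile(r"\bfor\s+(\d+)\s*(hours?|hrs?|minutes?|mins?)\b",
                     re.IGNORECASE)


def wait_minutes(step):
    """Return the duration in minutes mentioned in a step, or None."""
    match = WAIT_RE.search(step)
    if match is None:
        return None
    amount = int(match.group(1))
    return amount * 60 if match.group(2).lower().startswith("h") else amount


if __name__ == "__main__":
    page = """
    <div itemscope itemtype="https://schema.org/Recipe">
      <ul>
        <li itemprop="recipeIngredient">2 cups flour</li>
        <li itemprop="recipeIngredient">1 egg</li>
      </ul>
      <p itemprop="recipeInstructions">Mix well.<br>Bake for 20 minutes.</p>
    </div>
    """
    parser = RecipeMicrodataParser()
    parser.feed(page)
    parser.close()
    print(parser.ingredients)
    for step in parser.instructions:
        print(step, "->", wait_minutes(step))
```

A duration phrase does not by itself distinguish idle time from active work ("stir for 2 minutes" is not a wait), so a real system would need more than this pattern match to decide when to offer other activities.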
MDXs enable many new scenarios. Support
for guided task completion can
extend beyond cooking to include procedural
tasks such as home improvement,
shopping, and travel planning.
Other scenarios such as homework
assistance, games, puzzles, or calendar
management could be enhanced
via MDXs. Support for these scenarios
could be authored by third parties. For
example, educators could compose or
flag visual/multimedia content to accompany
tutorial/quiz materials, to
c See https://amzn.to/2cDSN3K
d See https://bit.ly/2NC7VnF
As experience with devices such as the
Amazon Echo Show has demonstrated,
augmenting voice-based digital assistants
with a screen can also enable new
scenarios (for example, “drop ins”: impromptu
video calls). This adds value
even though the screen is small; more
would be possible with a larger, higher-resolution
display that could be located
as far from the smart speaker as needed.
The user-facing camera (webcam, infrared
camera) on many laptops and tablets
can add vision-based skills such as
emotion detection and face recognition
to smart speakers. Powerful processors
in tablets and laptops enable on-device
computation to help address privacy concerns
associated with handling sensitive
image and video data.
Multi-device digital assistance is
not limited to a single, static device
pairing. For example, it includes scenarios such as dynamically connecting a smartphone and any one of many
low-cost smart speakers as users move
around a physical space; imbuing, say,
any Amazon Echo Dot with the capabilities of an Echo Show. Although we
targeted MDXs comprising two devices, there are situations where three or
more could be used (for example, adding a smartphone to Ask Chef for timer
tracking and alerting); these experiences must be carefully designed to avoid
overwhelming users. Multi-device interactions can also help correct errors
in speech recognition and yield useful
data to improve voice interfaces [9].
In sum, MDXs unlock a broad range
of more sophisticated digital assistant
scenarios than are possible with a single
device or via CDXs. Utilizing complementary
devices simultaneously could lead to
more efficient completion of current tasks and cost savings for device consumers, and could unlock new classes of digital
assistant skills to help people better perform a broader range of activities.
1. Carney, R.N. and Levin, J.R. Pictorial illustrations still
improve students’ learning from text. Educational
Psychology Review 14, 1 (Jan. 2002), 5–26.
2. Dong, T., Churchill, E.F., and Nichols, J. Understanding
the challenges of designing and developing multi-device
experiences. In Proceedings of the 2016 ACM Conference
on Designing Interactive Systems (2016), 62–72.
3. Druck, G. and Pang, B. Spice it up? Mining refinements
to online instructions from user generated content. In
Proceedings of the 50th Annual Meeting of the Association
for Computational Linguistics (2012), 545–553.
4. Jokela, T., Ojala, J., and Olsson, T. A diary study on
combining multiple information devices in everyday
activities and tasks. In Proceedings of the 33rd Annual
ACM Conference on Human Factors in Computing
Systems (2015), 3903–3912.
5. Kiddon, C. et al. Mise en Place: Unsupervised
interpretation of instructional recipes. In Proceedings
of Empirical Methods on Natural Language Processing,
6. Müller, H., Gove, J., and Webb, J. Understanding tablet
use: A multi-method exploration. In Proceedings
of the 14th International Conference on Human-Computer Interaction with Mobile Devices and
Services (2012), 1–10.
7. Pardal, J.P. and Mamede, N.J. Starting to cook a
coaching dialogue system in the Olympus framework.
In Proceedings of the Paralinguistic Information
and Its Integration in Spoken Dialogue Systems
8. Segerståhl, K. Crossmedia systems constructed
around human activities: A field study and implications
for design. In Proceedings of the IFIP Conference on
Human-Computer Interaction (2009), 354–367.
9. Springer, A. and Cramer, H. Play PRBLMS: Identifying
and correcting less accessible content in voice
interfaces. In Proceedings of the ACM SIGCHI
Conference on Human Factors in Computing Systems
10. Sørensen, H. et al. The 4C framework: Principles of
interaction in digital ecosystems. In Proceedings of
the 2014 ACM International Joint Conference on
Pervasive and Ubiquitous Computing (2014), 87–97.
11. Weiser, M. The computer for the 21st century. Scientific
American Special Issue on Communications,
Computers and Networks (1991), 94–104.
12. White, R.W. Skill discovery in virtual assistants.
Commun. ACM 61, 11 (Nov. 2018), 106–113.
Ryen W. White (firstname.lastname@example.org) is Partner
Research Manager at Microsoft Research AI, Redmond,
WA, USA.
Adam Fourney (email@example.com) is Senior
Researcher at Microsoft Research AI, Redmond, WA, USA.
Allen Herring (firstname.lastname@example.org) is Principal
Research Engineer at Microsoft Research AI, Redmond,
WA, USA.
Paul N. Bennett (email@example.com) is Senior
Principal Research Manager at Microsoft Research AI,
Redmond, WA, USA.
Nirupama Chandrasekaran (firstname.lastname@example.org) is
Principal Research Engineer at Microsoft Research AI,
Redmond, WA, USA.
Robert Sim (email@example.com) is Principal Applied Science
Manager at Microsoft Research AI, Redmond, WA, USA.
Elnaz Nouri (firstname.lastname@example.org) is Senior Applied
Scientist at Microsoft Research AI, Redmond, WA, USA.
Mark J. Encarnación (email@example.com) is
Principal Development Manager at Microsoft Research AI,
Redmond, WA, USA.
Copyright held by authors.