former. Our recent work [3] has been
examining hours of Amazon Echo use
from domestic settings. The Echo is a
speech-enabled smart speaker from
Amazon that uses the Alexa Voice
Service. Like other offerings from
Google or Apple, the Echo is designed
to play music, answer questions, and
help with functions such as cooking,
calendars, and shopping. The Alexa
service itself is also being integrated
into familiar household appliances
and smart home items (e.g., the
AmazonBasics microwave, the Nest
Learning Thermostat, and the Nest
Hello video doorbell), with Alexa
acting as a gateway to a household
Internet of Things.
01 Nikos  Alexa
02        (2.6)
03 Isabel play some New Year’s music
04        (1.7)
05 Alexa  here’s a station for jazz music (.) instrumental jazz.
06        (1.4)
07        ((music starts playing))
08        (4.4)
09 Isabel Al(h)exa this is not what we w(h)anted
10        ((laughter))
11 Nikos  Alexa: (0.8) shut up.
12        (0.8)
13 Isabel H E:Yuh (0.5) Alex(h)a (.) Nikos apologises for being so
14        rude
15 Alexa  hi there
16        (2.2) ((music is still playing))
17 Nikos  Alexa stop (0.7) stop
18        ((music stops))

Figure 1. Fragment 1 of interactions between Isabel, Nikos, and Alexa.

As part of the study, an Echo was deployed in five households for a month at a time, along with a custom-built recording device (a Conditional Voice Recorder or CVR; https://github.com/MixedRealityLab/conditional-voice-recorder) that records audio continuously from an embedded conference microphone but retains only the last minute in a temporary buffer. The CVR ran its own speech recognition in parallel, trained to detect the wake word (in this case, "Alexa"), meaning we were able to store a minute before and a minute after periods of Echo use and thus capture something of the circumstances leading up to and following that use. Members of the participating households could see when the CVR was recording and choose to turn it off with the press of a button.

Our study is not designed as a reflection on the Amazon Echo or voice interfaces per se. There are emerging critiques of voice assistants, including discussions of their gendered or biased character [5,6], connected with concerns about inbuilt bias in the training data they draw upon. Instead, we are interested in delving deeper into how participants in the study encountered and dealt with interactional trouble.
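The CVR's record-continuously-but-retain-selectively behavior can be thought of as a rolling buffer paired with a wake-word trigger. The following Python sketch is purely illustrative, not the actual CVR implementation (which lives in the linked repository); the one-second chunking, the class, and all names are our own assumptions.

```python
from collections import deque

# Audio is assumed to arrive as one-second chunks; only the most recent
# 60 seconds are retained unless the wake word is detected.
BUFFER_SECONDS = 60

class ConditionalRecorder:
    """Hypothetical sketch of a conditional voice recorder."""

    def __init__(self):
        self.buffer = deque(maxlen=BUFFER_SECONDS)  # rolling pre-roll
        self.post_roll_remaining = 0  # chunks still to save after a trigger
        self.saved = []               # chunks retained for later analysis

    def on_audio_chunk(self, chunk, wake_word_detected=False):
        if wake_word_detected:
            # Keep the minute *before* the wake word, then continue
            # saving for a minute *after* it.
            self.saved.extend(self.buffer)
            self.buffer.clear()
            self.saved.append(chunk)
            self.post_roll_remaining = BUFFER_SECONDS
        elif self.post_roll_remaining > 0:
            self.saved.append(chunk)
            self.post_roll_remaining -= 1
        else:
            # Deque with maxlen silently drops chunks older than a minute.
            self.buffer.append(chunk)
```

Everything outside the two-minute window around a wake-word detection is discarded, which is what lets such a device run continuously in a home while storing only the talk surrounding device use.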
We adopt an ethnomethodological conversation-analytic approach [4], concerned with how members of social settings, as lay sociologists, treat one another's activities as primordially social actions. For this article, a critical point is that talk is action. Language does things. When we talk, we are trying to get something done, and done together.

Two issues need clearing up. First, the broader conversation about conversation conflates different uses of the word; here, we are talking about conversation in the sense of literal verbal utterances to and around speech-detecting and dialogue-managing technology. We are not discussing design approaches that might be styled "conversational" (perhaps the latest metaphor with which to sell design work). Second, we need to recognize that the primary enabling force behind the spread of voice interfaces resides in significant deep-learning-driven advances on the recognition side of these systems (speech-to-text in particular). The dialogue side is a different story altogether, and therein lies a major challenge, although from a user's point of view the technical distinction is meaningless.

Here we present a set of short transcribed fragments from our data. While troubles are a routine feature of everyday conversation [7], many kinds of trouble encountered by users of voice interfaces are unlikely to disappear entirely as a function of incremental advances in the underlying technologies; instead, addressing them often rests on improving design understanding first. The ways in which troubles are encountered and dealt with turn out to be quite revealing and, we hope, offer opportunities for conceptual development around what it means to design interactions with conversationalists. We explore these troubles in two ways. First, we examine how revealing they are of the social organization and moral order of the everyday home environments that these devices sit within. Second, driven by comparing moments of trouble, we identify alternative concepts to conversation when considering the design of voice interfaces.

VOICE INTERFACES ARE EMBEDDED IN THE MORAL ORDER OF EVERYDAY LIFE

Perhaps the most obvious thing we notice about participants' interactions with Alexa is how they become embedded in the complex yet highly ordered life of the home. The world these interactions enter is built upon everyday and largely unstated shared understandings about how things normally proceed, as well as the concomitant moral organization of those understandings. With our first fragment we will begin to unpack these ideas.

In Fragment 1 (Figure 1), we join Nikos and Isabel. Nikos is hosting a New Year's party and is trying to get the Echo he was given as part of the study to play some suitable music.

In the fragment, Nikos and Isabel jointly produce the first instruction to Alexa: to "play some New Year's music." Alexa responds (line 05), and Isabel's negative assessment of this response is that the music is "not what we wanted," further reinforced by her laughter. Now, as competent conversationalists, people routinely work within the complexity of categorization [8]. It is not categories of genre or artist or song that Isabel is asking for (categories that tend to work more easily as search keywords) but rather a set of quite disparate songs that