
Design for Voice Interfaces

by Laura Klein
Copyright © 2016 O'Reilly Media, Inc. All rights reserved.
Printed in the United States of America.
Published by O'Reilly Media, Inc., 1005 Gravenstein Highway North, Sebastopol, CA 95472.
O'Reilly books may be purchased for educational, business, or sales promotional use. Online editions are also available for most titles (http://safaribooksonline.com). For more information, contact our corporate/institutional sales department: 800-998-9938 or [email protected].

Acquisitions Editor: Mary Treseler
Editor: Angela Rufino
Production Editor: Matthew Hacker
Copyeditor: Octal Publishing, Inc.
Interior Designer: David Futato
Cover Designer: Randy Comer
Illustrator: Rebecca Demarest

October 2015: First Edition

Revision History for the First Edition
2015-10-12: First Release

While the publisher and the author have used good faith efforts to ensure that the information and instructions contained in this work are accurate, the publisher and the author disclaim all responsibility for errors or omissions, including without limitation responsibility for damages resulting from the use of or reliance on this work. Use of the information and instructions contained in this work is at your own risk. If any code samples or other technology this work contains or describes is subject to open source licenses or the intellectual property rights of others, it is your responsibility to ensure that your use thereof complies with such licenses and/or rights.

978-1-4919-3458-6
[LSI]
Table of Contents

1. Design for Voice Interfaces
   A (Very) Brief History of Talking to Computers
   A Bit About Voice and Audio Technology
   VUI versus GUI: What's New and What's Not
   Pure VUI versus Multimodal Interfaces
   How Long Until Star Trek???
   Resources
CHAPTER 1
Design for Voice Interfaces

The way we interact with technology is changing dramatically again. As wearables, homes, and cars become smarter and more connected, we're beginning to create new interaction modes that no longer rely on keyboards or even screens. Meanwhile, significant improvements in voice input technology are making it possible for users to communicate with devices in a more natural, intuitive way.

Of course, for any of this to work, designers are going to need to learn a few things about creating useful, usable voice interfaces.

A (Very) Brief History of Talking to Computers

Voice input isn't really new, obviously. We've been talking to inanimate objects, and sometimes even expecting them to listen to us, for almost a hundred years. Possibly the first voice-activated product was a small toy called Radio Rex, produced in the 1920s (Figure 1-1). It was a spring-activated dog that popped out of a little dog house when it heard a sound in the 500 Hz range. It wasn't exactly Siri, but it was pretty impressive for the time.

The technology didn't begin to become even slightly useful to consumers until the late 1980s, when IBM created a computer that could kind of take dictation. It knew a few thousand words, and if you spoke them very slowly and clearly in unaccented English, it would show them to you on the screen. Unsurprisingly, it didn't really catch on.

Figure 1-1. Radio Rex.

And why would it? We've been dreaming about perfect voice interfaces since the 1960s, at least. The computer from Star Trek understood Captain Kirk perfectly and could answer any question he asked. HAL, the computer from 2001: A Space Odyssey, although not without one or two fairly significant bugs, was flawless from a speech input and output perspective.

Unfortunately, reality didn't begin to approach fiction until fairly recently, and even now there are quite a few technical challenges that we need to take into consideration when designing voice interfaces.
Quite a bit of progress was made in the 1990s, and voice recognition technology improved to the point that people could begin using it for a very limited number of everyday tasks. One of the first uses for the technology was voice dialing, which allowed people to dial up to ten different phone numbers on their touch-tone phones just by speaking the person's name. By the 2000s, voice recognition had improved enough to enable Interactive Voice Response (IVR) systems, which automated phone support systems and let people confirm airplane reservations or check their bank balances without talking to a customer-support representative.
It's not surprising that when Siri first appeared on the iPhone 4S in 2011, many consumers were impressed. Despite her drawbacks, Siri was the closest we had come to asking the Star Trek computer for life-form readings from the surface of the planet. Then IBM's supercomputer, Watson, beat two former champions of the game show Jeopardy! by using natural-language processing, and we moved one step closer to technology not just recognizing speech, but really understanding and responding to it.

Toys have also come a long way from Radio Rex. The maker of the iconic Barbie doll, Mattel, unveiled a prototype of Hello Barbie in February of 2015 (Figure 1-2). She comes with a WiFi connection and a microphone, and she can have limited conversations and play interactive, voice-enabled games.

Figure 1-2. Hello Barbie has a microphone, speaker, and WiFi connection.

From recognizing sounds to interpreting certain keywords to understanding speech to actually processing language, the history of designing for voice has been made possible by a series of amazing technological breakthroughs. The powerful combination of speech recognition with natural-language processing is creating huge opportunities for new, more intuitive product interfaces.
Although few of us are worried about Skynet (or Barbie) becoming sentient (yet), the technology continues to improve rapidly, which creates a huge opportunity for designers who want to build easier-to-use products. But it's not as simple as slapping a microphone on every smart device. Designers need to understand both the benefits and constraints of designing for voice. They need to learn when voice interactions make sense and when they will cause problems. They need to know what the technology is able to do and what is still impossible.

Most important, everybody who is building products today needs to know how humans interact with talking objects and how to make that conversation happen in the most natural and intuitive way possible.

A Bit About Voice and Audio Technology

Before we can understand how to design for voice, it's useful to learn a little bit about the underlying technology and how it's evolved. Design is constrained by the limits of the technology, and the technology here has a few fairly significant limits.

First, when we design for voice, we're often designing for two very different things: voice inputs and audio outputs. It's helpful to think of voice interfaces as a conversation, and, as the designer, you're responsible for ensuring that both sides of that conversation work well.

Voice input technology is also divided into two separate technical challenges: recognition and understanding. It's not surprising that some of the very earliest voice technology was used only for taking dictation, given that it's far easier to recognize words than it is to understand their meaning.

All of these things (recognition, understanding, and audio output) have progressed significantly over the past 20 years, and they're still improving. In the '90s, engineers and speech scientists spent thousands of hours training systems to recognize a few specific words. These are known as finite state grammars because the system is only capable of recognizing a finite set of words or phrases. You can still see a lot of these in IVRs, which are sometimes known as those annoying computers you have to talk to when you call to change your flight or check your bank balance.
As the technology improves, we're building more products with statistical language models. Instead of matching against a finite set of specific words or phrases, the system makes decisions about how likely it is that a particular set of phonemes resolves to a particular text string. In other words, nobody has to teach Siri the exact phrase "What's the weather going to be like in San Diego tomorrow?" Siri can probabilistically determine how likely it is that the sounds coming out of your mouth translate into this particular set of words and then map those words to meanings.
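To make the distinction concrete, here is a minimal sketch in Python, with invented utterances and made-up scores rather than any vendor's actual engine. A finite state grammar simply accepts or rejects an utterance, while a statistical model scores competing transcriptions against both the audio and the language and keeps the most probable one:

    # Finite-state grammar: only utterances the system was explicitly
    # built to recognize are accepted; everything else fails.
    GRAMMAR = {"check my balance", "order new checks", "transfer money"}

    def recognize_finite_state(utterance):
        return utterance if utterance in GRAMMAR else None

    # Statistical language model: each candidate transcription gets an
    # acoustic score (how well the sounds match the words) weighted by a
    # language score (how plausible that word sequence is), and the
    # best-scoring candidate wins. The numbers here are invented.
    def recognize_statistical(candidates):
        return max(candidates, key=lambda c: candidates[c][0] * candidates[c][1])

    candidates = {
        "what's the weather in san diego tomorrow": (0.70, 0.20),
        "what's the whether in san diego tomorrow": (0.72, 0.001),  # sounds alike, linguistically unlikely
    }
    print(recognize_finite_state("check my balance"))   # accepted
    print(recognize_finite_state("what's my balance"))  # None: not in the grammar
    print(recognize_statistical(candidates))            # "weather" wins despite slightly weaker acoustics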
This sort of recognition, along with a host of other machine-learning advances, has made Natural-Language Processing (NLP) possible, although not yet perfect. As NLP improves, we get machines that not only understand the sounds we're making but also understand the meaning of the words and respond appropriately. It's the kind of thing that humans do naturally, but that seems borderline magical when you get a computer to do it.

VUI versus GUI: What's New and What's Not

These recent technological advances are incredibly important for voice user interface (VUI) designers simply because they are making it possible for us to interact with devices in ways that 10 or 20 years ago would have been the stuff of science fiction. However, to take full advantage of this amazing new technology, we're going to have to learn the best way to design for it. Luckily, a lot of the things that are core to user experience (UX) design are also necessary for VUI design. We don't need to start from scratch, but we do need to learn a few new patterns.
The most important part of UX design is the user (you know, that human being who should be at the center of all of our processes), and luckily that's no different when designing for voice and audio. Thomas Hebner, senior director of UX design practice and professional services product management at Nuance Communications, has been designing voice interfaces for 16 years. He thinks that the worst mistakes in voice design happen when user goals and business goals don't line up.

Great products, regardless of the interaction model, are built to solve real user needs quickly, and they always fit well into the context in which they're being used. Hebner says, "We need to practice contextually aware design. If I say, 'Make it warmer in my house,' something should know if I mean the toast or the temperature. That has nothing to do with speech recognition or voice design. It's just good design where the input is voice."
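As a thought experiment, the routing logic Hebner describes could be as small as the Python sketch below. The device names and context fields are invented for illustration; the point is that the same spoken intent resolves differently depending on context:

    def route_warmer_intent(context):
        # Standing at a running toaster, "warmer" almost certainly
        # means the toast, not the whole house.
        if context.get("nearest_appliance") == "toaster" and context.get("toaster_active"):
            return "toaster.increase_browning"
        # Otherwise fall back to the whole-home default.
        return "thermostat.increase_setpoint"

    print(route_warmer_intent({"nearest_appliance": "toaster", "toaster_active": True}))
    print(route_warmer_intent({"nearest_appliance": "sofa"}))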
This is important. Many things about designing for voice (understanding the user, knowing the context of use, and ensuring that products are both useful and usable) are exactly the same as designing for screens, or services, or anything else. That's good news for designers who are used to building things for Graphical User Interfaces (GUIs) or for systems, because it means that all of the normal research and logic skills transfer very nicely when incorporating speech into designs. If you understand the basic User-Centered Design process and have applied it to apps, websites, systems, or physical products, many of your skills are completely transferable.

Yet, there are several VUI-specific things that you won't have run into when designing for other sorts of interactions, and they're important to take into consideration.

Conversational Skills
Content and tone are important in all design, but when designing for speech output, they take on an entirely new meaning. The best voice interface designs make the user feel like she's having a perfectly normal dialog, but doing that can be harder than it sounds. Products that talk don't just need to have good copy; they must have good conversations. And it's harder for a computer to have a good conversation than a human.
Tony Sheeder, senior manager of user experience design at Nuance Communications, has been with the company for more than 14 years and has been working in voice design for longer than that. As he explains it:

Each voice interaction is a little narrative experience, with a beginning, a middle, and an end. Humans just get this and understand the rules naturally, some more than others. When you go to a party, you can tell within a very short time whether another person is easy to talk to. Until recently, speech systems were that guy at the party doing everything wrong, and nobody wanted to talk to them.
While many early voice designers had a background in linguistics, Sheeder's background was originally writing scripts for interactive games, which helped him write more natural conversations. But designing for voice communication wasn't always successful. Early voice interfaces often made people uncomfortable because the designers felt as if people would need explicit instructions. They'd say things like, "Do you want to hear your bank balance? Please say yes or no." This violates basic rules of conversation. Sheeder felt that these interfaces made people feel strange because the IVR would talk to you like it was human, but would instruct you to talk to it like a dog. "It was like talking to a really smart dog."

Designing for better conversational skills

Many designers argue that copywriting is an integral part of the user experience, and we should be better at it. That's absolutely the case for voice and speech design. If you want to incorporate voice interactions in your products, you're going to need to learn to make them sound right, and that means learning a few important rules.
Keep it short, but not too short
Marco Iacono, who designs products at Viv Labs, explains, "When using text-to-speech, the experience can become frustrating if the system is too chatty. Especially in hands-free scenarios, the system must be concise and the user should control the pace of the interaction." In part, that can mean writing dialogs that are short, but not too short. Marco knows what he's talking about. Before his present position at Viv Labs, he spent several years as a Siri EPM at Apple, where he worked on iOS, CarPlay, and Apple Watch.
Written language is fundamentally different from spoken language. When you first start writing dialogs, you might find that they sound stilted or just too long when spoken out loud by the product. That's normal. You want to keep all utterances much shorter than you'd expect. If you don't, people will become frustrated and begin cutting off the system, potentially missing important information.
On the other hand, you need to be careful not to omit anything really critical. Sheeder talked about the early days of voice design for call-center automation, when the entire goal was to keep everything as short as possible. There was a belief that shaving 750 milliseconds off a call would increase efficiency. But by shaving off connector words and transitions, designers actually increased the cognitive load on the user and lowered perceived efficiency. When the responses became too fast, it put more pressure on listeners, and they would grow frustrated or confused because they couldn't process the information. It ended up making the call centers less efficient.
Create a personality
People treat things that talk back to them as humans, and humans (most of them, anyway) have fairly consistent personalities. The same is true of VUIs. Siri has a different personality from Microsoft's Cortana, and they're both different from the Amazon Alexa.
Karen Kaushansky, director of experience at a stealth startup, has worked in voice technology since she began working at Nortel in 1996. She explains that successful voice interfaces have personas that are interesting, but also goal-based. Are you looking to get through tasks quickly? To encourage repeat engagement? Different voice personas have different effects for the user.
Having a consistent personality will also help you to design better dialogs. It helps you make decisions about how your interface will talk to the user. In many ways, a voice persona is similar to a style guide for a visual product. It can help you decide what tone and words you should use. Will your interface be helpful? Optimistic? Pushy? Perky? Snarky? Fun? Again, it all depends on what the goals are for your product and your user. Whatever the choice, remember that both you and your users are going to have to live with this particular interface for a very long time, so make sure it's a personality that doesn't become grating over time.
One thing to consider when you're building a personality is how human you're going to make it. Marco Iacono warns that, "There's a sliding scale from purely functional to anthropomorphic. As you get closer to the anthropomorphic end of the scale, user expectations grow tremendously. Instantly, people expect it to understand and do more." The risk of making your product's personality seem very human is that your users might be disappointed and frustrated as soon as they find the limitations of the system.
Listen to yourself
To ensure that your conversations sound natural and efficient (not irritating), you're going to need to do a lot of testing. Of course, you should be usability testing your designs, but before you even get there, you can begin to improve your ability to write for voice interfaces. Abi Jones, an interaction designer at Google who does experimental work with voice interfaces and the Internet of Things (IoT), suggests role-playing the voice UI with someone else in order to turn it into a real dialog and listen to how it sounds. She then uses accessibility tools to listen to her computer reading the dialog.
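A cheap way to do this kind of listening at your desk is to pipe a drafted dialog through a text-to-speech engine and hear how it flows. This Python sketch assumes the third-party pyttsx3 package (pip install pyttsx3); the dialog lines are invented examples:

    import pyttsx3

    dialog = [
        ("SYSTEM", "It looks like rain in San Diego tomorrow. Want the weekly forecast?"),
        ("USER", "Sure."),
        ("SYSTEM", "Rain through Thursday, clearing Friday."),
    ]

    engine = pyttsx3.init()
    for speaker, line in dialog:
        print(f"{speaker}: {line}")  # follow along on screen while listening
        engine.say(line)
    engine.runAndWait()  # blocks until all queued lines have been spoken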
Of course, none of these rules are entirely different from things we encounter in designing for screens or services. When we're writing for any product, we should maintain a consistent tone, keep it short, and usability-test everything, too. These are all skills we need as UX designers in any context. However, it does take a few adjustments to apply these patterns when speech is the primary method of input and output.

Discoverability and Predictability

Discoverability and predictability are definitely concerns when you're designing interfaces for which the primary input method is voice, especially if you're taking advantage of NLP. This makes a lot of sense when you consider the difference between a visual interface and a voice interface.

Natural-language interfaces put the entire burden of deciding what to ask for on the user, while visual interfaces can give the user context clues such as interrogatory prompts or even explicit selection choices. When you go to your bank's website, you're often presented with several options; for example, whether you want to log in, learn more about opening an account, or find a branch.

Imagine if your bank were more like Google (Figure 1-3). You just went to the site and were given a prompt to ask a question. Sometimes that would work fine. If you wanted to check your balance or order checks, it might be much easier to do as a conversation: "I need new checks." "Great, what's your account number?" And so on.

Figure 1-3. "OK Google, tell me about unicorns."

But what if you thought you wanted to open a new business account tied to your old savings account, and there were several options to choose from, each with different fee structures and options? That's a much harder conversation to start, because you might not even know exactly what to ask for. You might never even realize that the business plans existed if you didn't know to ask for them.

This sort of discoverability is a serious problem when designing for open prompt voice interfaces. When Abi Jones first began designing for voice, she carried around a phony voice recorder and treated it like a magic device that could do whatever she wanted it to do. "It made me realize how hard it was to say what I wanted in the world," she says.
Even in voice interfaces that limit inputs and make functionality extremely discoverable (like IVRs that prompt the user to say specific words), designers still must deal with a level of unpredictability in response that is somewhat unusual when designing for screens. Most of our selections within a visual product are constrained by the UI. There are buttons or links to click, options to select, sliders to slide. Of course, there is occasional open-text input, but that's almost always in a context for which it makes sense. When you type anything into the search box on Google, you're doing something predictable with that information, even if the input itself is unpredictable.
Siri, on the other hand, must decide what to do with your input based on the type of input. Does she open an app? Search the web? Text someone in your contacts list? The unpredictability of the input can be a tricky thing for designers to deal with, because we need to anticipate far more scenarios than we would if we constrained the user's input or even let the user know what he could do.

Designing for better discoverability and predictability

If you want to make features within your voice interface more discoverable, one option is to make your interface more proactive. Instead of forcing users to come up with what they want all on their own, start the conversation.

Karen Kaushansky thinks that Cortana does this especially well. If you're in the car with headphones on and you get a text message, Cortana knows you're driving, announces the text message, and asks if you want it read. It won't do that if your headphones aren't in, because it might not be private. It knows the context, and it starts the dialog with you rather than making you request that the conversation be started.

By triggering user prompts based on context, as Cortana does, you can help users discover features of your interface that they might not otherwise know existed. In this case, the user learns that text messages can be read aloud.
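In code, the Cortana pattern described above could be a context check as simple as this Python sketch (the context fields and wording are invented for illustration):

    def incoming_text_behavior(context):
        driving = context.get("driving", False)
        headphones = context.get("headphones_in", False)
        if driving and headphones:
            # Start the dialog proactively; the user also discovers
            # that read-aloud exists without having to ask.
            return "announce: 'New text from Sam. Want me to read it?'"
        if driving:
            return "silent notification"  # no headphones, so reading aloud may not be private
        return "normal on-screen notification"

    print(incoming_text_behavior({"driving": True, "headphones_in": True}))
    print(incoming_text_behavior({"driving": True, "headphones_in": False}))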
The other option is simply to explain to users what they should say. Many IVRs that tried NLP have now gone back to giving users prompts. For example, instead of asking, "What do you need help with today?" your bank's telephone system might say something like, "What do you need help with? You can say Bank Balance, Order New Checks, Transfer Money," and so on. Kaushansky points out that in some cases, even though the technology is more primitive, it's easier for users: "Using 'You can say...' can be better. Otherwise people don't know what to say."

Privacy and Accessibility
One of the most troubling aspects of voice interfaces, especially voice-only ones, is the obvious fact that everything might be audible. Now, that's probably fine when asking Alexa to play you some show tunes (Figure 1-4), but it's less fine when you're at work in an open-plan office trying to access your health records. Again, context is everything.
Rebecca Nowlin Green, principal business consultant at Nuance Communications, helps Nuance's clients define their customer service experiences by incorporating speech recognition and other self-service technologies. She explains that well-designed voice interfaces should always have a fallback input method for any sensitive information.
Accessibility can also be an issue. Although voice recognition is quite good, its accuracy can be significantly reduced by non-native speakers, background noise, or even a bad phone connection in the case of IVRs. Abi Jones pointed out that you need to shout louder than the music playing on the Amazon Alexa to turn the volume down. The environment in which you're interacting with a product can have a huge impact on accessibility and ease of use.
Conversely, better voice UIs and audio output can increase the accessibility of products for people with poor vision or who have trouble typing or tapping on mobile screens. Smart homes can make everyday tasks easier for people with limited mobility by allowing access to devices without having to physically reach them.

Figure 1-4. Amazon Alexa and companion iPad app.

Designing for better privacy and accessibility

The critical pieces of designing for better privacy are understanding what your users consider private information and predicting the context in which that information will be used. Sure, nobody wants to yell a social security number out loud on the train. But even email subject lines or the names of people who are sending text messages can be embarrassing in some situations.

Giving the user the ability to easily take a device out of voice mode and accommodating alternate methods of input and output are crucial to improving privacy. A more responsive audio output system would also help. Abi Jones points out, "When you whisper to a human, they'll often whisper back. We adjust instantly to cues from the people around us." Our devices don't get this, so a mobile phone always talks back at the same volume as the ringtone. That's not necessarily what you want. Creating voice interfaces that take into account things such as ambient noise, location, and the volume of the input could dramatically improve privacy outcomes.
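A sketch of the "whisper back" idea in Python: scale the reply volume to the level of the user's own speech instead of the ringtone setting. The decibel thresholds and the 0.0 to 1.0 volume scale are invented for illustration:

    def reply_volume(input_level_db):
        if input_level_db < 45:   # roughly a whisper
            return 0.2            # whisper back
        if input_level_db < 65:   # normal conversation
            return 0.5
        return 0.9                # shouting over noise, so be loud enough to hear

    for level in (40, 60, 80):
        print(f"user spoke at ~{level} dB -> reply at volume {reply_volume(level)}")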
Better understanding of context can also help with accessibility issues. Nazmul Idris, an ex-Googler and founder of the startup TRNQL, is working on an SDK to make it easier for designers to take context into account when creating input and output systems. In one of the early sample apps being developed with TRNQL's technology, he showed how, when the user is sitting still, the default input method is the keyboard, but when the device senses that the user is walking, the default input method switches automatically to voice. In other words, if your phone knows you're seated at your desk, it knows you'd probably rather type. But if you're on the move, voice is safer because you won't accidentally walk into traffic while looking at your phone.
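The same behavior can be sketched in a few lines of Python. The activity labels mirror the categories common mobile activity-recognition APIs report, but the mapping itself is an invented example, not TRNQL's actual SDK:

    def default_input_method(activity):
        # Walking or driving: hands and eyes are busy, so default to voice.
        if activity in ("walking", "running", "in_vehicle"):
            return "voice"
        # Still (e.g., seated at a desk): typing is faster and more private.
        return "keyboard"

    for activity in ("still", "walking", "in_vehicle"):
        print(f"{activity}: default input = {default_input_method(activity)}")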
As our personal devices know more and more about us (where we live, where we work, when we're at the movies, when we're listening to music), they can make better decisions about how we might like to interact with them. Of course, this can mean an uncomfortable trade-off for some users between privacy and accessibility. We might appreciate the phone knowing when we're walking, but we might not want it storing information about where we're walking. Making good choices about these trade-offs is part of designing for any mobile or personal device, and the additional privacy concerns associated with voice and audio technology can complicate the decision-making process.
Just remember, privacy and accessibility concerns are serious and never easy to handle, and you could potentially cause a lot of harm by having a device announce something when it shouldn't. Understanding your users' concerns and contexts can help you to make better decisions that will protect your users' data while also making things easier for them.

Pure VUI versus Multimodal Interfaces
While early VUIs were created to automate the phone calls people were making to companies, VUIs are now showing up in things like connected home devices, mobile phones, and wearables. The interesting thing about these products is that most of them have multiple forms of input and output. Instead of relying entirely on voice in and audio out, devices might have small screens, flashing lights, companion apps, or small keyboards. These multimodal interfaces can create some fascinating design challenges.
Unfortunately, as is the case with all exciting new technologies, companies often decide to use voice input for the wrong reasons. People add voice because it's cool. They switch to NLP because they think people would prefer to just ask questions rather than be given instructions. They remove screens or buttons because they want to reduce clutter in the interface.
The only good reason to add voice as an input method or audio as an output method is that it makes the product better for the user. As with any other part of design, voice design should serve the needs of the user and solve a specific problem. Using voice and audio can be a powerful design tool, but used badly, it will only make your product worse.
"You need to determine how a conversational or voice interaction will improve the base experience for your use cases," Marco Iacono explains. "Are you simplifying something that would otherwise take six or eight taps? Are you accelerating the user deep into a task?" When you think about it, some pieces of information are really easy to say but hard to type, and vice versa. The same goes for output.
It might be convenient to say, "Give me all of the restaurants South of Market in San Francisco that serve brunch and are open now," but you probably wouldn't want Siri to read the names of all 760 out loud to you. In that case, voice input and screen output make the most sense as an interaction.

On the other hand, finding a parking structure near where you're driving might be easier by looking at Google Maps, but you'll probably want the directions read to you as you go, so you don't miss a turn. In that case, screen input coupled with voice output is very useful. If you also wear a smart watch, it might buzz on your wrist when a turn is coming up, adding haptic feedback to the mix.
Abi Jones had a very specific moment in time when she realized that she wanted alternate forms of input. "I was changing a diaper and really needed to know if what I was looking at was a diaper rash, and I didn't want to get my phone out. I realized that having an interface I didn't have to touch right then would be really helpful."
So, as a designer, how do you tackle this? How do you decide when to use different types of input or output? As Tony Sheeder explains it:
Making decisions about using voice input and audio output is very device, task, and context specific. For example, you shouldn't look at a screen in a car, so you might use a voice interface to control navigation. But you don't want to change the steering wheel to voice control, either. You want to take advantage of standards that already exist.

Consider Using Voice and Audio for...

Shared interfaces in smart homes
The most obvious use for voice input is for products that don't have screens, or at least don't have screens nearby. Home automation devices such as the Nest thermostat (Figure 1-5), which has a small screen, and Philips Hue lightbulbs, which don't have any screens, have companion apps that users can install on phones or tablets. However, voice interfaces to products like these are significantly more useful than companion apps in many ways. That's one of the reasons why the Philips Hue lightbulb already has an integration with the Amazon Alexa.

Figure 1-5. "Nest, I'm cold!"

Abi Jones explains, "Voice interfaces aren't good at distinguishing speakers, which means that, by default, voice UIs become usable for everyone in a space. Anyone can control the Amazon Alexa in the room." Having voice interfaces for smart home devices makes them more accessible to everybody, including guests, children, or people who just don't want to have to pull out their phone every time they want to turn on the lights.
Voice interfaces can also give people a way to control devices without being tethered to them. You can change or monitor the oven temperature without leaving the living room, or make sure that all the lights are off in the house from bed.
Languages that are hard to type
This one might seem perfectly obvious, but some languages are easier to type than others. Providing people with the means to speak those words rather than type them can save a tremendous amount of time.

But even if you're designing in English or another fairly standard language, don't forget that there are certain types of words or concepts that people need to input that aren't necessarily easy to represent. Mathematical formulas, musical notes, or chemical bonds are very easy to say, but they require very specific notation that can be difficult to input.
Complicated things that people can articulate
In fact, voice interfaces work well whenever we require complicated input from users that is easier to speak than to type. Suppose that you want to watch a movie tonight. You could go to your television listings or Netflix or Amazon Video and start to flip through the seven thousand choices in your queue. Or you could try to filter it based on their preset categories.
Or you could say, "Give me comedy movies without Adam Sandler, available tonight after 8 p.m., for free. Only show the ones with four or more stars." Wouldn't that be better?
Complex queries with multiple inputs and filters are often easier for users to describe in words, which is what makes Google Search so powerful, whether the user is speaking or typing. Combining that power with voice input really shines, though, when you're searching on a device like a television that doesn't have a handy keyboard.
Safety in hands- or eyes-free environments
One place where voice interfaces have already begun to be used is in cars. Whether we like it or not, people will interact with devices in cars (whether it's navigation or making a phone call or texting), and most of us would really prefer that these drivers kept their hands on the wheel and eyes on the road.
Voice input and audio output make it possible for people to interact more safely with devices. They won't prevent distracted driving, but at least they keep people from staring at a screen, and that might be the best we'll get until the cars can drive themselves.
It's not just in cars, though. Consider an operating room. Obviously we don't want our doctor texting a friend while cutting into us, but more and more devices are making their way into hospitals, and a voice interface with which a surgeon can quickly access information from a medical record while operating could save lives.
Whenever we're in a situation in which we might need to call up information or respond quickly to something without using our hands or eyes, voice interfaces and audio outputs can increase safety and efficiency.

Don't Use Voice and Audio for...

Anything requiring negotiation or a lot of variables
Although speech recognition and interpretation technology is rapidly improving, we're still a long way from being able to have a real conversation with a device. Most successful interactions are one, or at most two, sentences long. Abi Jones points out that there is a slight conversational nature to the interactions with the Amazon Alexa, but even that is limited. She says:
If you ask for a radio station that doesn't exist, it will ask you if you want to create it. But if you want to skip a song and turn up the volume, those are two completely separate actions, and they have to be performed separately.
This means that longer conversations that might involve multiple questions or negotiations are not great candidates for voice input. Scheduling a single doctor's appointment might be possible, if annoying, with a voice interface. However, scheduling a series of meetings where later ones relied on previous decisions would likely be a disaster. Yet doing that visually, by selecting dates on a calendar, could be done quite easily.
Huge amounts of input or output
Voice input and audio output can be significantly slower than text for large amounts of data, especially when most of that data will be ignored. Remember the brunch-restaurant-in-SOMA example? Although it might be easier to verbally describe the kind of place you're looking for, having all of those results read back to you would be awful and slow. Similarly, quickly selecting several items from a list is a much faster input process when you can scan the list visually and select by tapping rather than explaining to the device which items you'd like to select.
When you're dealing with large amounts of input or output, visual interfaces are almost always superior to audio or voice. This is true for most people when dealing with text input, as well. Even though some people are outstanding at dictating their thoughts, it takes quite a bit of practice to be able to accurately speak an entire email without going back and editing it later. Short texts and email responses are great for voice input, but there's a reason this report was written with a keyboard. It's just easier for most people.
Hard-to-describe input
In the previous section, we looked at languages and jargon that are easier to say than to type. There are also concepts that can be difficult to describe out loud even though they're simple to represent visually.
Although there are many ways in which a voice interface might be easier for interacting with a smart television, you almost certainly wouldn't want to change the color balance by describing it to the TV. You probably wouldn't want to change the side mirror settings on your car by talking to it, either. Just because voice input works for some interactions in a context doesn't mean that it's right for all of them.
Comparing lists of complicated things
Another area where voice interfaces fail miserably is in comparing complicated lists of items. Picture four different computers or smartphones arranged in a well-designed grid of features and prices. That's a very simple, understandable interface. Now, instead, imagine all of those items being read to you by a computer.
Even though there isn't an enormous amount of data, it still requires the user to hear and remember several different options at once, which becomes difficult very quickly, especially with no visual cues. There's a reason that stores have floor models and labels and product sheets as well as sales associates. Some information is simply easier to take in visually and shouldn't be forced into a conversation, especially one with a computer.

Successfully Combining Modes

Tony Sheeder was around when the first Dragon Mobile Assistant was being developed. The Dragon Mobile Assistant is an app designed by Nuance Communications that makes it possible for users to be more hands-free with their phones. It uses outstanding voice recognition technology to do things like set appointments, send texts, post to social media, or check the weather. In other words, it's exactly what the name implies. But despite having an eerily good understanding of what the user is asking for, the very first version had an interesting design issue.

Sheeder explains that the voice interaction design and the visual design of the phone were, by necessity, initially done separately by different groups of designers. Both groups felt that they had to handle all the input and output by themselves. It ended up having a lot of redundant information. "It would show you something on the screen and read it to you," Sheeder said. When the designers learned to work together and began relying on one another, they figured out which things made the most sense visually and which should be handled with audio. Consequently, the next versions dealt with input and output more naturally.
To successfully combine modes, it's important to understand not just when you should use voice and when you shouldn't, but how you can effectively combine voice interfaces with other methods of input and output.
There are several different types of multimodal experiences, including the most common models, which you can see in Table 1-1.

Table 1-1. Common multimodal interfaces

Input methods                                 | Output methods        | Examples
Voice, touchscreen                            | Small screen, haptic  | Smart watches
Voice, keyboard, gestures, touchscreen        | Screen, audio, haptic | Smartphones
Voice, companion app                          | Audio, companion app  | Amazon Alexa
Voice, camera/gesture                         | Screen, audio         | Xbox Kinect
Voice, scroll knob, hard buttons, touchscreen | Audio, screen         | Some car interfaces

There will, undoubtedly, be more input and output combinations over the next few years, and that means designers are going to need to pay a lot of attention to understanding which input and output methods to employ for the best usability. Rebecca Nowlin Green says that when you add more than one input or output method, "complexity just skyrockets."

There are some useful tips for figuring out which combination of input and output methods is right for your product.

Pure voice: finite state
Finite-state, pure voice interfaces are things like classic IVRs. These are the systems where you call in and hear a voice say, "What can we help you with? You can say, 'Check my balance,' 'Open an account,' 'Order new checks,' or 'Representative.'" At each point in the flow, the system only understands those specific commands.
Although they're not sexy, they're still very commonly used by companies that want to reduce call-center costs by handling common tasks and routing callers correctly for more complicated tasks.
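Because each state has a fixed list of legal utterances, a finite-state dialog can be modeled as a plain lookup table. This Python sketch uses an invented banking menu; note how the prompt doubles as the "You can say..." discoverability aid discussed earlier:

    STATES = {
        "main": {
            "prompt": "What can we help you with? You can say: Check my balance, Order new checks, or Representative.",
            "transitions": {
                "check my balance": "balance",
                "order new checks": "checks",
                "representative": "agent",
            },
        },
        "balance": {"prompt": "Your balance is $42. Anything else?", "transitions": {}},
        "checks": {"prompt": "Okay, ordering checks. Anything else?", "transitions": {}},
        "agent": {"prompt": "Transferring you now.", "transitions": {}},
    }

    def step(state, utterance):
        next_state = STATES[state]["transitions"].get(utterance.lower().strip())
        if next_state is None:
            print("Hmmm... I didn't get that.")  # out-of-grammar input goes nowhere
            return state
        print(STATES[next_state]["prompt"])
        return next_state

    state = "main"
    print(STATES[state]["prompt"])
    state = step(state, "Check my balance")   # recognized: moves to "balance"
    state = step(state, "What's my balance")  # not in this state's grammar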

When to use it? Finite-state, pure voice systems are still useful for certain systems. Because the only input and output methods are voice and audio, they're going to be handy for products that don't have a screen. This obviously includes IVR phone systems, but it could also be a physical device like a screenless wearable (Figure 1-6).
In general, you'll use a finite-state system when your product is simple enough that it's not worth going for NLP. They're useful for products for which users can be trained to do a very small number of tasks. For example, a bedside clock that lets you set alarms doesn't necessarily need a full NLP system. It just needs to understand preset commands, such as "Set alarm," that users can memorize. The same is true for the autodialer on a corporate phone system. It's not handling open-ended queries. It's just recognizing a specific list of names and directing calls.
One of the main problems with finite-state systems (and the reason so many people hate most IVRs) is that they often require users to go through a labyrinth of prompts to get to the one thing they want. If the system tries to handle too much, it can require a huge amount of investment on the user's part, only to end with having to talk to a representative or being disconnected.

Figure 1-6. When can I start talking to my Fitbit?

Simple systems that handle just a few predictable tasks that users might not know how to ask for naturally are good candidates for a pure voice, finite-state interface. For example, a car's audio system might be a good fit. There are a limited number of things you might want from it: play a song, turn up the volume, and so on. The user interacts with it daily, so they're more likely to use the same vocabulary for the commands every time. Each command is simple and discrete, so users won't get trapped. And finally, it's very easy to recognize and recover from a mistake.

Pure voice: NLP
As soon as the technology improved, many IVR systems moved to NLP. This means that, when you call a company for help, you might get a computer asking, "What can I help you with today?" after which there is a very good possibility that it will recognize what you've asked for and give it to you, as long as you use exactly the words it recognizes.
Whether or not you choose a pure voice system is, just as in the previous section, probably determined by whether or not your product has a screen. Whether you want NLP is a different question. Rebecca Nowlin Green helps companies decide what sort of voice interface is right for them. She usually recommends natural-language understanding IVRs for high call volumes with a lot of routing complexity. In other words, if there are a huge number of different things that a user might be calling about, designing a flow to quickly get users to the right destination can help skip a few layers of questions and avoid errors.
Of course, when customers call about things such as health insurance or banking, they often don't know exactly the language they need to use to get what they want. They might have difficulty explaining to a computer, "I want to know if you will pay me back for this thing that my doctor says I need to have done." But if you give them the option to make a selection like "Get preapproved for a medical procedure," they can recognize that as something that sounds right. In this case, you might want to offer some directed dialog to give the user clues as to what they can ask for. Just because you opt to let the user say anything doesn't mean that you have to leave them with no suggestions about what they might want to say.

Voice input/visual output

These days, of course, we're integrating voice input into more and more products that also have accompanying screens. In many cases, we're allowing the user to give voice input, but providing visual output rather than audio.

Texting on smart watches is an excellent example of when to use this combination. The watch faces are small enough to make any other form of inputting text next to impossible, but the screens can easily show the user the result of voice input, giving them a way to easily check the recognition and recover from any errors.

Smart televisions would also be good candidates for this sort of interface. They don't currently have an input method that lends itself to complicated input, but they certainly have enough functionality to make natural-language queries useful. Being able to say, "Show me all the times that this week's So You Think You Can Dance is being shown," would be significantly easier than searching for it by using the arrow keys on the remote control.
Thomas Hebner says one of his favorite uses of a voice interface is a popular pizza ordering app, with which the user can simply say an order out loud. It's easier to say, "Two small pizzas, one just cheese, and one with pepperoni and mushrooms," than it is to make all of those selections by tapping. The app confirms the order on the screen so that the user can verify that everything is correct, which is obviously faster and more pleasant than having the entire order read aloud.
In fact, any interface that takes open-ended, complex input that is easy for the user to speak out loud, but that produces results that would be unpleasant to have read back, is a perfect candidate for voice input paired with visual output.
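A toy Python sketch of that voice-in, screen-out shape: turn one spoken sentence into structured items, then confirm on screen instead of reading everything back. The parsing is deliberately naive and the phrase format is an invented assumption, not how any real pizza app works:

    def parse_order(utterance):
        # Expects: "two small pizzas, one just cheese, and one with pepperoni and mushrooms"
        parts = [p.strip() for p in utterance.replace(" and one", ", one").split(",")]
        return [p for p in parts if p.startswith("one ")]

    order = parse_order("two small pizzas, one just cheese, and one with pepperoni and mushrooms")
    print("Please confirm your order:")  # visual confirmation on screen
    for i, item in enumerate(order, start=1):
        print(f"  {i}. small pizza, {item.removeprefix('one ')}")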

Physical input/audio output

This particular interface combination seems unusual until you realize that it's actually the traditional input for every stereo you've ever used. Physical input could be tapping on a mobile phone screen, but it could also be physical buttons or even gestures in front of a camera.

The other versions of this are a little less common, although things like self-checkout systems in grocery stores often have spoken instructions that are separate from anything shown on the screen. Audio output can be useful for any product that will be used infrequently by large numbers of untrained people. A reassuring human voice giving instructions can help someone who otherwise might struggle with a process such as purchasing groceries, for which several tasks must be performed in a specific order.

A little of everything
Many products are moving to multimodal interfaces that combine voice and physical inputs with screen and audio output. Navigation apps might be the perfect example of a category of product that combines all of these elements well.
Users can touch places on the map, scroll around to see what's nearby, or type in an address using physical input. When driving, they can simply say the name of a destination; this way, they don't have to take their eyes off the road or hands off the wheel to change the destination. Audio output makes it possible for the mobile device to give clear navigating instructions, while the map shows turns and other information, like traffic, that would be hard to express verbally.
It's an outstanding combination of input and output methods, each one contributing to the user experience in the way best suited to it. Each input and output affordance in a well-designed navigation system takes into account the context and needs of the user. You can be hands- and eyes-free when you have to be, but still have access to the screen when it's useful.
Designs like this don't just happen. They're carefully crafted based on a deep understanding of how and when users interact with products. Navigation systems are used in cars, so voice and audio become obvious choices. Of course, not all products have such clear-cut uses, so it can be challenging to decide when voice and audio interfaces will improve the experience. Having a design framework for making those decisions can be helpful.

How Long Until Star Trek???

As Abi Jones says, "The more you do research on voice UIs, the more it makes it utterly extraordinary how easy and fluid it is to communicate with humans."

Unfortunately, this is true, and it means that things don't look good for the Star Trek computer interface in the very near future. Despite Watson's win on Jeopardy!, we're still a long way from having real conversations with our smart toasters, and honestly, that's probably for the best.

There are a few things holding us back from this future. Some of them can be solved by technology, but others might only be solved by humans growing more comfortable with the technology over time.

The Problems We Still Face
"None of this is easy. There are still some fundamental challenges with even just the basics, such as getting yes/no recognition performance," Rebecca Nowlin Green says. "But it's getting better."
Some products in 1999 had around a 65 percent recognition rate, whereas today's rates are closer to 92 percent. Nonetheless, that still means that we're talking to systems that don't understand us eight percent of the time, which can be frustrating when we're trying to accomplish a task. After all, if you haven't made their problem go away, people won't use your technology.
Both systems and humans can have trouble with things like barge-ins, the times when the computer is saying something and the user talks over it in order to skip the directions. People have trouble with interrupting because it can feel unnatural to cut off somebody who is speaking. Devices have trouble with it because they don't always catch the first part of the command, which leaves the user shouting the same command over and over at an inanimate object that keeps saying, "Hmmm... I didn't get that."
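The usual fix is to keep the recognizer listening while the prompt plays, stop the prompt the instant speech is detected, and keep the audio captured from that moment so the start of the command isn't lost. A simulated Python sketch (a real system would run a voice-activity detector on the live microphone stream):

    PROMPT_CHUNKS = [
        "To check your balance, say 'balance.'",
        "To order new checks, say 'checks.'",
        "To hear these options again, say 'repeat.'",
    ]
    BARGE_IN_AT, BUFFERED_SPEECH = 1, "balance"  # simulated: user talks during chunk 1

    def play_prompt_with_barge_in():
        for i, chunk in enumerate(PROMPT_CHUNKS):
            if i == BARGE_IN_AT:
                print("(speech detected: prompt stopped)")
                return BUFFERED_SPEECH  # audio buffered from the moment speech began
            print(f"SYSTEM: {chunk}")
        return None  # prompt finished with no interruption

    print("Recognizer received:", play_prompt_with_barge_in())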
Of course, as voice interfaces become more common, we'll see more conflicts and confusion. It's generally pretty clear when you're typing on a phone or pushing a button on your oven which device you're interacting with. That's less true of voice-activated devices. Tony Sheeder explains, "If I say 'raise the temperature to 350 degrees,' the system should know if you're talking about the oven or the iGrill and not the home thermostat. We shouldn't have a dozen different ways of interacting with all the different systems, like we do now. The same is true of smart watches and wearables. When I say 'OK Google,' I want my watch to respond, but not the four other Google watches in the room." These aren't problems that can be solved entirely by better technology. At some point, we might all have some sort of smart hub in our homes that routes our instructions to the various voice-controlled systems. We might have voice recognition for our phones, wearables, and cars. But for the immediate future, these problems will have to be solved by good, thoughtful, context-aware design and by designers who realize the potential for conflicts.
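One speculative sketch of that smart-hub idea in Python: a single listener routes each command to the device whose capabilities fit it, instead of every gadget answering at once. The devices and temperature ranges are invented for illustration:

    DEVICES = {
        "thermostat": {"min": 50, "max": 90},    # whole-home temperatures (F)
        "oven":       {"min": 150, "max": 550},  # cooking temperatures (F)
    }

    def route_temperature_command(degrees):
        for name, limits in DEVICES.items():
            if limits["min"] <= degrees <= limits["max"]:
                return f"{name}: set to {degrees} degrees"
        return "clarify: which device did you mean?"

    print(route_temperature_command(350))  # the oven, not the home thermostat
    print(route_temperature_command(72))   # the thermostat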

The biggest problem we still face may be the human one. All the experts agree that people still kind of hate voice interfaces. Abi Jones says:
When humans talk to each other, we create a shared understanding of the world through dialog. We're willing to forgive a lot. When watching user studies and interactions with computers, someone interacting with a voice interface might come into it being forgiving, but when it reveals a lack of humanity, people start treating it like a computer again.

The Future of Voice

So, what should the future look like? There are a huge number of products that could be improved in the near term. Tony Sheeder says:
TV interfaces are all awful. Making selections with a grid of letters is awful. Speech just cuts through that whole thing. Also, cars can be made a lot safer. Auto interfaces just aren't doing what they need to do.
Also, voice recognition is no longer only available to a few companies that specialize in it. Now that it's on phones and wearables, we're seeing more and more companies incorporate it into apps as a feature. The other day I ordered something from Amazon by shouting at my watch. Was it necessary? No. Was it kind of fun? Yes. Will it lead to my bankruptcy? Probably.
The distant future is harder to predict, because that's how time works. The combination of big data and improved voice recognition might get us closer to real conversations with computers sooner than we imagine. But the real opportunities lie in allowing people to interact more naturally with devices where screens and keyboards don't make sense.
Tony Sheeder thinks that there's a lot to be done with Virtual Reality and Augmented Reality:
People in gaming environments shouldn't be tied to clicking on buttons to interact with elements. Speech offers nuance and fine-grained possibilities you don't get from other interfaces.
Rebecca Nowlin Green predicts more virtual assistants and avatars and more one-on-one interactions with entities that represent companies:

Big data will get more and more relevant by tracking user behaviors over time and using that information to influence future experiences. Your favorite coffee shop will wake you up with a coffee assistant.
Abi Jones thinks that we'll get more ubiquitous interfaces that are only there when you need them. She also thinks that using voice for accessibility purposes will improve things overall:
A lot of things we had originally for accessibility, like curb cuts or elevators, are good for all of us.
But, as with any new technology, we're going to see a lot of bad interfaces in the beginning as companies learn when to use voice and designers learn how to use it to make things better for customers. As Marco Iacono points out:
When the Apple Watch launched, it took a while for developers to translate their services into this new product. Initially, some approached it as a shrinking down of their iPhone app functionality. But it didn't take long to identify the key features that are relevant for the short, snackable interactions on your wrist.
Most of the VUI designers I talked to for this report had 15 or 20 years of experience in this sort of design, so saying we're at the beginning might seem laughable to them, but it's clear that we're still very early in the process of this technology becoming mainstream. As Thomas Hebner says, "This is an incredibly exciting time in voice design. Voice designers are mostly in IVR, but with the APIs opening up and with more consumer electronics, we're on the edge of a boom. The world hasn't woken up to the idea that it needs voice design. Early on, there were some really bad IVR apps, and then it got better. We're at the beginning of that pain now. In a couple of years, people will be clamoring for voice design."

Becoming a VUI Designer

If Hebner is correct, now is an outstanding time to become a VUI designer. If you're transitioning from more traditional UX or product design, you should find it a fairly natural transition. In fact, if you're currently designing apps or working with wearables or smart home devices, you might not have much of a choice about learning the fundamentals of VUI design.

Some companies, such as Nuance Communications, have been building voice technologies for more than 25 years. Others, like Google and Amazon, are just at the beginning of incorporating voice into their products. Voice input is completely integrated into all of the smartphones and watches currently on the market, and there is still enormous room for growth.

Resources
The hardest part about becoming a VUI designer right now might be the lack of classes and resources available to new designers. If you're serious about voice design, your best bet is to get a solid grounding in User-Centered Design, good user research techniques, and information architecture, and then to begin working with a team that has some voice design experience.

Books
Nass, Clifford. Wired for Speech: How Voice Activates and Advances the Human-Computer Relationship.
Cohen, Michael H., James P. Giangola, and Jennifer Balogh. Voice User Interface Design.

Conferences and Talks

"Fundamentals of voice interface design," Tanya Kraljic (Nuance Communications), O'Reilly Design Conference, January 19-20, 2016.
"Evangelizing and Designing Voice User Interface: Adopting VUI in a GUI World," Stephen Gay and Susan Hura.

Organization
The Association for Voice Interaction Design

About the Author
Laura Klein fell in love with technology when she saw her first user research session in 1995. Since then, she's worked as an engineer, UX designer, and product manager at both startups and large companies in Silicon Valley. Her book, UX for Lean Startups (O'Reilly), and her popular design blog, Users Know, both help teams learn more about their users and apply that knowledge in order to build better products. She currently consults with companies that want to improve their research, user experience, and product development processes.
