Category Archives: Technology

iPhone 4S: speech advances but there’s more to do

By Kimberly Patch

Apple’s iPhone 4S has taken a couple of nice big steps toward adding practical speech to smart phones. There are still some big gaps, mind you. I’ll get to those as well.

Speech on the keyboard

The long-awaited speech button is now part of the keyboard. Everywhere there’s a keyboard you can dictate rather than type. This is far better than having to use an app to dictate, then cut and paste into applications. This is one of the big steps. It will make life much easier for people who have trouble using the keyboard. And I suspect a large contingent of others will find themselves dictating into the iPhone a good amount of the time, increasingly reserving the keyboard for situations where they don’t want to be overheard.

The key question about speech on the keyboard is how it works beyond the letter keys and straight dictation.
For instance, after you type
“Great! I’ll meet you at the usual place (pool cue at the ready) at 6:30.”
how easy is it to change what you said to something like this?
“Excellent :-) I’ll meet you at the usual place (pool cue at the ready) at 7:00.”
And then how easy is it to go back to the original if you change your mind again?
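
To make that question concrete, here’s a minimal sketch of what a “change X to Y” correction has to keep track of, including the earlier version of the text so you can change your mind again. This is purely illustrative: the class and method names are hypothetical, not anything Apple ships.

```python
# Hypothetical sketch only: iOS does not expose dictation editing this way.
# The point is what a "change X to Y" correction has to keep track of.

class Draft:
    def __init__(self, text):
        self.text = text
        self.history = []            # earlier versions, so you can change your mind

    def change(self, old, new):
        """Replace the first occurrence of `old` with `new`, remembering the original."""
        if old not in self.text:
            return False             # nothing to correct; a real UI would ask what you meant
        self.history.append(self.text)
        self.text = self.text.replace(old, new, 1)
        return True

    def undo(self):
        """Go back to the previous version, if there is one."""
        if self.history:
            self.text = self.history.pop()

draft = Draft("Great! I'll meet you at the usual place (pool cue at the ready) at 6:30.")
draft.change("Great!", "Excellent :-)")
draft.change("6:30", "7:00")
print(draft.text)                    # Excellent :-) I'll meet you ... at 7:00.
draft.undo()                         # changed your mind: back to 6:30
```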

Speech assistant

After we all use the speech assistant for a couple of days or weeks it’ll become readily apparent where Siri lies on the very-useful-to-very-annoying continuum.

The key parameters are
– how much time Siri saves you
– how a particular type of Siri audio feedback hits you the 10th time you’ve heard it
– how physically and cognitively easy it is to switch between the assistant and whatever you have to do with your hands on the phone.

One thing that has the potential to tame the annoyance factor is giving users some control over the feedback.

I think the tricky thing about computer-human feedback is it’s inherently different from human-human feedback. One difference is the computer has no feelings and we know that. Good computer-human feedback isn’t necessarily the same as good human-human feedback.

The big gap

There’s still a big speech gap on the iPhone. Speech is still just a partial interface.

Picture sitting in an office with a desktop computer and a human assistant. Type anything you want using the letter keys on your keyboard or ask the assistant to do things for you. You could get a fair amount of work done this way, but there’d still be situations where you’d want to control your computer directly using keyboard shortcuts, arrow keys or the mouse. Partial interfaces have a high annoyance factor.

Even if you use a mix of speech, keyboard and gesture, true efficiencies will emerge only when you’re able to choose the method of input based on what you want to do rather than on what happens to be available.

Ultimately, I want to be able to completely control my phone by speech. And I suspect if we figure out how to do that, then make it available for everyone, the general mix of input will become more efficient.

I’d like to see the computer industry tap folks who have to use speech recognition as testers. I think this would push speech input into practical use more quickly and cut out some of the annoyance-factor growing pains.

What do you think? Let me know at Kim@ this domain name.

Speak quietly and you don’t need a big stick


Computers can be frustrating when they don’t act as you anticipate. You can anticipate better, however, if you know what they’re going through.

One thing I’ve found with speech recognition is that when something is misrecognized, the common instinct is to speak more loudly into the microphone — like you’re talking to a person who didn’t quite hear.

This usually makes things worse. Most good microphones are fairly sensitive, and will pick up the sounds coming out of your mouth more accurately if you speak a little softer rather than louder.

If you notice recognition falling off a bit, speak a little more softly. Picture someone onstage singing so loudly into the microphone that it’s distorted, and how nice it is when the person backs off a bit. Also think about putting the microphone further away from your mouth — I keep mine about 2 inches away.
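
The distortion analogy can be made concrete. A too-loud signal clips at the maximum value the hardware can represent, and the detail that was in the peaks is gone before the recognizer ever sees it. Here is a rough sketch of the idea, assuming ordinary 16-bit audio samples; this isn’t any speech engine’s actual code, just an illustration.

```python
# Rough illustration only, not any speech engine's actual code.
# 16-bit audio clips at +/-32767; clipped samples have lost the detail
# the recognizer needs, which is why backing off a bit helps.

def clipping_ratio(samples, limit=32767):
    """Fraction of samples pinned at (or beyond) the maximum amplitude."""
    if not samples:
        return 0.0
    clipped = sum(1 for s in samples if abs(s) >= limit)
    return clipped / len(samples)

quiet_signal = [-15000, -5000, 0, 5000, 15000]                         # plenty of headroom
loud_signal = [max(-32767, min(32767, s * 3)) for s in quiet_signal]   # same speech, too hot

print(clipping_ratio(quiet_signal))  # 0.0 -- clean, easier to recognize
print(clipping_ratio(loud_signal))   # 0.4 -- flattened peaks, harder to recognize
```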

Another common cause of recognition falling off is breathing into the microphone. It’s important that the face of the microphone is pointed slightly downward so you’re not breathing directly into it. Picture someone on an outside stage trying to use a microphone in a heavy wind. It’s lots easier to hear when the wind dies down.

Speech recognition and Eastern equine encephalitis


I have a bone to pick with Nuance. I’ve several times seen Dragon NaturallySpeaking demonstrators wow people by saying a long phrase. “Eastern equine encephalitis” is a favorite. The implication is if computer speech recognition can get this difficult phrase right, it can get anything right.

The reality is just the opposite, and the demonstration gives people an incorrect mental map of how the speech engine works.

It’s important to have a good mental map of how something works. If your mental map is correct your instincts will be correct. If you’re working with a child you probably have an idea of the types of simple mistakes that child is going to make, and you’ll expect and have more patience for simple mistakes than when you’re working with an expert.

The NaturallySpeaking speech engine is different from working with either a child or an expert — it’s very good at some things, but not so good at others. The mix is different from what it is with people. NaturallySpeaking is very good at identifying long words and even better at identifying common phrases — Eastern equine encephalitis is both and therefore very easy. It will rarely get this wrong. What’s more difficult for the engine is getting short utterances and uncommon phrases right. If you give the speech engine more to work with — a longer word, a phrase, or even the same word drawn out a bit — it has more information and therefore does better.

A more impressive demo phrase for a speech engine would be “at up be”.
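
One way to picture why that is: the engine scores whole phrases against a language model, and a long, common phrase has far more context to lean on than a string of short, interchangeable words. The toy scoring below shows the shape of the argument. The counts are invented for illustration; real engines combine far richer acoustic and language models.

```python
# Toy illustration: why long, common phrases are easy and short, uncommon
# strings are hard. The counts below are invented; a real engine uses far
# richer acoustic and language models, but the shape is the same.

bigram_counts = {
    ("eastern", "equine"): 50,
    ("equine", "encephalitis"): 60,
    ("at", "up"): 1,
    ("up", "be"): 1,
}
CORPUS_SIZE = 1_000_000  # pretend number of observed bigrams

def phrase_score(words):
    """Multiply smoothed bigram probabilities across the phrase."""
    score = 1.0
    for pair in zip(words, words[1:]):
        score *= (bigram_counts.get(pair, 0) + 1) / CORPUS_SIZE  # add-one smoothing
    return score

print(phrase_score("eastern equine encephalitis".split()))  # ~3e-09: strongly predicted
print(phrase_score("at up be".split()))                     # ~4e-12: little context to lean on
```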

With the correct mental map of what’s easy and what’s difficult for the speech-recognition engine, you’ll instinctively speak in phrases and draw things out a bit if you see the engine start to make mistakes. This is probably different from how you tend to adjust to a person who isn’t hearing you well. With a person, a common instinct is to say one word at a time: “Eastern… equine… encephalitis”, which is more difficult for a speech engine.

The good news is a mental map works on instinct — if your mental map is correct, you often don’t even have to think about adjustments, they flow naturally. The bad news is a mental map works on instinct — if it’s incorrect your adjustments won’t work but it will feel like they should be working.

Research Watch: What you see changes what you hear


Who says looks don’t matter?

It looks like what you see changes what you hear. Researchers from Haskins Laboratories and MIT have found that different facial expressions alter the sounds we hear.

This shows that the somatosensory system — the mix of senses and brain filtering that determines how you perceive your body — is involved when you process speech.

This doesn’t have a whole lot to do with speech commands except to show that it’s easy to underestimate the complexity, and subtlety, of our perception of spoken language.

Resources:

Somatosensory function in speech perception
www.pnas.org/cgi/doi/10.1073/pnas.0810063106

ATIA and training


The Assistive Technology Industry Association (ATIA) 2009 Orlando show starts this week. We’ll be doing a poster session at the show (come see us Thursday, 3:30 to 5:30 in the Front Hall).


As part of ATIA we’re offering free Utter Command training sessions to the first 50 people who buy Utter Command during the show, which ends Saturday. UC training includes an assessment, custom lesson plan and remote training session. We use GoToMeeting for the remote training session. GoToMeeting lets you share what’s on your computer screen over the Internet, so we can see your screen and you can see ours.

Talking to your telephone vs. talking to your computer

The SpeechTEK speech conference has a lot to say about the state of the desktop speech interface. The exhibits in 2006 and 2007 were largely about where all the speech interface action is these days — not on the desktop, but over the telephone with interactive voice response (IVR) systems.

I went to several sessions aimed at the voice user interface designers (Vuids) who construct telephone speech command interfaces (even though I’m something of an imposter as a desktop voice user interface designer — I guess Dvuid would be the appropriate term).

We’re dealing with a lot of the same issues, though often with different twists:

  • Making sure people know what to say and stay oriented in the system
  • Accommodating beginners and experienced users
  • Making the process as fast and efficient as possible so people won’t hit the operator button or hang up (or not use the software — many people who buy desktop speech recognition software end up not using it)
  • In both cases the communications relationship is between a person and a machine

And we’re looking at similar answers:

  • Making commands consistent
  • Avoiding ambiguity
  • Doing user testing
  • Thinking about configuring information in a certain order to make it more memorable (good mental maps and appropriate training wheels)
  • And above all, avoiding the trap of thinking that people can just say anything, because even if you truly could say anything you still wouldn’t know what to say (the sketch after this list shows one way to keep the command set predictable instead)
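
As a hypothetical illustration of those points, here is a tiny command grammar in which every command has one consistent shape and the system admits what it doesn’t understand rather than guessing. None of this is Utter Command’s (or any vendor’s) actual code; it’s only meant to make the design principles concrete.

```python
# Hypothetical sketch, not Utter Command's (or any vendor's) actual grammar.
# The design points: every command has the same consistent shape, no two
# phrases mean the same thing, and unknown input gets an honest answer
# instead of a guess.

COMMANDS = {
    "go home":        lambda: print("navigating home"),
    "go back":        lambda: print("going back one step"),
    "read message":   lambda: print("reading the current message"),
    "delete message": lambda: print("deleting the current message"),
}

def handle(utterance):
    action = COMMANDS.get(utterance.strip().lower())
    if action is None:
        # Don't pretend the user can say anything: tell them what works.
        print("Unknown command. Try one of:", ", ".join(sorted(COMMANDS)))
        return
    action()

handle("Go home")        # consistent verb-object form: recognized
handle("take me home")   # pseudo natural language: rejected, not guessed at
```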

I’ve also been thinking about the differences between IVR and the desktop speech interface — differences that make some of these challenges harder for one system and easier for the other.

  • Desktop users tend to follow a more predictable curve — they either get more experienced or they drop it. Some IVR systems, by contrast, get mostly occasional users.
  • People are more often forced to use IVR, while most people can easily avoid the desktop speech interface if they wish.
  • The desktop is capable of both visual and audio feedback, while IVR systems tend to only have audio feedback. (Interestingly, even though most speech engines come with the ability to speak, desktop computer interfaces generally don’t use this feedback channel. We’ve had positive results in user testing of judicious use of audio feedback.)
  • Both systems suffer from the widespread use of pseudo natural language. Natural language doesn’t really exist on either type of system and trying to fake natural language creates its own problems.