【TED】Rupal Patel: Synthetic voices, as unique as fingerprints

I'd like to talk today
about a powerful and fundamental aspect
of who we are: our voice.
Each one of us has a unique voiceprint
that reflects our age, our size,
even our lifestyle and personality.
In the words of the poet Longfellow,
"the human voice is the organ of the soul."
As a speech scientist, I'm fascinated
by how the voice is produced,
and I have an idea for how it can be engineered.
That's what I'd like to share with you.
I'm going to start by playing you a sample
of a voice that you may recognize.
(Recording) Stephen Hawking: "I would have thought
it was fairly obvious what I meant."
Rupal Patel: That was the voice
of Professor Stephen Hawking.
What you may not know is that same voice
may also be used by this little girl
who is unable to speak
because of a neurological condition.
In fact, all of these individuals
may be using the same voice,
and that's because there's only a few options available.
In the U.S. alone, there are 2.5 million Americans
who are unable to speak,
many of whom use computerized devices
to communicate.
Now that's millions of people worldwide
who are using generic voices,
including Professor Hawking,
who uses an American-accented voice.
This lack of individuation of the synthetic voice
really hit home
when I was at an assistive technology conference
a few years ago,
and I recall walking into an exhibit hall
and seeing a little girl and a grown man
having a conversation using their devices,
different devices, but the same voice.
And I looked around and I saw this happening
all around me, literally hundreds of individuals
using a handful of voices,
voices that didn't fit their bodies
or their personalities.
We wouldn't dream of fitting a little girl
with the prosthetic limb of a grown man.
So why then the same prosthetic voice?
It really struck me,
and I wanted to do something about this.
I'm going to play you now a sample
of someone who has, two people actually,
who have severe speech disorders.
I want you to take a listen to how they sound.
They're saying the same utterance.
(First voice)
(Second voice)
You probably didn't understand what they said,
but I hope that you heard
their unique vocal identities.
So what I wanted to do next is,
I wanted to find out how we could harness
these residual vocal abilities
and build a technology
that could be customized for them,
voices that could be customized for them.
So I reached out to my collaborator, Tim Bunnell.
Dr. Bunnell is an expert in speech synthesis,
and what he'd been doing is building
personalized voices for people
by putting together
pre-recorded samples of their voice
and reconstructing a voice for them.
These are people who had lost their voice
later in life.
We didn't have the luxury
of pre-recorded samples of speech
for those born with speech disorder.
But I thought, there had to be a way
to reverse engineer a voice
from whatever little is left over.
So we decided to do exactly that.
We set out with a little bit of funding from the National Science Foundation,
to create custom-crafted voices that captured
their unique vocal identities.
We call this project VocaliD, or vocal I.D.,
for vocal identity.
Now before I get into the details of how
the voice is made and let you listen to it,
I need to give you a real quick speech science lesson. Okay?
So first, we know that the voice is changing
dramatically over the course of development.
Children sound different from teens
who sound different from adults.
We've all experienced this.
Fact number two is that speech
is a combination of the source,
which is the vibrations generated by your voice box,
which are then pushed through
the rest of the vocal tract.
These are the chambers of your head and neck
that vibrate,
and they actually filter that source sound
to produce consonants and vowels.
So the combination of source and filter
is how we produce speech.
And that happens in one individual.
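The source-filter idea described above can be sketched in a few lines of Python. This is a toy illustration, not anything shown in the talk: the function names, the 16 kHz sample rate, and the formant values are all assumptions chosen for the example. The source is a train of pulses at the speaker's pitch, and the filter is a chain of simple resonators standing in for the chambers of the vocal tract.

```python
import math

def impulse_train(f0_hz, duration_s, sample_rate=16000):
    """Source: one pulse per cycle, a crude stand-in for voice-box vibration."""
    n = int(round(duration_s * sample_rate))
    period = int(round(sample_rate / f0_hz))  # samples between pulses
    return [1.0 if i % period == 0 else 0.0 for i in range(n)]

def resonator(signal, center_hz, bandwidth_hz, sample_rate=16000):
    """Filter: a two-pole resonator modelling one vocal-tract resonance."""
    r = math.exp(-math.pi * bandwidth_hz / sample_rate)  # pole radius, below 1
    theta = 2 * math.pi * center_hz / sample_rate        # pole angle
    a1, a2 = 2 * r * math.cos(theta), -r * r
    out, y1, y2 = [], 0.0, 0.0
    for x in signal:
        y = x + a1 * y1 + a2 * y2
        out.append(y)
        y2, y1 = y1, y
    return out

def synthesize_vowel(f0_hz, formants, duration_s=0.5, sample_rate=16000):
    """Push the source through each resonance in turn (source, then filter)."""
    signal = impulse_train(f0_hz, duration_s, sample_rate)
    for center_hz, bandwidth_hz in formants:
        signal = resonator(signal, center_hz, bandwidth_hz, sample_rate)
    return signal

# Illustrative values only: a 120 Hz pitch and two vowel-like resonances.
vowel = synthesize_vowel(120, [(700, 80), (1200, 90)], duration_s=0.1)
```

Changing `f0_hz` alters only the source, while changing `formants` alters only the filter, which is the separation the talk relies on.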
Now I told you earlier that I'd spent
a good part of my career
understanding and studying
the source characteristics of people
with severe speech disorder,
and what I've found
is that even though their filters were impaired,
they were able to modulate their source:
the pitch, the loudness, the tempo of their voice.
These are called prosody, and I've been documenting for years
that the prosodic abilities of these individuals
are preserved.
So when I realized that those same cues
are also important for speaker identity,
I had this idea.
Why don't we take the source
from the person we want the voice to sound like,
because it's preserved,
and borrow the filter
from someone about the same age and size,
because they can articulate speech,
and then mix them?
Because when we mix them,
we can get a voice that's as clear
as our surrogate talker --
that's the person we borrowed the filter from --
and is similar in identity to our target talker.
It's that simple.
That's the science behind what we're doing.
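As a rough sketch of that mixing step, the Python below picks a surrogate of about the same age and size and then pairs the target's source with the surrogate's filter. Everything here is hypothetical: the `Talker` fields, the use of height as a proxy for size, and the mismatch score are assumptions made for illustration, not the project's actual matching method.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Talker:
    name: str
    age_years: float
    height_cm: float   # assumed proxy for "size"
    pitch_hz: float    # one source cue; loudness and tempo omitted for brevity

def pick_surrogate(target, donors):
    """Choose the donor closest to the target in age and size."""
    def mismatch(donor):
        # Relative gaps, so years and centimetres are on a comparable scale.
        age_gap = abs(donor.age_years - target.age_years) / target.age_years
        size_gap = abs(donor.height_cm - target.height_cm) / target.height_cm
        return age_gap + size_gap
    return min(donors, key=mismatch)

def blend(target, surrogate):
    """Keep the target's source; borrow the surrogate's clear articulation."""
    return {
        "identity_of": target.name,
        "source_pitch_hz": target.pitch_hz,
        "filter_from": surrogate.name,
    }

target = Talker("target", age_years=10, height_cm=140, pitch_hz=260)
donors = [
    Talker("adult donor", age_years=40, height_cm=180, pitch_hz=120),
    Talker("child donor", age_years=11, height_cm=142, pitch_hz=250),
]
voice = blend(target, pick_surrogate(target, donors))
```

In this example the child donor is selected, and the blended voice keeps the target's 260 Hz pitch.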
So once you have that in mind,
how do you go about building this voice?
Well, you have to find someone
who is willing to be a surrogate.
It's not such an ominous thing.
Being a surrogate donor
only requires you to say a few hundred
to a few thousand utterances.
The process goes something like this.
(Video) Voice: Things happen in pairs.
I love to sleep.
The sky is blue without clouds.
RP: Now she's going to go on like this
for about three to four hours,
and the idea is not for her to say everything
that the target is going to want to say,
but the idea is to cover all the different combinations
of the sounds that occur in the language.
The more speech you have,
the better sounding voice you're going to have.
Once you have those recordings,
what we need to do
is we have to parse these recordings
into little snippets of speech,
one- or two-sound combinations,
sometimes even whole words
that start populating a dataset or a database.
We're going to call this database a voice bank.
Now the power of the voice bank
is that from this voice bank,
we can now say any new utterance,
like, "I love chocolate" --
everyone needs to be able to say that --
fish through that database
and find all the segments necessary
to say that utterance.
(Video) Voice: I love chocolate.
RP: So that's speech synthesis.
It's called concatenative synthesis, and that's what we're using.
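A minimal sketch of that lookup, assuming the voice bank is a dictionary keyed by short sequences of sound labels. The labels, file names, and greedy longest-match rule are illustrative assumptions; production systems typically choose units by scoring how well neighboring snippets join, which this example does not attempt.

```python
def build_voice_bank(recordings):
    """Map each unit (a tuple of sound labels) to its recorded snippet."""
    bank = {}
    for labels, snippet in recordings:
        bank.setdefault(tuple(labels), snippet)  # keep the first recording
    return bank

def select_units(target_labels, bank, max_unit_len=3):
    """Cover the utterance left to right, preferring the longest stored unit."""
    units, i = [], 0
    while i < len(target_labels):
        longest = min(max_unit_len, len(target_labels) - i)
        for length in range(longest, 0, -1):
            key = tuple(target_labels[i:i + length])
            if key in bank:
                units.append(bank[key])
                i += length
                break
        else:
            raise KeyError(f"no recorded unit covers {target_labels[i]!r}")
    return units

# Hypothetical snippets cut from a surrogate's recordings.
bank = build_voice_bank([
    (["ay"], "ay.wav"),
    (["l", "ah"], "l_ah.wav"),
    (["v"], "v.wav"),
    (["ch", "aa", "k"], "ch_aa_k.wav"),
    (["l", "ah", "t"], "l_ah_t.wav"),
])

# "I love chocolate" as a rough sequence of sound labels.
utterance = ["ay", "l", "ah", "v", "ch", "aa", "k", "l", "ah", "t"]
snippets = select_units(utterance, bank)
# snippets == ["ay.wav", "l_ah.wav", "v.wav", "ch_aa_k.wav", "l_ah_t.wav"]
```

Playing the selected snippets back to back yields the new utterance, even though the donor never recorded that sentence.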
That's not the novel part.
What's novel is how we make it sound
like this young woman.
This is Samantha.
I met her when she was nine,
and since then, my team and I
have been trying to build her a personalized voice.
We first had to find a surrogate donor,
and then we had to have Samantha
produce some utterances.
What she can produce are mostly vowel-like sounds,
but that's enough for us to extract
her source characteristics.
What happens next is best described
by my daughter's analogy. She's six.
She calls it mixing colors to paint voices.
It's beautiful. It's exactly that.
Samantha's voice is like a concentrated sample
of red food dye which we can infuse
into the recordings of her surrogate
to get a pink voice just like this.
(Video) Samantha: Aaaaaah.
RP: So now, Samantha can say this.
(Video) Samantha: This voice is only for me.
I can't wait to use my new voice with my friends.
RP: Thank you. (Applause)
I'll never forget the gentle smile
that spread across her face
when she heard that voice for the first time.
Now there's millions of people
around the world like Samantha, millions,
and we've only begun to scratch the surface.
What we've done so far is we have
a few surrogate talkers from around the U.S.
who have donated their voices,
and we have been using those
to build our first few personalized voices.
But there's so much more work to be done.
For Samantha, her surrogate
came from somewhere in the Midwest, a stranger
who gave her the gift of voice.
And as a scientist, I'm so excited
to take this work out of the laboratory
and finally into the real world
so it can have real-world impact.
What I want to share with you next
is how I envision taking this work
to that next level.
I imagine a whole world of surrogate donors
from all walks of life, different sizes, different ages,
coming together in this voice drive
to give people voices
that are as colorful as their personalities.
To do that as a first step,
we've put together this website, VocaliD.org,
as a way to bring together those
who want to join us as voice donors,
as expertise donors,
in whatever way to make this vision a reality.
They say that giving blood can save lives.
Well, giving your voice can change lives.
All we need is a few hours of speech
from our surrogate talker,
and as little as a vowel from our target talker,
to create a unique vocal identity.
So that's the science behind what we're doing.
I want to end by circling back to the human side
that is really the inspiration for this work.
About five years ago, we built our very first voice
for a little boy named William.
When his mom first heard this voice,
she said, "This is what William
would have sounded like
had he been able to speak."
And then I saw William typing a message
on his device.
I wondered, what was he thinking?
Imagine carrying around someone else's voice
for nine years
and finally finding your own voice.
Imagine that.
This is what William said:
"Never heard me before."
Thank you.
(Applause)