Voice technology is one of the biggest trends in the healthcare space. We look at how it might help care providers and patients, from a woman who is losing her speech, to documenting healthcare records for doctors. But how do you teach AI to learn to communicate more like a human, and will it lead to more efficient machines?
- Kenneth Harper, VP & GM, Healthcare Virtual Assistants and Ambient Clinical Intelligence at Nuance
- Bob MacDonald, Technical Program Manager, Project Euphonia, Google
- Julie Cattiau, Product Manager, Project Euphonia, Google
- Andrea Peet, Project Euphonia user
- David Peet, Attorney, husband of Andrea Peet
- Ryan Steelberg, President and Co-founder, Veritone Inc.
- Hod Lipson, Professor of Innovation in the Department of Mechanical Engineering; Co-Director, Maker Space Facility, Columbia University.
- The Exam of the Future Has Arrived – via YouTube
This episode was reported and produced by Anthony Green with help from Jennifer Strong and Emma Cillekens. It was edited by Michael Reilly. Our mix engineer is Garret Lang and our theme music is by Jacob Gorski.
Jennifer: Healthcare looks a little different than it did not so long ago…when your doctor likely wrote down details about your condition on a piece of paper…
The explosion of health tech has taken us all sorts of places… digitized records, telehealth, AI that can read x-rays and other scans better than people, and just medical advancements that would have sounded like science fiction until pretty recently.
We’re at a stage where it’s safe to say healthcare is Silicon Valley’s next battleground… with all the biggest names in tech jockeying for position.
And squarely placed among the biggest trends in this space… is voice technology… and how it might help care providers and patients.
Like helping a woman who is rapidly losing her speech to communicate with smart devices in her home.
Andrea Peet: My smartphone can understand me.
Jennifer: Or… a doctor who wants to focus on patients, and let technology do the record keeping.
Clinician: Hey Dragon, start my standard order set for arthritis pain.
Jennifer: Voice could also change how AI systems learn… by replacing the 1’s and 0’s in training data with an approach that more closely mirrors how children are taught.
Hod Lipson: We humans, we don’t think in words. We think in sounds. It’s a somewhat controversial idea, but I have a hunch and there’s no data for this, that early humans communicated with sounds way before they communicated with words.
Jennifer: I’m Jennifer Strong and, this episode, we explore how AI voice technology can make us feel more human… and how teaching AI to learn to communicate a little more like a human might lead to more efficient machines.
OC:…you have reached your destination.
Ken Harper: In healthcare specifically, there’s been a major problem over the last decade as they’ve adopted the electronic health systems. Everything’s been digitized, but it has come with a cost in that you’re spending lots and lots of time actually documenting care.
Ken Harper: So, I’m Ken Harper. I am the general manager of the Dragon Ambient Experience, or DAX as we like to refer to it. And what DAX is, it’s an ambient capability where we will listen to a provider and patient having a natural conversation with one another. And based on that natural conversation, we will convert that into a high quality clinical note on behalf of the physician.
Jennifer: DAX is AI-powered… and it was designed by Nuance, a voice recognition company owned by Microsoft. Nuance is one of the world’s leading players in the field of natural language processing. Its technology is the backbone of Apple’s voice assistant, Siri. Microsoft paid nearly 20 billion dollars for Nuance earlier this year, primarily for its healthcare tech. It was the second most expensive acquisition in Microsoft’s history… after LinkedIn.
Ken Harper: We’ve probably all experienced a scenario where we go see our primary care provider or maybe a specialist for some issue that we’re having. And instead of the provider looking at us during the encounter, they’re on their computer typing away. And what they’re doing is they’re actually creating the clinical note of why you’re in that day. What’s their diagnosis? What’s their assessment? And it creates an impersonal experience where you don’t feel as connected. You don’t feel as though the provider is actually focusing on you.
Jennifer: The goal is to pass this administrative work off to a machine. His system records everything that’s being spoken, transcribes it, and tags it based on individual speakers.
Ken Harper: And then we take it a step further. So this is not just speech recognition. You know, this is actually natural language understanding where we will take the context of what’s in that transcription, that context of what was discussed, our knowledge of what’s medically relevant, and also what’s not medically relevant. And we will write a clinical note based on some of those key inputs that were in the recording.
Jennifer: Under the hood, DAX uses deep learning—which is heavily dependent on data. The system is trained on a number of different interactions between patients and physicians— and their medical specialties.
Ken Harper: So the macro view is how you get an AI model that understands by specialty generally, what needs to be documented. But then on top of that, there’s a lot of adaptation at the micro view, which is at the user level, which is looking at an individual provider. And as that provider uses DAX for more and more of their encounters, DAX will get that much better at documenting accurately and comprehensively for that individual provider.
Jennifer: And it does the processing… in real time.
Ken Harper: So if we know that a heart murmur is being discussed, and here’s the information about the patient on their history, this could enable a lot of systems to provide decision support or evidence-based support back to the care team on something that maybe they should consider doing from a treatment perspective or maybe something else they should be asking about and doing triage on. The long-term potential is you understand context. You understand the signal of what’s actually being discussed. And the amount of innovation that can happen, once that input is known, it’s never been done before in healthcare. Everything in healthcare has always been retrospective or you put something into an electronic health record and then some alert goes off. If we could actually bring that intelligence into the conversation where we know something needs to be flagged or something needs to be discussed, or there’s a suggestion that needs to be surfaced to the provider, that’s just going to open up a whole new set of capabilities for care teams.
Julie Cattiau: Unfortunately those voice-enabled technologies don’t always work well today for people who have speech impairments. So that’s the gap that we were really interested in filling and addressing. And so what we believe is that making voice-enabled assistive technology more accessible can help people who have these kinds of conditions be more independent in their daily lives.
Julie Cattiau: Hi, my name is Julie Cattiau. I’m a product manager in Google Research. And for the past three years, I’ve been working on Project Euphonia, whose goal is to make speech recognition work better for people who have speech disabilities.
Julie Cattiau: So the way that technology works is that we are personalizing the speech recognition models for individuals who have speech impairments. So in order for our technology to work, we need individuals who have trouble being understood by others to record a certain number of phrases. And then we use those speech samples as examples to train our machine learning model to better understand the way they speak.
Jennifer: The project started in 2018, when Google began working with a non-profit seeking a cure for ALS. It’s a progressive nervous system disease that affects nerve cells in the brain and the spinal cord, often leading to speech impediments.
Julie Cattiau: One of their projects is to record a lot of data from people who have ALS in order to study the disease. And as part of this program, they were actually recording speech samples from people who have ALS to see how the disease impacts their speech over time. So Google had a collaboration with ALS TDI to see if we could use machine learning to detect ALS early. But some of our research scientists at Google, when they listened to those speech samples, asked themselves the question: could we do more with those recordings? And instead of just trying to detect whether someone has ALS, could we also help them communicate more easily by automatically transcribing what they’re saying? We started this work from scratch and since 2019, about a thousand different people, individuals with speech impairments, have recorded over a million utterances for this research initiative.
Andrea Peet: My name is Andrea Peet and I was diagnosed with ALS in 2014. I run a non-profit.
David Peet: And my name is David Peet. I’m Andrea’s husband. I’m an attorney for my day job, but my passion is helping Andrea run the foundation, the Team Drea Foundation, to end ALS through innovative research.
Jennifer: Andrea Peet started to notice something was off in 2014… when she kept tripping over her own toes during a triathlon.
Andrea Peet: So I started going to neurologists and it took about eight months. But I was diagnosed with ALS which typically has a lifespan of two to five years and so I am doing amazingly well, that I’m still alive and, talking and walking, with a walker, seven years later.
David Peet: Yeah, I second, everything you said about really just feeling lucky. Um, that’s probably the best, the best word for it. When we received the diagnosis and I’d started doing research that two to five years was really the average, we knew from that diagnosis date in 2014, we would be lucky to have anything after May 29th, 2019. And so to be here and to still see Andrea competing in marathons and out in the world and participating in podcasts like this one, it’s a real blessing.
Jennifer: One of the major challenges of this disease is that it affects people in very different ways. Some lose motor control of their hands and can’t lift their arms, but would still be able to give a speech. Others can still move their limbs but have difficulty speaking or swallowing… as is the case here.
Andrea Peet: People can understand me most of the time. But when I am tired or when I am in a loud place, it is harder for me to, uh, um…
David Peet: It’s harder for you to pronounce, is it?
Andrea Peet: To project…
David Peet: Ahh, to pronounce and project words.
Andrea Peet: So Project Euphonia basically live-captions what I’m saying on my phone so people can read along with what I am saying. And it’s really helpful when I am giving presentations.
David Peet: Yeah, it’s really helpful when you’re giving a presentation or when you are out speaking publicly to have a platform that captures in real time the words that Andrea is saying so that she can project them out to those that are listening. And then the other huge help for us is that Euphonia syncs up what’s being captioned to our Google Home, right? And so having a smart home that can understand Andrea and then allow her different functionality at home really gives her more freedom and autonomy than she otherwise would have. She can turn the lights on, turn the lights off. She can open the front door for someone who’s there. So, being able to have a technology that enables them to function using only their voice is really essential to allowing them to feel human, right? Continue to feel like a person and not like a patient that needs to be waited on 24 hours a day.
Bob MacDonald: I didn’t come into this with a professional speech or language background. I actually became involved because I heard that this team was working on technologies that were inspired by people with ALS and my sister’s husband had passed away from ALS. And so I knew how profoundly helpful that would be if we could make tools that would help ease communication.
Jennifer: Bob MacDonald also works at Google. He’s a technical program manager on Project Euphonia.
Bob MacDonald: A big focus of our effort has been improving speech recognition models by personalizing them. Partly because that’s what our early research has found gives you the best accuracy boost. And you know, it’s not surprising that if you use speech samples from just one person, you can kind of fine-tune the system to understand that one person a lot better. For someone who doesn’t sound exactly like them, the improvements tend to get washed out. But then as you think about, well, even for one person, if their voice is changing over time, because the disease is progressing or they’re aging, or there’s some other issue that’s going on. Maybe even they’re wearing a mask or there’s some temporary factor that’s modulating their voice, then that will definitely degrade the accuracy. The open question is how robust are these models to those kinds of changes. And that’s very much one of the other frontiers of our research that we’re pursuing right now.
Jennifer: Speech recognition systems are largely trained on Western, English-speaking voices. So it’s not just people with medical conditions who have a hard time being understood by this tech… it’s also challenging for those with accents and dialects.
Bob MacDonald: So the challenge really is going to be how do we make sure that that gap in performance doesn’t remain wide or get wider as we span larger population segments and really try to maintain a useful level of performance. And that all gets even harder as we move away from the primary languages that are used in products that most commonly have these speech recognizers embedded. So as you move to countries or parts of countries where languages have fewer speakers, the data becomes even harder to come by. And so it’s going to require just a bigger push to make sure that we maintain that kind of a reasonable level of equity.
Jennifer: Even if we’re able to solve the speech diversity problem, there’s still the issue of the massive amounts of training data needed to build reliable, universal systems.
But what if there was another way—one that takes a page from how humans learn?
That’s after the break.
Hod Lipson: Hi. My name is Hod Lipson. I’m a roboticist. I’m a professor of engineering and data science at Columbia University in New York. And I study robots, how to build them, how to program them, how to make them smarter.
Hod Lipson: Traditionally, if you look at how AI is trained, we give very concise labels to things and then we train an AI to predict one for a cat, two for a dog. This is how all the deep learning networks today are being trained, with these very, very compacted labels.
Hod Lipson: Now, if you look at the way humans learn, it looks very different. When I show my child pictures of dogs, or I show them our dog or a dog, other people’s dogs walking outside, I don’t just give them one bit of information. I actually enunciate the word “dog.” I might even say dog in different tones and I might do all kinds of things. So I give them a lot of information when I label the dog. And that got me to think that maybe we are teaching computers in the wrong way. So we said, okay, let’s do this crazy experiment where we are going to train computers to recognize cats and dogs and other things, but we’re going to label it not with the one and the zero, but with a whole audio file. In other words, the computer needs to be able to say, articulate, the word “dog.” The whole audio file. Every time it sees a dog. It’s not enough for it to say, you know, thumbs up for dog, thumbs down for cat. You actually have to articulate the whole thing.
Jennifer: To the surprise of him and his team… it worked. It identified images just as well as using ones and zeros.
Hod Lipson: But then we noticed something very, very interesting. We noticed that it could learn the same thing with a lot less information. In other words, it would get the same quality of result, but it’s seeing about a tenth of the data. And that is in itself very, very valuable, but we also noticed something that’s potentially even more interesting, which is that when it learned to distinguish between a cat and a dog, it learned it in a much more resilient way. In other words, it was not as easily fooled by, you know, tweaking a pixel here and there and making the dog look a little bit more like a cat and so on. To me it feels like, you know, there’s something here. It means that maybe we’ve been training neural networks the wrong way. Maybe we were stuck in 1970s thinking where we’re, you know, stingy about data. We’ve moved forward incredibly fast when it comes to the data we use to train the system, but when it comes to the labels, we’re still thinking like the 1970s, with the ones and zeros. So that may be something that can change the way we think about how AI is trained.
Jennifer: He sees the potential for helping systems gain efficiency, train with less data or just be more resilient. But he also believes this could lead to AI systems that are more individualized.
Hod Lipson: Maybe it’s easier to go from an image to audio than it is with a bit. A bit, it’s sort of unforgiving. It’s either right or wrong. Whereas an audio file, there’s so many ways to say dog, then maybe it’s more forgiving. So a lot of speculation about why that is, things that are easier. Maybe they’re easier to learn. Maybe, this is a really interesting hypothesis, maybe the way we say dog and cat is actually not a coincidence. Maybe we have chosen evolutionarily. We could’ve called, you know, we could have called a cat, you know, a smog instead of a dog. Okay. A cat. It would be too close to a dog and it would be confusing and nobody. It would take kids longer to tell the difference between a cat and a dog. So we humans have evolved to choose language and enunciations that are easy to learn and are appropriate and so maybe that touches also on sort of the history of language.
Jennifer: And he says the next stage of development… could be allowing AI to produce its own language in response to the images it’s shown.
Hod Lipson: We humans choose particular sounds in part because of our physiology and the kind of frequencies we can emit and all kinds of physical constraints. But if the AI can produce sounds in other ways, maybe it can produce its own language that is both easier for it to communicate and think, but also maybe it’s easier for it to learn. So, if we show it a cat and a dog and then it’s going to see a giraffe that it never saw before, I want it to come up with a name. And there’s a reason for that maybe because, you know, it’s based on how it looks with relationship to a cat and a dog and we’ll see where it goes from there. So if it learns with less data and if it’s more resilient and if it can make analogies more efficiently and, you know, see if it’s just a happy coincidence or if there’s really something deep here. And this is, I think, the sort of question that we need to answer next.
Jennifer: This episode was reported and produced by Anthony Green with help from me and Emma Cillekens. It was edited by Michael Reilly. Our mix engineer is Garret Lang and our theme music is by Jacob Gorski.
Thanks for listening, I’m Jennifer Strong.