The Nitty-Gritty of Voice Technology
Voice interaction is the primordial human interaction. From early grunts to rhythmic iambic pentameter to lyrical music renditions of the spoken word, its very antiquity and various evolutions are rich with complexity.
In 1968, the seminal movie 2001: A Space Odyssey envisioned a computer system capable of nuanced conversation as well as self-awareness.
Today, consumers are exposed to varying degrees of voice technology. And the mission of creating a fully interactive, computer-based voice experience remains.
The process of applying compute resources and algorithms to capture, recreate and ultimately mimic a seemingly intuitive process exposes just how complex speaking and listening comprehension can be.
At a high level, there are three parts to Voice Technology (VT):
- Acoustics. The acoustics part of VT takes in sound as its input and outputs phonemes, the fundamental units of speech sound.
- Language. The language aspect of VT deals with the higher-level abstractions of chaining phonemes into words, and words into phrases and sentences.
- Context. Context resolves the many possible interpretations of a phrase or sentence: first its literal meaning, and finally its contextual meaning.
That’s a lot to parse, so let’s break it up with examples to better understand the complexity.
The human voice is sound produced by the vocal tract. Like any sound, it comes down to simple physics: a mix of pitch, amplitude (loudness), and rate (speed).
As the first step in VT, the sound must get into the computer. The analog waves of sound are digitized, and herein lies the first complexity.
Is it 0.999999… or 1.0?
A classic example for exposing the imprecise nature of analog-to-digital conversion is the thirds problem. We all know 1/3 + 1/3 + 1/3 = 1. But a computer that stores 1/3 with a limited number of digits holds something like 0.3333, and 0.3333 + 0.3333 + 0.3333 = 0.9999. Almost, but not exactly, one.
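The thirds problem can be reproduced in a few lines. This is a minimal sketch using Python's standard decimal module; the function name sum_of_thirds and the digit limits are illustrative assumptions, not anything a real speech system uses.

```python
from decimal import Decimal, localcontext

def sum_of_thirds(digits):
    """Add 1/3 three times while keeping only `digits` significant digits."""
    with localcontext() as ctx:
        ctx.prec = digits  # assumed storage limit, in decimal digits
        third = Decimal(1) / Decimal(3)
        return third + third + third

print(sum_of_thirds(4))   # 0.9999
print(sum_of_thirds(10))  # 0.9999999999
```

More digits shrink the gap, but with any finite number of digits the sum never quite reaches one.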
Coming back to VT, the process of analog-to-digital conversion requires sampling (measuring and recording) the sound waves in small time-slices. The shorter the sampling interval (1 sec, 100 msec, 10 msec, …), the more precise the digital representation.
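To make the effect of the sampling interval concrete, here is a rough sketch that samples a pure tone and measures how far a simple "hold the last sample" reconstruction drifts from the original. The 5 Hz tone, the helper names, and the three intervals are all assumptions chosen for illustration.

```python
import math

def sample(signal, duration_s, interval_s):
    """Measure `signal` every `interval_s` seconds over `duration_s` seconds."""
    count = round(duration_s / interval_s)
    return [signal(i * interval_s) for i in range(count)]

def worst_error(signal, duration_s, interval_s, checks=1000):
    """Largest gap between the signal and its sample-and-hold reconstruction."""
    samples = sample(signal, duration_s, interval_s)
    worst = 0.0
    for k in range(checks):
        t = k * duration_s / checks
        # Use the most recent sample taken at or before time t.
        held = samples[min(int(t / interval_s), len(samples) - 1)]
        worst = max(worst, abs(signal(t) - held))
    return worst

def tone(t):
    return math.sin(2 * math.pi * 5 * t)  # a 5 Hz tone, chosen for illustration

for interval in (0.1, 0.01, 0.001):
    print(interval, round(worst_error(tone, 1.0, interval), 3))
```

With these assumed numbers, the 100 msec interval happens to land on the tone's zero crossings every time, so the samples are all close to zero and the tone is missed entirely. Shorter intervals track it progressively better.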
These time-slices, now available as discrete numerical values, are then pattern matched to the appropriate phonemes. The vast computational resources available to voice technology processing today make short work of pattern matching. However, not everyone speaks clearly, concisely, and with adequate pauses to ensure clean phoneme matching.
To get a sense of the complexities, let’s examine the word ‘cat.’ Focus on the first phoneme, the ‘KA’ sound. In the analog-to-digital conversion, if the entire ‘KA’ phoneme is captured in a single time-slice, then pattern-matching it to ‘KA’ is possible. Alternatively, if the phoneme ‘KA’ is split across two time-slices, with ‘iK’ in one time-slice and ‘Aa’ in the second, matching the ‘KA’ phoneme becomes impossible.
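The split-phoneme problem can be sketched with a toy stand-in for digitized audio. Everything here is hypothetical: the characters standing for sound, the frames and contains helpers, and the slice sizes. The overlapping variant is a common mitigation in signal processing, shown as an aside rather than as something described above.

```python
def frames(samples, size, hop):
    """Cut `samples` into time-slices of `size` items, advancing `hop` items each time."""
    return [samples[i:i + size] for i in range(0, len(samples) - size + 1, hop)]

def contains(frame, pattern):
    """True when `pattern` appears unbroken inside a single frame."""
    n = len(pattern)
    return any(frame[i:i + n] == pattern for i in range(len(frame) - n + 1))

# Toy audio: '.' is silence, 'K' and 'A' are the two halves of the 'KA' sound.
audio = list("...KA...")
pattern = list("KA")

back_to_back = frames(audio, size=4, hop=4)  # slices: "...K" and "A..."
overlapping = frames(audio, size=4, hop=2)   # adds the slice ".KA."

print(any(contains(f, pattern) for f in back_to_back))  # False
print(any(contains(f, pattern) for f in overlapping))   # True
```

With back-to-back slices the boundary falls in the middle of the sound and no single slice holds all of it. Letting slices overlap gives the matcher a second chance, at the cost of more slices to process.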
It’s not what you say but how you say it:
The diversity of human voices and speech patterns, while amazing to cherish, is a larger technical problem in itself. Using the same example as above, the ‘KA’ phoneme uttered at a lower pitch by men versus a higher pitch by women, or by fast-talking northerners versus southerners with an extended drawl, results in different digital patterns.
Adapting algorithms to account for these variabilities is challenging. Imagine you want a computer to match a square, a simple shape with four sides. Sometimes the item to match is tall and narrow, sometimes it is wide and short. Sometimes it’s hard to tell where the left edge of the box starts or where the right edge finishes. Maybe the top edge has a slight curve or the bottom edge a dip. All are ‘boxes,’ but in different sizes and shapes.
The same issue happens in VT. The amplitude (loudness) and rate (speed) at which the same phoneme is uttered vary with the racial, cultural, regional, social, and other aspects of the wonderful diversity among speakers.
Computers are great with ‘exact,’ yes or no, on or off. But people, and their speech patterns, are far from ‘exact’.
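One simple way to picture "matching boxes of different sizes" is to rescale every sound to the same loudness and the same length before comparing. The sketch below is only an illustration under assumed toy data; the normalize and distance helpers are hypothetical, and real systems use far more robust techniques.

```python
def normalize(samples, length=8):
    """Rescale a sound to unit loudness and a fixed number of points."""
    peak = max(abs(s) for s in samples) or 1.0
    scaled = [s / peak for s in samples]
    # Resample by linear interpolation so fast and slow utterances line up.
    out = []
    for i in range(length):
        pos = i * (len(scaled) - 1) / (length - 1)
        lo = int(pos)
        hi = min(lo + 1, len(scaled) - 1)
        frac = pos - lo
        out.append(scaled[lo] * (1 - frac) + scaled[hi] * frac)
    return out

def distance(a, b):
    """Largest point-by-point gap between two normalized sounds."""
    return max(abs(x - y) for x, y in zip(normalize(a), normalize(b)))

# Toy data: the same rise-and-fall shape spoken quietly and fast, then loudly and slowly.
quiet_fast = [0.0, 0.2, 0.4, 0.2, 0.0]
loud_slow = [0.0, 0.5, 1.0, 1.5, 2.0, 1.5, 1.0, 0.5, 0.0]
other = [0.0, 2.0, 0.0, 2.0, 0.0, 2.0, 0.0, 2.0, 0.0]

print(distance(quiet_fast, loud_slow) < 1e-9)  # True: same shape once normalized
print(distance(quiet_fast, other) > 0.5)       # True: a genuinely different shape
```

The raw numbers for the first two sounds look nothing alike, yet they collapse to the same pattern after normalization. Real voices vary in many more ways than loudness and speed, which is why the problem stays hard.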
Assuming all of the phonemes are extracted correctly in the analog-to-digital conversion process, the next chain of complexities starts: putting sounds together into something we recognize as ‘words.’
From Sound to Word
As children, humans start building the vocabulary of their native language and quickly learn how to match phonemes to words within their vocabulary. For example, most English speakers will recognize that the ‘Ka’ and ‘iT’ phonemes, in that order, form the word ‘cat.’
But consider the sounds in reverse order: ’iT-Ka’ is not a word. While this is easily recognizable to humans, it needs to be accounted for in the language algorithms.
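In code, the order sensitivity can be sketched as a lookup keyed on the ordered phoneme sequence. The one-entry dictionary below is a hypothetical stand-in for a pronunciation dictionary and reuses the informal phoneme labels from the text.

```python
# Hypothetical pronunciation dictionary: ordered phonemes -> word.
PRONUNCIATIONS = {
    ("Ka", "iT"): "cat",
}

def to_word(phonemes):
    """Return the word for an ordered phoneme sequence, or None if there is no match."""
    return PRONUNCIATIONS.get(tuple(phonemes))

print(to_word(["Ka", "iT"]))  # cat
print(to_word(["iT", "Ka"]))  # None
```

Because the key is an ordered tuple, the reversed sequence simply finds no entry, which is the behavior a language algorithm has to encode explicitly.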
Human languages are well diagrammed and described, so the matching may be straightforward if there’s good input data.
Identifying the correct phonemes and delineating them into actual words is tough, but then it gets exponentially more difficult.
I need some context here
Take a moment and say these two phrases out loud:
“Recognize speech.” and then “Wreck a nice beach.”
Can you hear it? The two phrases are formed from nearly the same phonemes and sound very similar. Try to imagine how the speech processing engine sees it:
Not all phonemes are converted directly to words and then words to phrases. Phonemes are chained into a phrase, then a second pass of processing identifies appropriate words to fit the phrase. It’s possible to occasionally witness this process if you look at your device’s screen while it’s processing a speech command.
The words begin to appear, then pause while additional context processing struggles to work out whether the speaker meant ‘see’ or ‘sea’.
The two phrases above are formed from the same speech input. Which is the correct phrase depends on the steps further down the conversion chain.
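The ambiguity can be sketched as a search over every way of cutting one phoneme string into known words. To keep the toy exact, this sketch swaps in a different, well-known pair ("I scream" and "ice cream") whose phoneme strings are identical; the ARPAbet-style labels, the tiny lexicon, and the segmentations helper are all assumptions for illustration.

```python
# Hypothetical mini-lexicon: phoneme tuples mapped to words.
WORDS_BY_PHONEMES = {
    ("AY",): "I",
    ("AY", "S"): "ice",
    ("S", "K", "R", "IY", "M"): "scream",
    ("K", "R", "IY", "M"): "cream",
}

def segmentations(phonemes):
    """Return every way to split `phonemes` into words found in the lexicon."""
    if not phonemes:
        return [[]]
    results = []
    for end in range(1, len(phonemes) + 1):
        word = WORDS_BY_PHONEMES.get(tuple(phonemes[:end]))
        if word is not None:
            # Keep this word and try to segment whatever is left.
            for rest in segmentations(phonemes[end:]):
                results.append([word] + rest)
    return results

heard = ["AY", "S", "K", "R", "IY", "M"]
for words in segmentations(heard):
    print(" ".join(words))
# I scream
# ice cream
```

Both outputs are valid readings of the same input. Nothing in the sound alone says which one the speaker meant, so the choice has to come from later steps.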
I know what I heard, but what does it mean?
Each language has its own complexity related to homonyms and homophones: words that share a spelling and a sound but differ in meaning, versus words that sound the same but differ in both meaning and spelling. This is taught in primary school, but is shockingly complex for an algorithm to ‘understand.’
Homonym complexity – The word ‘fair’ can have varying meanings, as in ‘fair in appearance’ or a ‘trade fair.’ Both senses are constructed from the same phonemes and share the same spelling.
Homophone complexity – The words ‘see’ and ‘sea’ are constructed from the same phonemes, with a slight change in spelling and completely different meanings.
Even if the algorithm gets the phoneme structure correct, it’s ripe for mistakes and misunderstandings without added context.
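A very rough sketch of "added context" is to let the previous word vote on the spelling. The counts below are invented for illustration, and the HOMOPHONES table, FOLLOWS table, and pick_spelling helper are hypothetical names, not part of any real assistant.

```python
# Hypothetical data: one pronunciation with two spellings, plus made-up counts
# of how often each spelling follows a given previous word.
HOMOPHONES = {("S", "IY"): ["see", "sea"]}
FOLLOWS = {
    ("i", "see"): 9, ("i", "sea"): 0,
    ("the", "see"): 0, ("the", "sea"): 7,
}

def pick_spelling(previous_word, phonemes):
    """Choose the candidate spelling seen most often after `previous_word`."""
    candidates = HOMOPHONES[tuple(phonemes)]
    return max(candidates, key=lambda w: FOLLOWS.get((previous_word.lower(), w), 0))

print(pick_spelling("I", ["S", "IY"]))    # see
print(pick_spelling("the", ["S", "IY"]))  # sea
```

The same sound resolves to different words depending on its neighbor. A homonym like ‘fair’ is harder still, because the spelling gives no hint and only the meaning differs.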
For most people, this is the level of Voice Technology they experience through Alexa, Google, or Siri: the ability to ask a single, open-ended question and get some response. The Acoustic and Language processes, coupled with the ever-improving Context-awareness of these and many other providers, have gotten pretty good. Each day new dialects and languages are added to the system’s pool of understanding.
The natural next step is to move beyond single-command, single-response exchanges (“What is the weather?” or “Play this song” or “What is the Giants score?”).
Conversation – the last frontier
To understand the challenge ahead, imagine the following exchange:
I say: “This is the best Medium post ever.”
You say: “Yeah, right.”
No doubt, in simple language terms this is an emphatic agreement – the best Medium post ever. Or is it?
Most English speakers will know that the response is sarcastic, and therefore the exact opposite of the literal meaning. Telling literal meaning from contextual meaning, as in the case of sarcasm, is exceptionally difficult. Embedded in tone, pitch, and context, sarcasm is just one of many complex nuances.
Let’s take another example: “The hammer hit the glass table, and it broke.”
Again, almost all of us can immediately decipher that the word ‘it’ refers to the ‘glass table’ and not the ‘hammer.’
But take a moment to analytically consider how we were able to decipher that phrase. Then how do we make a computer achieve the same level of awareness?
This contextual meaning, something a 3-year-old child can naturally understand, is extremely complicated for a computer to ‘understand.’
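One way to see what the computer is missing is to write the needed knowledge out by hand. The sketch below resolves ‘it’ using a tiny, entirely hypothetical table of facts about objects; the PROPERTIES and REQUIRED_BY_VERB tables and the resolve_it helper are assumptions made for illustration only.

```python
# Hypothetical world knowledge about the objects in the sentence.
PROPERTIES = {
    "hammer": {"hard", "heavy"},
    "glass table": {"fragile", "flat"},
}
# Hypothetical rule: what a thing must be for the verb to apply to it.
REQUIRED_BY_VERB = {"broke": "fragile"}

def resolve_it(candidates, verb):
    """Pick the candidate noun whose properties fit the verb applied to 'it'."""
    needed = REQUIRED_BY_VERB[verb]
    fits = [c for c in candidates if needed in PROPERTIES.get(c, set())]
    return fits[0] if len(fits) == 1 else None  # None means still ambiguous

print(resolve_it(["hammer", "glass table"], "broke"))  # glass table
```

The answer falls out only because someone typed in that glass tables are fragile and hammers are not. A child brings that knowledge along for free; a computer needs it for every object and every verb.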
Aspects of verbal communication like sarcasm, humor, homonyms, and homophones are hard enough for computer-based interpretation. But even mastering those may not be enough.
Various studies have shown that the audio aspect of communicating, the only input available for computers, constitutes anywhere from 30% to just 7% of the full communication experience. Up to 93% of communication is non-verbal, conveyed primarily through facial expressions, eye/focus, hand gestures, posture, and more.
What chance do Alexa and Siri have for engaging conversation if they have 30% or less of the full picture?
The complexities above are just the theoretical aspects of Voice Technology. There is an ocean of complexities behind the speaker’s nuance and the context that the speaker and listener are in.
Managing the enormous set of vagaries to create a computer-based interactive conversation that seems natural is exceedingly difficult. To keep one’s ‘eye on the prize,’ so to speak, many organizations host contests and offer prizes.
On not winning these prizes yet, “I am ok” – Can you figure out how I really am from this sentence? When will a Computer?
Shiva Nathan is the Founder and CEO of Onymos, and a veteran engineering leader.