"Apologies, I didn't understand that": AI misinterprets the words of certain individuals more than those of others.
The idea of a human-like artificial intelligence assistant you can converse with has captivated many since the debut of "Her," Spike Jonze's 2013 film about a man who falls in love with a Siri-like AI named Samantha. Throughout the movie, the protagonist grapples with the fact that, however real she seems, Samantha is not and never will be human.
Now, twelve years later, this notion has shifted from science fiction to reality. Generative AI tools like ChatGPT, along with digital assistants such as Apple’s Siri and Amazon’s Alexa, aid people in tasks like navigating routes and creating shopping lists. However, similar to Samantha, automated speech recognition systems still fall short of fully replicating the capabilities of a human listener.
Many people have felt the frustration of calling their bank or utility company only to have to repeat themselves before the digital customer service bot understands. Perhaps you've dictated a message on your phone, then spent time untangling the jumbled words it produced.
Studies in linguistics and computer science have indicated that these systems perform inconsistently across different user groups. They usually produce more errors for individuals with non-native or regional accents, Black individuals, speakers of African American Vernacular English, code-switchers, women, older adults, younger individuals, or those with speech impairments.
Researchers point out that, unlike human listeners, automatic speech recognition systems are not "sympathetic listeners." Rather than working to understand you by drawing on cues such as tone or facial expressions, these systems simply give up or make probabilistic guesses, which can lead to mistakes.
As more companies and public organizations implement automatic speech recognition tools to save costs, people's interactions with these systems become increasingly unavoidable. However, as these technologies find their way into critical sectors such as emergency services, healthcare, education, and law enforcement, the potential for serious repercussions arises when they fail to accurately interpret spoken words.
Imagine a scenario where you’ve been injured in a car accident. When you call 911 for assistance, you are directed to a bot that screens out non-emergency calls, requiring multiple attempts for your message to be understood, thus wasting valuable time and escalating your stress during a critical moment.
What leads to such errors? Many of these disparities stem from biases in the linguistic data that developers use to train large language models. Developers teach AI systems to process human language by feeding them vast amounts of text and audio recordings of real speech, but the voices in that data often represent only a narrow slice of the population.
If a system achieves its highest accuracy when used by affluent white Americans in their 30s, it stands to reason that it was trained predominantly on speech from people who fit that profile, and that it will stumble more often with everyone else.
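Researchers typically quantify these disparities with word error rate: the share of words a speech recognition system gets wrong relative to a human transcript, computed separately for each group of speakers. The sketch below illustrates the idea; it is not from the article, and the group labels and transcripts are invented for illustration.

```python
# A minimal sketch of how per-group accuracy gaps can be measured.
# Group labels and transcripts are hypothetical, for illustration only.

def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = word-level edit distance (substitutions + insertions + deletions)
    divided by the length of the reference transcript."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between the first i reference words
    # and the first j hypothesis words
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,          # deletion
                           dp[i][j - 1] + 1,          # insertion
                           dp[i - 1][j - 1] + cost)   # substitution or match
    return dp[-1][-1] / max(len(ref), 1)

# Hypothetical evaluation set: (speaker group, human transcript, ASR output)
samples = [
    ("Group A", "turn off the kitchen lights", "turn off the kitchen lights"),
    ("Group A", "set a timer for ten minutes", "set a timer for ten minutes"),
    ("Group B", "turn off the kitchen lights", "turn of the chicken lights"),
    ("Group B", "set a timer for ten minutes", "set a time for ten minute"),
]

per_group: dict[str, list[float]] = {}
for group, reference, hypothesis in samples:
    per_group.setdefault(group, []).append(word_error_rate(reference, hypothesis))

for group, rates in per_group.items():
    print(f"{group}: mean WER = {sum(rates) / len(rates):.2f}")
```

In published audits, the same comparison run over thousands of real recordings is what exposes the gaps between demographic groups that the studies above report.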
By undertaking extensive data collection from a wide array of sources, AI developers could potentially minimize these inaccuracies. Yet, creating AI systems capable of comprehending the vast spectrum of human speech variations—including factors like gender, age, race, primary vs. secondary language, socioeconomic background, ability, and more—requires substantial time and resources.
For non-English speakers—representing a significant portion of the global population—the situation is even more challenging. Most of the leading generative AI systems have been developed in English, functioning significantly better in that language compared to others. Although AI holds tremendous promise for enhancing translation and access to diverse information, many languages still lack a robust digital presence, impeding the development of large language models.
Even languages well-supported by large models, such as English and Spanish, yield varied user experiences based on dialects. Currently, the majority of speech recognition systems and generative AI chatbots reflect the linguistic biases present in their training datasets. They reinforce prescriptive and sometimes biased views of what constitutes "correct" speech.
In fact, AI has been shown to "flatten" linguistic diversity. Some AI startups offer services to remove users' accents, operating under the assumption that their primary clients would be customer service representatives from countries like India or the Philippines. This perspective perpetuates the idea that certain accents are less legitimate than others.
AI is expected to get better at processing language and accommodating variation such as accents and code-switching. In the United States, public services are mandated by federal law to ensure equitable access to services, irrespective of a person's spoken language. However, it remains uncertain whether this will suffice to motivate the tech industry to address linguistic inequities.
Many individuals may prefer speaking with a real person when inquiring about a bill or health concern or at least want the option to bypass automated systems for essential services. While miscommunication can occur in human interactions, real individuals are typically more prepared to listen empathetically.
With AI, the reality for now is binary: either it works or it doesn't. If the system fails to understand you, the burden of making yourself understood falls entirely on you.
Roberto Rey Agudo, Research Assistant Professor of Spanish and Portuguese, Dartmouth College
This article is republished from The Conversation under a Creative Commons license. Read the original article.