By Dave DeFusco
In our globally connected world, technology needs to understand people, no matter what language they speak. Whether you're using voice assistants, translating documents or asking questions online, artificial intelligence is increasingly expected to work across dozens of languages. But while tools like Google Translate and chatbots handle English and other major languages well, they often stumble when dealing with languages that aren't widely spoken or studied.
So how can machines learn to understand languages that don't have large libraries of digital text or annotated examples? That's the challenge a team of Katz School researchers led by Dr. David Li, director of the M.S. in Data Analytics and Visualization, set out to solve with a new framework that significantly improves how AI understands "low-resource languages," those that lack the massive training datasets available for English, Spanish or Mandarin.
The Katz School team presented their study, "Cross-Lingual Text Augmentation: A Contrastive Learning Approach for Low-Resource Languages," in March at IEEE SoutheastCon 2025 in Charlotte, N.C.
"Our work centers on a field called cross-lingual natural language understanding, which involves building systems that can learn from high-resource languages and apply that knowledge to others," said Dr. Li. "Our approach combines clever data techniques and training methods to help machines 'transfer' what they've learned in one language to many others, without needing massive amounts of new information."
At the heart of today's language AI are models like XLM-RoBERTa and mBERT, powerful tools trained on text from dozens of languages. These models are surprisingly good at capturing patterns that are shared across languages, such as sentence structure or word meaning. But their performance drops dramatically when they deal with languages that have little training data because these models rely heavily on examples.
If a language doesn't have many labeled datasets (sentences paired with their meanings or categories), the model can't learn the nuances it needs to perform well. And it's not just about having enough data. Sometimes the data that is available comes from a narrow field, say, medical journals or government documents, so the model can't apply it easily to other domains like news articles or casual speech.
Traditional fixes, like creating synthetic data through back-translation (translating a sentence into another language and back again) or swapping in synonyms, help to a degree. But for truly underrepresented languages, even these strategies fall short, especially if good translation models don't exist for them. That's where this new research takes things a step further.
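For readers curious what back-translation looks like in practice, here is a minimal sketch using the Hugging Face transformers library and publicly available Helsinki-NLP MarianMT checkpoints. The models and the English-to-French round trip are illustrative choices, not necessarily those used in the study.

```python
# Minimal back-translation sketch using the Hugging Face transformers library.
# The MarianMT checkpoints and the English/French round trip are illustrative,
# not necessarily the models or languages used in the study.
from transformers import pipeline

en_to_fr = pipeline("translation", model="Helsinki-NLP/opus-mt-en-fr")
fr_to_en = pipeline("translation", model="Helsinki-NLP/opus-mt-fr-en")

def back_translate(sentence: str) -> str:
    """Round-trip a sentence to obtain a paraphrase for data augmentation."""
    french = en_to_fr(sentence)[0]["translation_text"]
    return fr_to_en(french)[0]["translation_text"]

print(back_translate("The storm forced residents to leave the coastal town."))
```

The round-tripped sentence usually keeps the original meaning but changes the wording, which is exactly what makes it useful as an extra training example.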
The researchers designed a multi-pronged strategy that makes language models more flexible, efficient and accurate in multilingual settings. Their approach focuses on four main innovations:
- Better Data Augmentation: Instead of relying on just one method, the team combined several: back-translation, synonym swapping and even changing sentence structures. This mix of methods helped create more diverse, higher-quality training examples without introducing too much noise or error.
- Contrastive Learning: The model is trained to recognize when two sentences in different languages mean the same thing, and when they don't. This strengthens the model's ability to match meanings across languages, even if the surface words look nothing alike (a sketch of this idea appears after this list).
- Dynamic Weight Adjustment: When learning multiple languages, AI often either overgeneralizes or misses the subtle features of each language. This feature lets the model dynamically balance general knowledge with language-specific quirks, keeping it accurate without losing sensitivity to detail.
- Adaptation Layers: These are like special filters added to the model that help it tune its responses to a specific task or language. They make the model more flexible and help it perform well even with just a small amount of labeled data.
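The contrastive objective mentioned above can be illustrated with a short sketch. The version below uses PyTorch and the off-the-shelf XLM-RoBERTa encoder, pulling matched translations together in a shared embedding space and pushing mismatched pairs apart; it shows the general technique rather than the authors' exact implementation.

```python
# Sketch of a contrastive objective over parallel sentences, using PyTorch and
# the off-the-shelf XLM-RoBERTa encoder. It illustrates the general technique
# (matched translations pulled together, mismatches pushed apart), not the
# authors' exact implementation.
import torch
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")
encoder = AutoModel.from_pretrained("xlm-roberta-base")

def embed(sentences):
    """Mean-pool token embeddings into one vector per sentence."""
    batch = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")
    hidden = encoder(**batch).last_hidden_state   # (batch, tokens, dim)
    mask = batch["attention_mask"].unsqueeze(-1)  # ignore padding tokens
    return (hidden * mask).sum(dim=1) / mask.sum(dim=1)

def contrastive_loss(src_sentences, tgt_sentences, temperature=0.07):
    """InfoNCE-style loss: the i-th source sentence should match the i-th
    translation and mismatch every other sentence in the batch."""
    src = F.normalize(embed(src_sentences), dim=-1)
    tgt = F.normalize(embed(tgt_sentences), dim=-1)
    logits = src @ tgt.T / temperature             # pairwise similarities
    labels = torch.arange(len(src_sentences))      # true pairs lie on the diagonal
    return F.cross_entropy(logits, labels)

loss = contrastive_loss(
    ["The museum opens at nine.", "She studies marine biology."],
    ["Le musée ouvre à neuf heures.", "Elle étudie la biologie marine."],
)
print(loss.item())
```

During training, this loss is minimized alongside the usual task objective, so sentences that are translations of each other end up close together in the model's embedding space.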
To see how their system measured up, the researchers tested it on three large datasets used in multilingual AI research: XNLI, which checks whether a model can understand logical relationships in sentences, like contradiction or agreement, across 15 languages; MLQA, which tests how well models answer questions in seven languages; and XTREME, a mega-benchmark covering 40 languages and a variety of tasks, from classification to structured prediction.
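As a concrete example, a single language split of the XNLI benchmark can be loaded with the Hugging Face datasets library; each example pairs a premise and a hypothesis with an entailment label. The snippet below only peeks at the data, and it is not the benchmarking harness used in the paper.

```python
# Sketch of loading one XNLI language split with the Hugging Face `datasets`
# library. The dataset and field names are real, but this is only a data
# peek, not the evaluation harness used in the study.
from datasets import load_dataset

# Swahili is one of the lower-resource languages covered by XNLI.
xnli_swahili = load_dataset("xnli", "sw", split="test")

example = xnli_swahili[0]
print(example["premise"])
print(example["hypothesis"])
print(example["label"])  # 0 = entailment, 1 = neutral, 2 = contradiction
```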
"In all cases, our new framework outperformed traditional methods, especially in low-resource settings," said Hang Yu, a co-author of the study and student in the M.S. in Artificial Intelligence. "The biggest gains came when contrastive learning and augmentation were combined, showing that giving the model diverse, quality examples and helping it link meanings across languages are both essential."
Even more impressive, the improvements came with only a small increase in computing power and memory use. That makes the framework a practical option for real-world applications, where resources and time are often limited.
To understand what really made the difference, the researchers ran an ablation study: turning off one component at a time to see what impact it had (a sketch of this setup follows the findings below). Here's what they found:
- Removing contrastive learning caused a noticeable drop in performance, confirming it was key to helping the model distinguish between similar and different meanings.
- Without cross-lingual feature mapping, accuracy dropped the most, proving that directly aligning features across languages is critical.
- Language-specific adapters and dynamic weight adjustments also played an important role, especially in preserving unique language traits.
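In code, an ablation study is often organized as a series of runs that each disable one component before retraining and rescoring, roughly as in the hypothetical sketch below. The component names and the train_and_evaluate callback are illustrative, not taken from the paper.

```python
# Hypothetical sketch of how an ablation study can be organized: score the
# full system, then repeat with one component switched off at a time.
# Component names and the train_and_evaluate callback are illustrative,
# not taken from the paper.
COMPONENTS = [
    "data_augmentation",
    "contrastive_learning",
    "cross_lingual_feature_mapping",
    "language_specific_adapters",
    "dynamic_weight_adjustment",
]

def run_ablation(train_and_evaluate):
    """Return a dict mapping each configuration to its evaluation score."""
    full_config = {name: True for name in COMPONENTS}
    results = {"full_model": train_and_evaluate(full_config)}
    for name in COMPONENTS:
        config = dict(full_config, **{name: False})  # disable exactly one piece
        results[f"without_{name}"] = train_and_evaluate(config)
    return results

# Example with a stand-in scorer (a real run would train the model per config):
scores = run_ablation(lambda config: sum(config.values()) / len(config))
print(scores)
```

Comparing each "without" score against the full model shows how much each piece contributes, which is how the drops described above were measured.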
"This research isn't just academic. In real-world scenarios, such as disaster response, global health communications or inclusive tech development, understanding low-resource languages can have life-saving consequences," said Ruiming Tian, a co-author of the study and student in the M.S. in Artificial Intelligence. "It also matters for cultural preservation, giving digital tools access to languages that might otherwise be ignored in the AI revolution."
The framework developed here offers a scalable, efficient way to close the gap between high- and low-resource languages. It shows that with the right techniques, AI can learn to understand not just the "big" languages but all the voices of the world.
"As AI becomes more deeply embedded in daily life, from phone apps to government services, ensuring it works well for everyone is both a technical and moral challenge," said Dr. Li. "This research moves us one step closer to that goal."