By Dave DeFusco
In our globally connected world, technology needs to understand people, no matter what language they speak. Whether you're using voice assistants, translating documents or asking questions online, artificial intelligence is increasingly expected to work across dozens of languages. But while tools like Google Translate and chatbots handle English and other major languages well, they often stumble when dealing with languages that aren't widely spoken or studied.
So how can machines learn to understand languages that don't have large libraries of digital text or annotated examples? That's the challenge a team of Katz School researchers led by Dr. David Li, director of the M.S. in Data Analytics and Visualization, set out to solve with a new framework that significantly improves how AI understands "low-resource languages," those that lack the massive training datasets available for English, Spanish or Mandarin.
The Katz School team presented their study, "Cross-Lingual Text Augmentation: A Contrastive Learning Approach for Low-Resource Languages," in March at IEEE SoutheastCon 2025 in Charlotte, N.C.
"Our work centers on a field called cross-lingual natural language understanding, which involves building systems that can learn from high-resource languages and apply that knowledge to others," said Dr. Li. "Our approach combines clever data techniques and training methods to help machines 'transfer' what they've learned in one language to many others, without needing massive amounts of new information."
At the heart of today's language AI are models like XLM-RoBERTa and mBERT, powerful tools trained on text from dozens of languages. These models are surprisingly good at capturing patterns that are shared across languages, such as sentence structure or word meaning. But their performance drops dramatically when they deal with languages that have little training data because these models rely heavily on examples.
If a language doesn't have many labeled datasets (sentences paired with their meanings or categories), the model can't learn the nuances it needs to perform well. And it's not just about having enough data. Sometimes the data that is available comes from a narrow field, say, medical journals or government documents, so the model can't apply it easily to other domains like news articles or casual speech.
Traditional fixes, like creating synthetic data through back-translation (translating a sentence into another language and back again) or swapping in synonyms, help to a degree. But for truly underrepresented languages, even these strategies fall short, especially if good translation models don't exist for them. That's where this new research takes things a step further.
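For readers curious what back-translation looks like in practice, here is a minimal sketch using the Hugging Face transformers library and publicly available Helsinki-NLP MarianMT checkpoints. The models and the English-to-French round trip are illustrative choices, not necessarily those used in the study.

```python
# Minimal back-translation sketch using the Hugging Face transformers library.
# The MarianMT checkpoints and the English/French round trip are illustrative,
# not necessarily the models or languages used in the study.
from transformers import pipeline

en_to_fr = pipeline("translation", model="Helsinki-NLP/opus-mt-en-fr")
fr_to_en = pipeline("translation", model="Helsinki-NLP/opus-mt-fr-en")

def back_translate(sentence: str) -> str:
    """Round-trip a sentence to obtain a paraphrase for data augmentation."""
    french = en_to_fr(sentence)[0]["translation_text"]
    return fr_to_en(french)[0]["translation_text"]

print(back_translate("The storm forced residents to leave the coastal town."))
```

The round-tripped sentence usually keeps the original meaning but changes the wording, which is exactly what makes it useful as an extra training example.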
The researchers designed a multi-pronged strategy that makes language models more flexible, efficient and accurate in multilingual settings. Their approach focuses on four main innovations:
- Better Data Augmentation: Instead of relying on just one method, the team combined several: back-translation, synonym swapping and even changing sentence structures. This mix of methods helped create more diverse, higher-quality training examples without introducing too much noise or error.
- Contrastive Learning: The model is trained to recognize when two sentences in different languages mean the same thing, and when they don't. This strengthens the model's ability to match meanings across languages, even if the surface words look nothing alike (a sketch of this idea appears after this list).
- Dynamic Weight Adjustment: When learning multiple languages, AI often either overgeneralizes or misses the subtle features of each language. This feature lets the model dynamically balance general knowledge with language-specific quirks, keeping it accurate without losing sensitivity to detail.
- Adaptation Layers: These are like special filters added to the model that help it tune its responses to a specific task or language. They make the model more flexible and help it perform well even with just a small amount of labeled data.
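The contrastive objective mentioned above can be illustrated with a short sketch. The version below uses PyTorch and the off-the-shelf XLM-RoBERTa encoder, pulling matched translations together in a shared embedding space and pushing mismatched pairs apart; it shows the general technique rather than the authors' exact implementation.

```python
# Sketch of a contrastive objective over parallel sentences, using PyTorch and
# the off-the-shelf XLM-RoBERTa encoder. It illustrates the general technique
# (matched translations pulled together, mismatches pushed apart), not the
# authors' exact implementation.
import torch
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")
encoder = AutoModel.from_pretrained("xlm-roberta-base")

def embed(sentences):
    """Mean-pool token embeddings into one vector per sentence."""
    batch = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")
    hidden = encoder(**batch).last_hidden_state   # (batch, tokens, dim)
    mask = batch["attention_mask"].unsqueeze(-1)  # ignore padding tokens
    return (hidden * mask).sum(dim=1) / mask.sum(dim=1)

def contrastive_loss(src_sentences, tgt_sentences, temperature=0.07):
    """InfoNCE-style loss: the i-th source sentence should match the i-th
    translation and mismatch every other sentence in the batch."""
    src = F.normalize(embed(src_sentences), dim=-1)
    tgt = F.normalize(embed(tgt_sentences), dim=-1)
    logits = src @ tgt.T / temperature             # pairwise similarities
    labels = torch.arange(len(src_sentences))      # true pairs lie on the diagonal
    return F.cross_entropy(logits, labels)

loss = contrastive_loss(
    ["The museum opens at nine.", "She studies marine biology."],
    ["Le musée ouvre à neuf heures.", "Elle étudie la biologie marine."],
)
print(loss.item())
```

During training, this loss is minimized alongside the usual task objective, so sentences that are translations of each other end up close together in the model's embedding space.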
To see how their system measured up, the researchers tested it on three large datasets used in multilingual AI research: XNLI, which checks whether a model can understand logical relationships in sentences, like contradiction or agreement, across 15 languages; MLQA, which tests how well models answer questions in seven languages; and XTREME, a mega-benchmark covering 40 languages and a variety of tasks, from classification to structured prediction.
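As a concrete example, a single language split of the XNLI benchmark can be loaded with the Hugging Face datasets library; each example pairs a premise and a hypothesis with an entailment label. The snippet below only peeks at the data, and it is not the benchmarking harness used in the paper.

```python
# Sketch of loading one XNLI language split with the Hugging Face `datasets`
# library. The dataset and field names are real, but this is only a data
# peek, not the evaluation harness used in the study.
from datasets import load_dataset

# Swahili is one of the lower-resource languages covered by XNLI.
xnli_swahili = load_dataset("xnli", "sw", split="test")

example = xnli_swahili[0]
print(example["premise"])
print(example["hypothesis"])
print(example["label"])  # 0 = entailment, 1 = neutral, 2 = contradiction
```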
"In all cases, our new framework outperformed traditional methods, especially in low-resource settings," said Hang Yu, a co-author of the study and student in the M.S. in Artificial Intelligence. "The biggest gains came when contrastive learning and augmentation were combined, showing that giving the model diverse, quality examples and helping it link meanings across languages are both essential."
Even more impressive, the improvements came with only a small increase in computing power and memory use. That makes the framework a practical option for real-world applications, where resources and time are often limited.
To understand what really made the difference, the researchers ran an ablation study: turning off one component at a time to see what impact it had (a sketch of this setup follows the findings below). Here's what they found:
- Removing contrastive learning caused a noticeable drop in performance, confirming it was key to helping the model distinguish between similar and different meanings.
- Without cross-lingual feature mapping, accuracy dropped the most, proving that directly aligning features across languages is critical.
- Language-specific adapters and dynamic weight adjustments also played an important role, especially in preserving unique language traits.
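In code, an ablation study is often organized as a series of runs that each disable one component before retraining and rescoring, roughly as in the hypothetical sketch below. The component names and the train_and_evaluate callback are illustrative, not taken from the paper.

```python
# Hypothetical sketch of how an ablation study can be organized: score the
# full system, then repeat with one component switched off at a time.
# Component names and the train_and_evaluate callback are illustrative,
# not taken from the paper.
COMPONENTS = [
    "data_augmentation",
    "contrastive_learning",
    "cross_lingual_feature_mapping",
    "language_specific_adapters",
    "dynamic_weight_adjustment",
]

def run_ablation(train_and_evaluate):
    """Return a dict mapping each configuration to its evaluation score."""
    full_config = {name: True for name in COMPONENTS}
    results = {"full_model": train_and_evaluate(full_config)}
    for name in COMPONENTS:
        config = dict(full_config, **{name: False})  # disable exactly one piece
        results[f"without_{name}"] = train_and_evaluate(config)
    return results

# Example with a stand-in scorer (a real run would train the model per config):
scores = run_ablation(lambda config: sum(config.values()) / len(config))
print(scores)
```

Comparing each "without" score against the full model shows how much each piece contributes, which is how the drops described above were measured.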
"This research isn't just academic. In real-world scenarios, such as disaster response, global health communications or inclusive tech development, understanding low-resource languages can have life-saving consequences," said Ruiming Tian, a co-author of the study and student in the M.S. in Artificial Intelligence. "It also matters for cultural preservation, giving digital tools access to languages that might otherwise be ignored in the AI revolution."
The framework developed here offers a scalable, efficient way to close the gap between high- and low-resource languages. It shows that with the right techniques, AI can learn to understand not just the "big" languages but all the voices of the world.
"As AI becomes more deeply embedded in daily life, from phone apps to government services, ensuring it works well for everyone is both a technical and moral challenge," said Dr. Li. "This research moves us one step closer to that goal."