Every few decades, technology reshapes how humanity understands itself. The AI revolution, unfolding rapidly over the last three years, stands as one of those pivotal moments. Now it is not just the words but the melody behind them that technology is beginning to uncover and understand.
With large language models like ChatGPT gaining prominence, it has become clear that machines can mimic human communication impressively well. Yet, a deeper layer of interaction has remained untouched – one that is not found in words alone.
A recent study from Professor Elisha Moses’s lab at the Weizmann Institute of Science is now bringing this missing dimension into focus.
The research reveals that the melody of speech – known as prosody – is a structured, independent language in its own right. It has vocabulary, semantics, and syntax, waiting to be deciphered.
In everyday life, words form only a part of human interaction. Prosody – the music of speech – encompasses pitch changes, variations in loudness, tempo shifts, and tonal quality.
This expressive toolkit adds emotional and functional depth to communication, influencing meaning even when the words remain the same.
Prosody itself is nothing new in evolutionary terms. Studies show that both chimpanzees and whales use prosodic structures in their communication, suggesting that prosody predates language itself. In humans, a pause can change meaning dramatically.
A subtle rise in tone can signal curiosity or challenge. Speech without prosody, as in robotic voices, sounds unnatural and flat.
“Our study lays the foundation for an automated system to compile a ‘dictionary’ of prosody for every human language and for different speaker populations,” noted the researchers.
Dr. Nadav Matalon and Dr. Eyal Weinreb, leading the research from Moses’s lab, decided to study prosody like an unknown language.
The experts turned to vast databases of spontaneous English conversations: the CallHome Corpus and the Santa Barbara Corpus. Instead of relying on written or rehearsed speech, they sought the chaotic beauty of real-life conversations.
The team’s method involved breaking speech into Intonation Units (IUs). Each IU usually consists of one to four words, marked by coherent melody and rhythm. This structure allowed the team to represent each IU as a pitch-intensity vector and to cluster similar melodies without any prior assumptions.
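To make the approach concrete, here is a minimal, illustrative sketch in Python. It is not the authors' code: it assumes IU boundaries have already been marked, uses librosa's pyin pitch tracker and RMS energy as stand-ins for the pitch and intensity measurements, and groups the resulting contours with k-means, where the number of clusters is simply a free parameter of the sketch.

```python
# Illustrative sketch (not the study's actual pipeline): represent each
# intonation unit (IU) as a fixed-length pitch-intensity vector, then
# cluster the vectors to look for recurring melodic shapes.
# Assumes a list of (audio_path, start_sec, end_sec) IU boundaries.
import numpy as np
import librosa
from sklearn.cluster import KMeans

N_POINTS = 50      # resample every IU contour to the same length
N_PATTERNS = 200   # number of clusters; a free parameter in this sketch

def iu_vector(audio_path, start_sec, end_sec, sr=16000):
    """Return a normalized pitch + intensity contour for one IU."""
    y, _ = librosa.load(audio_path, sr=sr, offset=start_sec,
                        duration=end_sec - start_sec)
    # Fundamental frequency (pitch) track; unvoiced frames come back as NaN.
    f0, _, _ = librosa.pyin(y, fmin=75, fmax=400, sr=sr)
    f0 = np.nan_to_num(f0, nan=np.nanmedian(f0))
    # Frame-level loudness (RMS energy).
    rms = librosa.feature.rms(y=y)[0]
    # Resample both contours to a fixed number of points and z-score them,
    # so clustering compares melodic shape rather than absolute pitch or volume.
    grid = np.linspace(0, 1, N_POINTS)
    pitch = np.interp(grid, np.linspace(0, 1, len(f0)), f0)
    loud = np.interp(grid, np.linspace(0, 1, len(rms)), rms)
    z = lambda v: (v - v.mean()) / (v.std() + 1e-8)
    return np.concatenate([z(pitch), z(loud)])

def cluster_ius(iu_list):
    """Cluster IU contours into candidate prosodic 'words'."""
    X = np.vstack([iu_vector(*iu) for iu in iu_list])
    km = KMeans(n_clusters=N_PATTERNS, n_init=10, random_state=0).fit(X)
    return km.labels_, km.cluster_centers_
```

The key design point is the normalization step: by z-scoring each contour, two speakers with very different voices can still land in the same cluster if the shape of their melody matches.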
The researchers likened the undertaking to the compilation of the Oxford English Dictionary. “When the University of Oxford was tasked with compiling one, it asked the public to help with the workload by sending quotes showing the historical changes in the meaning of words,” explained the researchers.
“One of the main contributors was a prisoner who spent more than 20 years reading books and sending quotes. In our study, instead of collecting information by ourselves over the course of decades, we analyzed massive collections of audio recordings, using AI.”
From this clustering, the researchers uncovered roughly 200 distinct prosodic patterns. This number stands in stark contrast to the thousands of words in the core verbal vocabulary of English. Each prosodic pattern, lasting around a second, acted like a “word” in the hidden language of melody.
Despite the differences in individual voices, these melodic shapes appeared consistently across spontaneous conversations. Each shape could have several linguistic functions depending on context, yet typically expressed a dominant emotional attitude such as enthusiasm, skepticism, or curiosity.
“We discovered that each pattern has several linguistic functions,” explained Matalon. “For example, depending on the context, a pattern can define whether someone is asking a question or making a statement.”
“However, each pattern typically conveys one specific attitude of the speaker – such as curiosity, surprise or confusion – toward what’s being said.”
Understanding these prosodic patterns offers more than simple emotional coloring. It hints at a deeper, structured system where the melody itself carries grammatical-like information.
Beyond identifying basic prosodic “words,” the researchers discovered rules for how these melodic units combine. They found that certain prosodic patterns tend to appear in pairs, with one IU predicting the next through simple, memory-friendly rules resembling a Markov process.
“We noticed that there are patterns that tend to appear next to each other, in pairs, in spontaneous speech,” explained Weinreb.
“It’s a simple statistical system, in which the correct choice of the next unit in a sequence depends solely on the previous one. This system works well for spontaneous conversation because it requires planning only a few seconds ahead, which is just as long as short-term memory lasts.”
Unlike written grammar, where sentences unfold over multiple clauses and ideas, spoken language often plans just one or two seconds ahead. This dynamic allows humans to communicate flexibly in real time, yet still follow an underlying structure.
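As a rough illustration of that idea, and not the study's actual model, the sketch below estimates a first-order transition matrix from a sequence of prosodic-pattern labels, so that the probability of the next pattern depends only on the current one. The label sequence in the example is made up.

```python
# Sketch of the "Markov-like" pairing idea: given the cluster labels assigned
# to consecutive IUs in a conversation, estimate first-order transition
# probabilities P(next pattern | current pattern).
import numpy as np

def transition_matrix(label_seq, n_patterns):
    """Count bigram transitions between pattern labels and normalize rows."""
    counts = np.zeros((n_patterns, n_patterns))
    for cur, nxt in zip(label_seq[:-1], label_seq[1:]):
        counts[cur, nxt] += 1
    row_sums = counts.sum(axis=1, keepdims=True)
    row_sums[row_sums == 0] = 1.0  # avoid division by zero for unseen patterns
    return counts / row_sums

# Example: a toy sequence over 5 hypothetical prosodic patterns.
labels = [0, 2, 2, 4, 1, 0, 2, 4, 1, 3]
P = transition_matrix(labels, n_patterns=5)
print(P[2])  # distribution over which pattern tends to follow pattern 2
```

In such a model, strong off-diagonal entries would correspond to the recurring IU pairs the researchers describe, where hearing one melodic pattern makes a particular follow-up pattern much more likely.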
One of the study’s more striking findings is the contrast between spontaneous and scripted speech. When the researchers analyzed professional audiobooks, they found that scripted speech lacked the natural prosodic pairings common in conversation.
While spontaneous speech relied on short IU pairs to build meaning, audiobooks showed longer, more rigid melodic sequences. Even amateur audiobook readings had more dynamic prosody than polished professional recordings.
These differences underscore the need to study real-world conversations, not just polished texts, to understand human communication fully. They also suggest that AI systems trained only on written or scripted language will miss essential elements of how humans really talk.
Teaching AI to grasp prosody could profoundly change how machines interact with people. The researchers envision future systems that not only process words but also pick up emotional cues from the melodies in speech.
“Imagine if Siri could understand from the melody of your voice how you feel about a certain subject, what’s important to you or whether you think you know better than her, and that she could adapt her response to make it sound enthusiastic or sad,” said Weinreb.
“We already have brain implants that convert neural activity into speech for people who can’t speak. If we can teach prosody to a computer model, we’ll be adding a significant layer of human expression that robotic systems currently lack.”
Prosody may even enhance brain-computer interfaces, making synthetic speech generated from neural activity sound more human and emotionally rich.
The study acknowledges that human speech carries intrinsic noise. Everyday conversation is filled with interruptions, repairs, and overlapping voices. Clustering prosodic patterns must navigate this chaos, and perfect separation of prosodic “words” remains elusive.
Still, the researchers are optimistic. Expanding datasets, refining clustering methods, and integrating prosodic and textual data may allow future models to predict both what is said and how it is meant.
They also see opportunities to apply their methods to other languages, cultures, and specific populations, such as children or elderly speakers.
This work was made possible by a collaborative team including Dr. Dominik Freche, Dr. Erez Volk from NeuraLight Inc., Dr. Tirza Biron, and Professor David Biron from the University of Chicago.
Together, they combined expertise in physics, linguistics, neuroscience, and computer science to unveil this hidden dimension of language.
Their collective effort now points to a future where machines might not just understand words but truly hear human beings – emotion, intent, and all.
The study is published in the journal Proceedings of the National Academy of Sciences.