New tech called 'mind captioning' turns thoughts and mental images into simple text
12-07-2025

A person lies in an MRI scanner, watching rapid video clips. Outside the room, a computer tries to write sentences that reflect what is happening in that person’s mind.

The computer’s goal is to produce not just a single word like “dog” or “car,” but full sentences that spell out who is acting, what they are doing, and where it happens.

Although scientists have made progress using brain-activity scans to convert the words we think into text, translating the rich, complex images in our minds into language is still far more difficult, says lead author Tomoyasu Horikawa.

Horikawa calls this approach “mind captioning” because the system turns distinct patterns of human brain activity into short text captions.

Earlier experiments often used the label “mind reading” and focused on easier tasks, such as guessing which object someone was viewing from a short list or matching brain activity to spoken words.

Those systems might indicate that a person was looking at a face or listening to the word “house,” yet they usually could not describe an entire situation with events and relationships.

Mind captioning and fMRI

To build mind captioning, the team combined functional MRI, or fMRI, with large language models. fMRI tracks changes in blood flow across the brain over time, giving a slow but spatially detailed picture of which areas become more active.

Six volunteers lay in the scanner and watched thousands of very short video clips showing everyday scenes: people doing simple actions, objects moving around, different locations, and other common events.

While each clip played, the scanner recorded whole-brain activity, frame by frame, so the data set for each person grew very large.

Each clip came with a caption written ahead of time by human viewers, with examples such as “A man is playing guitar on a stage” or “A child is petting a dog in a yard.”

Those sentences went into a language model that converts text into a “meaning vector,” a list of numbers that captures what the sentence says.

For every volunteer, a separate decoder learned to map that person’s pattern of fMRI activity for a given clip onto the corresponding numerical representation for that clip’s caption, linking brain responses to sentence meanings.
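The paper’s exact modeling details are not spelled out here, but the basic idea, a per-person linear mapping from voxel patterns to sentence embeddings, can be sketched in a few lines of Python. Everything below, including the ridge regression, the array shapes, and the random stand-in data, is an illustrative assumption rather than the study’s actual code.

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)

# Hypothetical shapes: one row per video clip.
# X: preprocessed fMRI voxel patterns for one volunteer (n_clips x n_voxels).
# Y: "meaning vectors" for the clips' captions (n_clips x embedding_dim),
#    produced by a sentence-embedding language model (not shown here).
X = rng.standard_normal((500, 2000))   # stand-in for brain data
Y = rng.standard_normal((500, 768))    # stand-in for caption embeddings

# A per-subject decoder: a regularized linear map from brain activity to the
# caption's meaning vector. The regularization strength is illustrative.
decoder = Ridge(alpha=1000.0)
decoder.fit(X, Y)

# At test time, a new brain pattern is mapped to a predicted meaning vector.
x_new = rng.standard_normal((1, 2000))
predicted_meaning = decoder.predict(x_new)   # shape (1, 768)
```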

Teaching the model to write

Once the decoder could turn brain activity into these numerical meaning representations, the system still needed to turn them back into readable language.

A second language model handled this step by starting from almost no text – sometimes just a placeholder token – and proposing an initial sentence.

The system checked how closely the meaning of that sentence matched the meaning representation predicted from the brain data.

It then repeatedly masked out some words and rewrote them, keeping versions that fit the decoded meaning better and gradually shaping a more coherent sentence.
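A toy version of that search loop is sketched below. The real system uses a masked language model to propose words and a sentence-embedding model to score meaning; here both are replaced with tiny placeholders (a hand-picked vocabulary and random word vectors) so that only the mask-and-rewrite idea itself is demonstrated.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins for the language models: a tiny vocabulary and random
# word vectors. Nothing here reflects the study's actual models.
vocab = ["a", "man", "dog", "child", "is", "playing", "petting",
         "guitar", "ball", "on", "in", "stage", "yard", "the"]
word_vecs = {w: rng.standard_normal(16) for w in vocab}

def sentence_vector(words):
    """Toy 'meaning vector': the average of the word vectors."""
    return np.mean([word_vecs[w] for w in words], axis=0)

def similarity(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Pretend this came from the fMRI decoder: the decoded meaning of the clip
# "a man is playing guitar on a stage".
target = sentence_vector("a man is playing guitar on a stage".split())

# Start from a placeholder sentence, repeatedly mask one position, try
# candidate words, and keep any change that moves the sentence's meaning
# closer to the decoded target.
sentence = ["a", "dog", "is", "petting", "ball", "in", "the", "yard"]
for step in range(200):
    pos = rng.integers(len(sentence))
    best_word = sentence[pos]
    best_score = similarity(sentence_vector(sentence), target)
    for candidate in vocab:
        trial = sentence[:pos] + [candidate] + sentence[pos + 1:]
        score = similarity(sentence_vector(trial), target)
        if score > best_score:
            best_word, best_score = candidate, score
    sentence[pos] = best_word

print(" ".join(sentence))
```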

Results and accuracy

The sentences that emerged were far from perfect, yet they often came close to the original captions for the clips. The system sometimes misidentified specific objects.

In one instance, for example, the system called an animal a “wolf” when it was actually a dog, but it still captured the main action and structure of the scene, such as an animal chasing something or a person holding an object.

To test performance more strictly, the researchers used only the generated text to pick which video a person was watching from a group of candidate clips.

The system chose the correct clip far more often than chance and outperformed earlier methods based on simpler representations.
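Conceptually, that identification test reduces to a nearest-neighbor comparison between embeddings: embed the generated description, embed each candidate clip’s reference caption, and pick the closest match. The sketch below fakes the embeddings with random vectors; only the comparison logic mirrors the kind of test described.

```python
import numpy as np

rng = np.random.default_rng(1)

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical setup: the sentence generated from brain activity and the
# reference captions of N candidate clips have each been embedded with the
# same sentence-embedding model. The embeddings are faked here with random
# vectors, plus a nudge toward the true clip so the example resolves.
n_candidates, dim = 100, 768
candidate_embeddings = rng.standard_normal((n_candidates, dim))
true_index = 42
generated_embedding = candidate_embeddings[true_index] + 0.5 * rng.standard_normal(dim)

# Identification: choose the candidate caption whose meaning is closest to
# the generated description; chance level would be 1 / n_candidates.
scores = [cosine(generated_embedding, c) for c in candidate_embeddings]
chosen = int(np.argmax(scores))
print("chosen clip:", chosen, "| correct:", chosen == true_index)
```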

Reading out remembered scenes

After the training phase, volunteers stayed in the scanner and silently recalled specific clips they had seen earlier, without any visual input on the screen.

The same decoding pipeline took the fMRI data from this recall period and produced descriptions that matched the remembered clips better than unrelated ones.

Accuracy dropped compared with when volunteers were actually watching the clips, but it remained clearly above chance. This shows the method can capture internally generated mental content, not just immediate sensory input.

Where the brain stores meaning

Horikawa also examined where in the brain the decodable patterns could be found.

The method still worked, even when traditional language areas were left out of the analysis. This result points to high-level visual and parietal regions as carrying rich information about the meaning of scenes.

Models that focused mainly on visual details, such as shapes and textures, fit early sensory areas better. Conversely, models that used language-based semantic features matched activity in those higher regions more closely.

The data suggests that those areas care more about concepts and relationships than about raw appearance.
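Comparisons like this are typically made with encoding models: fit each feature set to a region’s activity and see which one predicts that activity better on held-out data. Below is a minimal sketch of that idea, assuming a ridge-regression fit and random stand-in data; it is not the study’s analysis code.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_predict

rng = np.random.default_rng(2)

# Hypothetical feature sets for the same clips: low-level visual features
# (shapes, textures) versus language-based semantic features, plus fMRI
# responses from one brain region. All arrays are random placeholders.
n_clips = 500
visual_features = rng.standard_normal((n_clips, 100))
semantic_features = rng.standard_normal((n_clips, 100))
region_activity = rng.standard_normal((n_clips, 200))   # voxels in one region

def encoding_fit(features, activity):
    """Cross-validated fit: how well do these features predict this region?"""
    predicted = cross_val_predict(Ridge(alpha=100.0), features, activity, cv=5)
    # Mean correlation between predicted and measured activity across voxels.
    corrs = [np.corrcoef(predicted[:, v], activity[:, v])[0, 1]
             for v in range(activity.shape[1])]
    return float(np.mean(corrs))

print("visual model fit:  ", encoding_fit(visual_features, region_activity))
print("semantic model fit:", encoding_fit(semantic_features, region_activity))
```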

Future of mind captioning

For basic neuroscience, mind captioning opens a path to study how the brain represents complex events and thoughts at the level of detailed sentences.

For medicine and technology, it hints at future tools for people who cannot speak or move. A decoder trained for an individual and paired with sensors that record brain activity could let at least part of that person’s internal experience reach the outside world as text.

The mind captioning decoder does not pull hidden secrets from the mind or understand a person the way a close friend does. Yet it already turns complex patterns of neural activity into structured language, something that has only recently become possible with this combination of brain imaging and modern language models.

The full study was published in the journal Science Advances.
