Technical note

Introducing Thinking-in-Modalities with TerraMind

Figure 1: Comparison between a crop classification model with standard fine-tuning (left) and with Thinking-in-Modalities fine-tuning (right) in the IBM Geospatial Studio. The field in the middle of the image is only predicted as wheat by the TiM model.

TerraMind is a dual-scale, any-to-any foundation model for Earth observation, developed by IBM Research, ESA Φ-lab, and Jülich Supercomputing Centre. It learns joint token- and pixel-level representations across nine modalities and outperforms other models on the PANGAEA benchmark. The model can generate any missing modality as an intermediate step using compact tokens rather than full feature maps — an ability called Thinking in Modalities (TiM).

We tested TiM with the South Africa crop-type dataset from GEO-Bench and observed higher performance and improved visual results (Figure 1 above). While the wheat field in the image center is not detected by the standard fine-tuned model (left), the same model fine-tuned and operated in TiM mode (right) correctly identifies the crop class (purple).

How does TiM work?

During pre-training, TerraMind learns correlations between satellite images such as Sentinel-1 and Sentinel-2 and other modalities such as land-cover (LULC) maps or the normalized difference vegetation index (NDVI). TiM exploits these correlations to enhance prediction performance on downstream tasks: during TiM fine-tuning or inference, the model pauses for a moment, imagines a helpful but absent layer, appends the imagined tokens to its own input sequence, and then lets the fine-tuned encoder continue (Figure 2). Because the imagination lives in token space, we avoid the heavy diffusion decoding required by full image synthesis.
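
To make the idea concrete, here is a minimal, purely illustrative sketch of a TiM forward pass. The module and method names (encode_tokens, tim_generator, head) are hypothetical placeholders, not the TerraTorch API; the sketch only shows how generated tokens are appended to the input sequence.

import torch

def tim_forward(encoder, tim_generator, head, s1_image):
    """Illustrative TiM forward pass (hypothetical names, not the TerraTorch API)."""
    # 1. Tokenize / encode the available input modality (e.g. Sentinel-1).
    input_tokens = encoder.encode_tokens(s1_image)           # (B, N, D)

    # 2. "Think": generate compact tokens of a missing modality (e.g. LULC)
    #    instead of decoding a full image with the diffusion decoder.
    tim_tokens = tim_generator(input_tokens)                  # (B, M, D)

    # 3. Append the imagined tokens to the input sequence and continue encoding.
    combined = torch.cat([input_tokens, tim_tokens], dim=1)   # (B, N + M, D)
    features = encoder(combined)

    # 4. The fine-tuned decoder/head predicts the downstream task as usual.
    return head(features)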

Figure 2: Standard fine-tuning combines a pre-trained encoder with a task-specific decoder and head. TiM fine-tuning enhances this setup by generating intermediate tokens of other modalities, such as land-cover maps, to predict the extent of water with higher precision. Here, the LULC map is visualized only for explanation; TiM models do not decode the tokens.

On flood segmentation using the Sen1Floods11 dataset, adding a synthetic LULC layer lifts the mean intersection over union (mIoU) by about 2 percentage points (pp). On the South Africa crop-type data, fine-tuning with a generated NDVI/LULC map raises the mIoU from 41.9% to 42.7%. TiM requires multiple forward passes and processes the input tokens twice, roughly doubling runtime, but it eliminates the need to provide multiple modalities as raw inputs. In our experiments, TiM works best when the input modality has limited information content and the TiM modality adds complementary information. As a result, the benefits for optical Sentinel-2 inputs are often limited to less than 2 pp and sometimes negligible, whereas use cases with Sentinel-1 Synthetic Aperture Radar (SAR) inputs profit from generating land-cover or NDVI tokens and often gain up to 5 pp.

Table 1: Comparison between standard fine-tuning and TiM tuning using the TerraMind-base model. TiM first generates land-use land-cover maps before predicting the target classes.

Thinking-in-Modalities does not have to be confined to Earth observation (EO) use cases; TerraMind simply provides a test-bed with rich modalities such as LULC or NDVI for satellite images. Cross-modal intermediate thinking is practical wherever one modality is missing, expensive, or noisy, and TiM can be applied in other domains with a multimodal model: a model with RGB and text inputs could generate tokens representing infrared images to improve nighttime tracking, and in robotics or augmented reality, depth or 3D-skeleton tokens generated from 2D inputs could enhance spatial navigation.

Try out TiM with TerraTorch in two lines of code

All TerraMind backbones, standard and TiM-enabled, are available in TerraTorch, our fine-tuning toolkit for EO foundation models. We provide a tutorial for fine-tuning here. To turn a normal fine-tuning run into a TiM run, you only need to change the following lines in the config YAML:

backbone: terramind_v1_base_tim      # instead of terramind_v1_base
backbone_tim_modalities: [LULC]      # or S1GRD / NDVI / DEM / …

That is literally it. Training takes about twice as long, but it is still lighter than detokenizing full-resolution raster images or using larger models. For multi-step TiM, you simply list several targets in backbone_tim_modalities, as sketched below; expect further gains at the cost of longer training and inference time with more TiM modalities. We also tested our recently released TerraMind.tiny with TiM, and in some use cases it even outperformed the standard .base and .large versions while being faster, despite the extra compute required for TiM.
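
As a sketch, a multi-step TiM config could list the modalities like this (reusing the modality names from above; check the TerraTorch documentation for the exact names supported by your backbone):

backbone: terramind_v1_base_tim
backbone_tim_modalities: [LULC, NDVI]    # multi-step TiM: several generated modalities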

Watching the model ‘think’

To save compute, TiM models generate the intermediate layers as tokens rather than images. However, you can decode the tokens into images for curiosity, reporting, or debugging. A detailed tutorial can be found here; the basic steps are:

import torch
import rioxarray as rxr
import matplotlib.pyplot as plt
from terratorch import FULL_MODEL_REGISTRY

# Load the TerraMind generation model
model = FULL_MODEL_REGISTRY.build(
    'terramind_v1_base_generate',
    modalities=['S2L2A'],  # Input modalities
    output_modalities=['LULC'],  # TiM modalities
    pretrained=True,
    standardize=True,
)

# Load input
input = rxr.open_rasterio('S2_file.tif').values
input = torch.Tensor(input).unsqueeze(0)

# Run generation
with torch.no_grad():
    generated = model(input)

# Plot the generated LULC map (class index per pixel)
lulc_map = generated['LULC'].argmax(dim=1).cpu().numpy()[0]
plt.imshow(lulc_map, vmin=0, vmax=9, interpolation='nearest')
plt.show()

TerraMind was trained on images with 224x224 pixels. Still, the token prediction generalizes well to larger inputs, so you can use TiM with any input size. If you would like to run the generation on very large tiles, you can use the tiled inference implemented in TerraTorch, as demonstrated in this notebook.
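
To illustrate what tiled inference does conceptually, the simplified sketch below splits a large input into 224x224 windows, runs the generation model from the snippet above on each window, and stitches the argmax LULC predictions back together. It is a naive illustration (no overlap or blending, and it crops to full tiles); TerraTorch's own tiled inference in the linked notebook is the recommended path.

import torch

def tile_generate(model, image, tile=224):
    """Naive tiled LULC generation: no overlap/blending, crops to full tiles."""
    _, _, h, w = image.shape
    h, w = (h // tile) * tile, (w // tile) * tile           # crop to a multiple of the tile size
    out = torch.zeros(h, w, dtype=torch.long)
    for y in range(0, h, tile):
        for x in range(0, w, tile):
            window = image[:, :, y:y + tile, x:x + tile]
            with torch.no_grad():
                pred = model(window)['LULC']                 # (1, classes, tile, tile)
            out[y:y + tile, x:x + tile] = pred.argmax(dim=1)[0].cpu()
    return out

With the model and input from the snippet above, lulc_map = tile_generate(model, input) returns a stitched class map that you can plot as before.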

The following example from Singapore shows a Sentinel-1 RTC generation based on Sentinel-2 L2A input. The clouds are correctly ignored during SAR generation, revealing features such as airport runways and ships in front of the coastline (Figure 3).

Figure 3: Generation example using a Sentinel-2 L2A tile from Singapore as input (left). TerraMind generated the Sentinel-1 RTC image (right) with a tiled-inference approach. The model correctly ignores all clouds, and many features such as airport runways or ships are visible in the S-1 RTC generation.

Imagine what’s next

TiM reframes missing-data problems as imagination problems. In remote sensing, these cases can include topography-aware landslide mapping, water mask-guided ship detection, or chained estimations, such as NDVI → biomass → yield. Beyond the Earth observation domain, we expect that any multimodal vision model that learned cross-modal structure can adopt the same TiM methodology. Please test it out and let us know.

To learn more about TerraMind, visit Hugging Face or arXiv.
