Technical note

Introducing Thinking-in-Modalities with TerraMind

Figure 1: Comparison between a crop classification model with standard fine-tuning (left) and with Thinking-in-Modalities fine-tuning (right) in the IBM Geospatial Studio. The field in the middle of the image is only predicted as wheat by the TiM model.

TerraMind is a dual-scale, any-to-any foundation model for Earth observation, developed by IBM Research, ESA Φ-lab, and Jülich Supercomputing Centre. It learns joint token- and pixel-level representations across nine modalities and outperforms other models on the PANGAEA benchmark. The model can generate any missing modality as an intermediate step using compact tokens rather than full feature maps — an ability called Thinking in Modalities (TiM).

We tested TiM with the South Africa crop-type dataset from GEO-Bench and observed higher performance and improved visual results (Figure 1 above). While the wheat field in the image center is not detected by the standard fine-tuned model (left), the same model fine-tuned and operated in TiM mode (right) correctly identifies the crop class (purple).

How does TiM work?

During pre-training, TerraMind learns correlations between satellite images such as Sentinel-1 and Sentinel-2 and other modalities such as land-cover (LULC) maps or the normalized difference vegetation index (NDVI). TiM exploits these correlations to enhance prediction performance on downstream tasks: during TiM fine-tuning or inference, the model pauses for a moment, imagines a helpful but absent layer, appends the imagined tokens to its own input sequence, and then lets the fine-tuned encoder continue (Figure 2). Because the imagination lives in token space, we avoid the heavy diffusion decoding required by full image synthesis.
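
To make the idea concrete, here is a minimal, purely illustrative sketch of a TiM forward pass. The module and method names (encode_tokens, tim_generator, head) are hypothetical placeholders, not the TerraTorch API; the sketch only shows how generated tokens are appended to the input sequence.

import torch

def tim_forward(encoder, tim_generator, head, s1_image):
    """Illustrative TiM forward pass (hypothetical names, not the TerraTorch API)."""
    # 1. Tokenize / encode the available input modality (e.g. Sentinel-1).
    input_tokens = encoder.encode_tokens(s1_image)           # (B, N, D)

    # 2. "Think": generate compact tokens of a missing modality (e.g. LULC)
    #    instead of decoding a full image with the diffusion decoder.
    tim_tokens = tim_generator(input_tokens)                  # (B, M, D)

    # 3. Append the imagined tokens to the input sequence and continue encoding.
    combined = torch.cat([input_tokens, tim_tokens], dim=1)   # (B, N + M, D)
    features = encoder(combined)

    # 4. The fine-tuned decoder/head predicts the downstream task as usual.
    return head(features)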

Figure 2: Standard fine-tuning combines a pre-trained encoder with a task-specific decoder and head. TiM fine-tuning enhances this setup by generating intermediate tokens of other modalities, such as land-cover maps, to predict the extent of water with higher precision. Here, the LULC map is visualized only for explanation; TiM models do not decode the tokens.

On flood segmentation using the Sen1Floods11 dataset, adding a synthetic LULC layer lifts the mean intersection over union (mIoU) by about 2 percentage points (pp). On the South Africa crop-type data, fine-tuning with a generated NDVI/LULC map raises the mIoU from 41.9% to 42.7%. TiM requires multiple forward passes and processes the input tokens twice, roughly doubling runtime, but it eliminates the need to provide multiple modalities as raw inputs. In our experiments, TiM works best when the input modality has limited information content and the TiM modality adds complementary information. As a result, the benefits for optical Sentinel-2 inputs are often limited to less than 2 pp and sometimes negligible, whereas use cases with Sentinel-1 Synthetic Aperture Radar (SAR) inputs profit from generating land-cover or NDVI tokens and often gain up to 5 pp.

Table 1: Comparison between standard fine-tuning and TiM tuning using the TerraMind-base model. TiM first generates land-use land-cover maps before predicting the target classes.

Thinking-in-Modalities does not have to be confined to Earth observation (EO) use cases; TerraMind simply provides a test-bed with rich modalities such as LULC or NDVI for satellite images. Cross-modal intermediate thinking is practical wherever one modality is missing, expensive, or noisy, and TiM can be applied in other domains with a multimodal model: a model with RGB and text inputs could generate tokens representing infrared images to improve nighttime tracking, and in robotics or augmented reality, depth or 3D-skeleton tokens generated from 2D inputs could enhance spatial navigation.

Try out TiM with TerraTorch in two lines of code

All TerraMind backbones, standard and TiM-enabled, are available in TerraTorch, our fine-tuning toolkit for EO foundation models. We provide a tutorial for fine-tuning here. To turn a normal fine-tuning run into a TiM run, you only need to change the following lines in the config YAML:

backbone: terramind_v1_base_tim      # instead of terramind_v1_base
backbone_tim_modalities: [LULC]      # or S1GRD / NDVI / DEM / …

That is literally it. Training takes about twice as long, but it is still lighter than detokenizing full-resolution raster images or using larger models. For multi-step TiM, you simply list several targets in backbone_tim_modalities, as sketched below; expect further gains at the cost of longer training and inference time with more TiM modalities. We also tested our recently released TerraMind.tiny with TiM, and in some use cases it even outperformed the standard .base and .large versions while being faster, despite the extra compute required for TiM.
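
As a sketch, a multi-step TiM config could list the modalities like this (reusing the modality names from above; check the TerraTorch documentation for the exact names supported by your backbone):

backbone: terramind_v1_base_tim
backbone_tim_modalities: [LULC, NDVI]    # multi-step TiM: several generated modalities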

Watching the model ‘think’

To save compute, TiM models generate the intermediate layers as tokens rather than images. However, you can decode the tokens into images for curiosity, reporting, or debugging. A detailed tutorial can be found here; the basic steps are:

import torch
import rioxarray as rxr
import matplotlib.pyplot as plt
from terratorch import FULL_MODEL_REGISTRY

# Load the TerraMind generation model
model = FULL_MODEL_REGISTRY.build(
    'terramind_v1_base_generate',
    modalities=['S2L2A'],  # Input modalities
    output_modalities=['LULC'],  # TiM modalities
    pretrained=True,
    standardize=True,
)

# Load input
input = rxr.open_rasterio('S2_file.tif').values
input = torch.Tensor(input).unsqueeze(0)

# Run generation
with torch.no_grad():
    generated = model(input)

# Plot the generated LULC map (class index per pixel)
lulc_map = generated['LULC'].argmax(dim=1).cpu().numpy()[0]
plt.imshow(lulc_map, vmin=0, vmax=9, interpolation='nearest')
plt.show()

TerraMind was trained on images with 224x224 pixels. Still, the token prediction generalizes well to larger inputs, so you can use TiM with any input size. If you would like to run the generation on very large tiles, you can use the tiled inference implemented in TerraTorch, as demonstrated in this notebook.
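
To illustrate what tiled inference does conceptually, the simplified sketch below splits a large input into 224x224 windows, runs the generation model from the snippet above on each window, and stitches the argmax LULC predictions back together. It is a naive illustration (no overlap or blending, and it crops to full tiles); TerraTorch's own tiled inference in the linked notebook is the recommended path.

import torch

def tile_generate(model, image, tile=224):
    """Naive tiled LULC generation: no overlap/blending, crops to full tiles."""
    _, _, h, w = image.shape
    h, w = (h // tile) * tile, (w // tile) * tile           # crop to a multiple of the tile size
    out = torch.zeros(h, w, dtype=torch.long)
    for y in range(0, h, tile):
        for x in range(0, w, tile):
            window = image[:, :, y:y + tile, x:x + tile]
            with torch.no_grad():
                pred = model(window)['LULC']                 # (1, classes, tile, tile)
            out[y:y + tile, x:x + tile] = pred.argmax(dim=1)[0].cpu()
    return out

With the model and input from the snippet above, lulc_map = tile_generate(model, input) returns a stitched class map that you can plot as before.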

The following example from Singapore shows a Sentinel-1 RTC generation based on Sentinel-2 L2A input. The clouds are correctly ignored during SAR generation, revealing features such as airport runways and ships in front of the coastline (Figure 3).

Figure 3: Generation example using a Sentinel-2 L2A tile from Singapore as input (left). TerraMind generated the Sentinel-1 RTC image (right) with a tiled-inference approach. The model correctly ignores all clouds, and many features such as airport runways or ships are visible in the S-1 RTC generation.

Imagine what’s next

TiM reframes missing-data problems as imagination problems. In remote sensing, these cases can include topography-aware landslide mapping, water mask-guided ship detection, or chained estimations, such as NDVI → biomass → yield. Beyond the Earth observation domain, we expect that any multimodal vision model that learned cross-modal structure can adopt the same TiM methodology. Please test it out and let us know.

To learn more about TerraMind, visit Hugging Face or arXiv.
