Introducing Thinking-in-Modalities with TerraMind
TerraMind is a dual-scale, any-to-any foundation model for Earth observation, developed by IBM Research, ESA Φ-lab, and Jülich Supercomputing Centre. It learns joint token- and pixel-level representations across nine modalities and outperforms other models on the PANGAEA benchmark. The model can generate any missing modality as an intermediate step using compact tokens rather than full feature maps — an ability called Thinking in Modalities (TiM).
We tested TiM with the South Africa crop-type dataset from GEO-Bench and observed higher performance and improved visual results (Figure 1 above). While the standard fine-tuned model (left) misses the wheat field in the image center, the same model fine-tuned and run in TiM mode (right) correctly identifies the crop class (purple).
How does TiM work?
During pre-training, TerraMind learns the correlations between satellite images such as Sentinel-1 and Sentinel-2 and other modalities such as land-cover (LULC) or a vegetation index (NDVI), which improves prediction performance on downstream tasks. During TiM fine-tuning or inference, the model pauses for a moment, imagines a helpful but absent layer, appends the imagined tokens to its own input sequence, and then lets the fine-tuned encoder continue (Figure 2). Because the imagination lives in token space, we avoid the heavy diffusion decoding required by full image synthesis.
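Conceptually, a TiM forward pass is a two-step process: first generate tokens for the absent modality, then run the encoder on the extended token sequence. The sketch below is purely illustrative; the function and variable names are hypothetical and do not correspond to the TerraTorch API.
import torch

# Illustrative sketch of a TiM forward pass (hypothetical names, not the TerraTorch API)
def tim_forward(generator, encoder, decode_head, input_tokens):
    # 1. 'Imagine' the absent modality in token space, e.g. LULC tokens from SAR input
    imagined_tokens = generator(input_tokens)
    # 2. Append the imagined tokens to the original input sequence
    extended_sequence = torch.cat([input_tokens, imagined_tokens], dim=1)
    # 3. Let the fine-tuned encoder process the extended sequence
    features = encoder(extended_sequence)
    # 4. A task head turns the features into the final prediction, e.g. a segmentation map
    return decode_head(features)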
On flood segmentation with the Sen1Floods11 dataset, adding a synthetic LULC layer lifts the mean intersection over union (mIoU) by about 2 percentage points (pp). On the South Africa crop-type data, fine-tuning with a generated NDVI/LULC map raises the mIoU from 41.9% to 42.7%. TiM requires multiple forward passes and processes the input tokens twice, roughly doubling runtime, but it removes the need to provide multiple modalities as raw inputs. In our experiments, TiM works best when the input modality has limited information content and the TiM modality contributes complementary information. As a result, the benefit for optical Sentinel-2 inputs is often below 2 pp and sometimes negligible, whereas use cases with Sentinel-1 Synthetic Aperture Radar (SAR) inputs profit from generating land-cover or NDVI tokens and often gain up to 5 pp.
Thinking-in-Modalities does not have to be confined to Earth observation (EO) use cases. TerraMind simply provided a test bed with rich modalities like LULC or NDVI for satellite images. Cross-modal intermediate thinking is practical wherever one modality is missing, expensive, or noisy, and TiM can be applied in any domain with a suitable multimodal model: a model with RGB-and-text inputs could generate tokens representing infrared images to improve nighttime tracking, and in robotics or augmented reality, depth or 3D-skeleton tokens generated from 2D inputs could enhance spatial navigation.
Try out TiM with TerraTorch in two lines of code
All TerraMind backbones — standard and TiM-enabled — are available in TerraTorch, our fine-tuning toolkit for EO foundation models. We provide a tutorial for fine-tuning here. To turn a normal fine-tuning run into a TiM run, you only need to change two lines in the config YAML:
backbone: terramind_v1_base_tim # instead of terramind_v1_base
backbone_tim_modalities: [LULC] # or S1GRD / NDVI / DEM / …
That is literally it. Training takes about twice as long, but is still lighter than detokenizing full-resolution raster images or using larger models. For multi-step TiM, you simply list several targets in backbone_tim_modalities; expect further gains with more TiM modalities, at the cost of longer training and inference times. We also tested our recently released TerraMind.tiny with TiM: in some use cases it even outperformed the standard .base and .large versions, while remaining faster despite the extra compute required for TiM.
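If you prefer the Python API over a YAML config, the same TiM backbone can be built through the TerraTorch registry. In the sketch below, the tim_modalities keyword argument is an assumption inferred from the backbone_tim_modalities config key (TerraTorch forwards backbone_-prefixed keys to the backbone builder); check the tutorial for the exact signature.
from terratorch.registry import BACKBONE_REGISTRY

# Build a TiM-enabled TerraMind backbone directly; `tim_modalities` mirrors the
# `backbone_tim_modalities` YAML key above and is assumed here rather than verified
backbone = BACKBONE_REGISTRY.build(
    'terramind_v1_base_tim',
    pretrained=True,
    modalities=['S2L2A'],      # raw input modalities
    tim_modalities=['LULC'],   # modalities generated as intermediate 'thoughts'
)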
Watching the model ‘think’
To save compute, the TiM models generate the intermediate layers as tokens rather than images. However, you can decode these tokens into images for curiosity, reporting, or debugging. A detailed tutorial can be found here; the basic steps are:
import torch
import rioxarray as rxr
import matplotlib.pyplot as plt
from terratorch import FULL_MODEL_REGISTRY
# Load the TerraMind generation model
model = FULL_MODEL_REGISTRY.build(
    'terramind_v1_base_generate',
    modalities=['S2L2A'],          # Input modalities
    output_modalities=['LULC'],    # TiM modalities
    pretrained=True,
    standardize=True,
)
# Load input
input = rxr.open_rasterio('S2_file.tif').values
input = torch.Tensor(input).unsqueeze(0)
# Run generation
with torch.no_grad():
    generated = model(input)
# Plot generated layer
lulc_map = generated['LULC'].argmax(dim=1).cpu().numpy()[0]
plt.imshow(lulc_map, vmin=0, vmax=9, interpolation='nearest')
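If you want to keep the decoded map for reporting, you can also write it back to a GeoTIFF. Continuing the snippet above, a minimal sketch with rioxarray, assuming the generated map has the same spatial size and georeferencing as the input tile:
# Reuse the georeferencing of the input Sentinel-2 tile for the generated LULC map
# (assumes the generated map matches the input's spatial dimensions)
src = rxr.open_rasterio('S2_file.tif')
lulc_da = src.isel(band=0).copy(data=lulc_map.astype('uint8'))
lulc_da.rio.to_raster('generated_LULC.tif')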
TerraMind was trained on images of 224×224 pixels. Still, the token prediction generalizes well to larger inputs, so you can use TiM with any input size. If you would like to run the generation on very large tiles, you can use the tiled inference implemented in TerraTorch, as demonstrated in this notebook.
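To illustrate the idea behind tiled inference, a large scene can be cropped into 224×224 patches, generated patch by patch, and stitched back together. The sketch below is a naive version, not the TerraTorch implementation (which additionally handles padding and overlapping tiles), and it assumes 10 LULC classes, matching the colormap used above.
import torch

def generate_tiled(model, image, tile=224, n_classes=10):
    # Naive tiled generation: border remainders are skipped for brevity
    _, _, h, w = image.shape
    logits = torch.zeros(1, n_classes, h, w)
    with torch.no_grad():
        for y in range(0, h - tile + 1, tile):
            for x in range(0, w - tile + 1, tile):
                patch = image[:, :, y:y + tile, x:x + tile]
                logits[:, :, y:y + tile, x:x + tile] = model(patch)['LULC']
    return logits

# Usage: lulc_map = generate_tiled(model, input).argmax(dim=1).cpu().numpy()[0]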
The following example from Singapore shows a Sentinel-1 RTC generation based on Sentinel-2 L2A input. The clouds are correctly ignored during SAR generation, revealing features such as airport runways and ships off the coastline (Figure 3).
Imagine what’s next
TiM reframes missing-data problems as imagination problems. In remote sensing, these cases can include topography-aware landslide mapping, water mask-guided ship detection, or chained estimations, such as NDVI → biomass → yield. Beyond the Earth observation domain, we expect that any multimodal vision model that learned cross-modal structure can adopt the same TiM methodology. Please test it out and let us know.
To learn more about TerraMind, visit Hugging Face or arXiv.