Leonard-Alexander Lieske, Mario Commodo, et al.
ACS Nano
With the advent of geospatial foundation models, new and unexplored use cases are emerging that require well-curated datasets for pre-training, fine-tuning, and inference. Currently, no standardized approach exists for creating such AI-ready geospatial datasets; instead, each researcher typically develops custom scripts tailored to their specific needs. To address this gap, we introduce TerraKit, a comprehensive open-source Python library for discovering, retrieving, and processing geospatial data.
TerraKit enables users to define raster/vector annotations or a sampling scheme and to specify the desired satellites and data sources (e.g., EarthData, CDSE, GEE, Planetary Computer, STAC) through a simple configuration file. The toolkit automatically matches, downloads, and processes the relevant satellite imagery, aligns it with any provided labels, and splits it into patches based on user specifications. It also supports spatial train/val splits and exports datasets in standard formats such as WebDataset, TACO, or a structured folder format. TerraKit streamlines the pipeline from raw Earth observation (EO) data to AI-ready datasets, which accelerates the development of custom geospatial applications and keeps query and processing pipelines reproducible across the pre-training, fine-tuning, and inference steps. By lowering the barrier to entry, it empowers a wider community to leverage foundation models for Earth observation.
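To make the patching and spatial train/val split steps concrete, the following is a minimal, generic sketch in plain Python. It is not TerraKit's API: the function names `tile_windows` and `spatial_split`, their parameters, and the choice of a single vertical boundary are all assumptions made for illustration. The idea is that validation patches come from a spatially separate region, and patches straddling the boundary are dropped so that no pixels are shared between the two sets.

```python
from typing import List, Tuple

# (row_offset, col_offset) of a patch's top-left corner, in pixels.
Window = Tuple[int, int]


def tile_windows(height: int, width: int, patch: int, stride: int) -> List[Window]:
    """Return top-left offsets of all full patches that fit inside the raster.

    Hypothetical helper for illustration; partial patches at the edges are skipped.
    """
    return [
        (r, c)
        for r in range(0, height - patch + 1, stride)
        for c in range(0, width - patch + 1, stride)
    ]


def spatial_split(
    windows: List[Window], width: int, patch: int, val_fraction: float = 0.2
) -> Tuple[List[Window], List[Window]]:
    """Split patches by a vertical boundary (an assumed, simple scheme).

    Patches entirely left of the boundary go to train, patches entirely
    right of it go to val, and patches straddling it are dropped.
    """
    boundary = int(width * (1 - val_fraction))
    train = [w for w in windows if w[1] + patch <= boundary]
    val = [w for w in windows if w[1] >= boundary]
    return train, val


windows = tile_windows(height=512, width=1024, patch=256, stride=256)
train, val = spatial_split(windows, width=1024, patch=256, val_fraction=0.25)
print(len(windows), len(train), len(val))  # prints: 8 6 2
```

A random per-patch split would place neighboring, highly correlated patches in both sets; splitting by region, as sketched here, is one common way to avoid that kind of leakage.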