Pretrain Where? Investigating How Pretraining Data Diversity Impacts Geospatial Foundation Model Performance

Kaur, Amandeep; Purohit, Mirali; Muhawenayo, Gedeon; Rolf, Esther; Kerner, Hannah

arXiv:2604.21104 (cs)

[Submitted on 22 Apr 2026]

Abstract: New geospatial foundation models each introduce a new model architecture and pretraining dataset, often sampled using different notions of data diversity. Performance differences are largely attributed to the model architecture or input modalities, while the role of the pretraining dataset is rarely studied. To address this research gap, we conducted a systematic study of how the geographic composition of pretraining data affects a model's downstream performance. We created global and per-continent pretraining datasets and evaluated them on global and per-continent downstream datasets. We found that the pretraining dataset from Europe outperformed the global and other continent-specific pretraining datasets on both global and local downstream evaluations. To investigate the factors influencing a pretraining dataset's downstream performance, we analysed 10 pretraining datasets in terms of their diversity across continents, biomes, landcover and spectral values. We found that only spectral diversity was strongly correlated with performance, while the other measures were weakly correlated. This finding establishes a new dimension of diversity to account for when creating a high-performing pretraining dataset. We open-sourced 7 new pretraining datasets, pretrained models, and our experimental framework at this https URL.

Comments:	Accepted at EarthVision workshop, CVPR 2026
Subjects:	Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Cite as:	arXiv:2604.21104 [cs.CV]
	(or arXiv:2604.21104v1 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2604.21104

Submission history

From: Amandeep Kaur
[v1] Wed, 22 Apr 2026 21:43:03 UTC (1,774 KB)
