Integrating Multisource Earth Observation Data
for Soil Nutrient Prediction
Overview: Combining heterogeneous satellite data (optical, SAR, thermal, vegetation indices, etc.) with
ground soil samples requires careful spatiotemporal alignment and feature engineering. We summarize
a workflow for fusing diverse remote sensing (RS) products and ancillary data to predict soil nutrients (ppm) and limit
extrapolation error. Key steps include sensor-specific preprocessing, temporal compositing, and combining
sensors to capture complementary information 1 2 . Machine learning (ML) models (e.g. Random
Forest, Gradient Boosting, BART) are then trained with spatial cross-validation so that error estimates reflect performance in new
regions. The goal is robust prediction with low RMSE by leveraging multi-temporal summaries and
domain knowledge.
Data Fusion and Alignment
• Spatial Alignment: All data layers (Sentinel-1/2, MODIS products, DEM, soil maps) should be
reprojected and resampled onto a common grid. For example, Zhang et al. resampled SAR,
optical, LST, soil texture, and climate layers to a unified 10 m grid via cubic splines 3 .
Alternatively, fine-resolution layers can be aggregated (e.g. averaged) to the coarser grid of
covariates such as MODIS, or coarse layers can be resampled to the finer grid, depending on model design. Geostatistical approaches (e.g.
collocated block cokriging) can explicitly fuse information at different supports, correcting for
resolution differences 4 5 .
• Temporal Alignment: Different sensors have different revisit cycles (roughly 6–12 days for Sentinel-1, ~5 days for Sentinel-2,
daily MODIS acquisitions distributed as 8–16-day composites). Rather than matching each date, most studies compute
aggregated features over relevant periods. A common approach is seasonal compositing: e.g.
compute median or mean reflectance/indices for each season of the year. Zhang et al. (2024)
generated seasonal median composites of Sentinel-2 bands (blue, green, red, NIR, SWIR1, SWIR2)
for each of the three years preceding soil sampling 2 . Similarly, seasonal means of MODIS LST
or soil moisture (MOD11A1, SMAP) over the same multi-year windows were used 6 . This
captures “legacy effects” of past seasons on current soil conditions 7 . Table 1 summarizes
common alignment strategies:
| Alignment Technique | Description | Example/Ref. |
|---|---|---|
| Temporal Composites | Aggregate observations into seasonal or annual metrics (mean, median, percentile) to match sampling date. | Seasonal median of Sentinel-2 reflectance per year 2 . |
| Time-Series Statistics | Compute features from entire sensor time series (e.g. mean, std, trend, cumulative indices). | 3-year summaries of MODIS LST and precipitation 8 . |
| Sampling-Date Snapshots | Use imagery closest to soil sampling (often post-harvest to minimize vegetation masking) 9 . | Landsat-5 image at sampling time (post-harvest) 9 . |
| Sensor Fusion (SAR + Optical) | Combine SAR backscatter with optical indices (e.g. NDVI) to exploit complementary soil/moisture sensitivity 10 . | Object-based S1+S2 regression gave RMSE 4.94% vs 6.41% using single sensor 10 . |
| Spatio-Temporal Fusion (STF) | (Optional) Use algorithms like STARFM/ESTARFM to fuse fine/coarse imagery (usually for generating high-res time series). | Weight-based fusion of Landsat/MODIS for vegetation 11 (framework reviewed). |
| Resampling/Co-kriging | Resample coarse features to fine grid (or vice versa) or use block kriging to reconcile spatial support 4 . | Zhang et al. upsampled all covariates to 10 m 3 . |
Citation: Temporal compositing and multi-year feature summaries are widely recommended for soil
property modeling 2 8 . Object-based fusion of Sentinel-1 and -2 data has improved soil moisture
retrieval 10 and is analogous to fusing SAR/optical for soil nutrients. Resampling disparate grids to a
common support (or using geostatistical fusion like block-cokriging 4 ) ensures spatial consistency.
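As an illustration of the seasonal compositing described above, the following minimal sketch groups dated observations of a single pixel into (year, season) bins and takes the median of each bin. The season boundaries, the rule assigning December to the following year's winter, and the sample values are assumptions made for the example, not the scheme used by Zhang et al.

```python
from collections import defaultdict
from datetime import date
from statistics import median

# Assumed season definition (meteorological seasons, Northern Hemisphere).
SEASONS = {12: "winter", 1: "winter", 2: "winter",
           3: "spring", 4: "spring", 5: "spring",
           6: "summer", 7: "summer", 8: "summer",
           9: "autumn", 10: "autumn", 11: "autumn"}


def seasonal_medians(observations):
    """Return {(year, season): median value} for (date, value) pairs.

    Values of None (e.g. cloud-masked acquisitions) are skipped.
    December is assigned to the winter of the following year.
    """
    groups = defaultdict(list)
    for day, value in observations:
        if value is None:
            continue
        year = day.year + 1 if day.month == 12 else day.year
        groups[(year, SEASONS[day.month])].append(value)
    return {key: median(values) for key, values in groups.items()}


# Hypothetical NDVI observations for one pixel; None marks a cloudy date.
obs = [
    (date(2022, 6, 3), 0.61), (date(2022, 7, 8), 0.72),
    (date(2022, 8, 12), None), (date(2022, 8, 27), 0.66),
    (date(2022, 12, 20), 0.21), (date(2023, 1, 14), 0.19),
]
composites = seasonal_medians(obs)
```

The same grouping can be repeated for each of the years preceding a sample date to build multi-year features; in a tropical setting the season table would likely be replaced by wet/dry or crop-calendar periods.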
• Preprocessing and Quality Control: Apply radiometric corrections, cloud masking, and BRDF
corrections where appropriate. For Sentinel-2, use cloud flags or QA layers to filter contaminated
pixels. The MODIS MCD43A4 product provides BRDF-corrected surface reflectance, which can
reduce illumination/view angle effects when comparing dates. Sentinel-1 backscatter should be
calibrated to sigma-naught and, if necessary, terrain-corrected for incidence angle. In practice,
poorly sampled dates can be gap-filled via interpolation or omitted when aggregating. Metadata
(e.g. Sentinel-2 cloud %, orbital parameters) can itself be used as features or quality metrics.
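The masking and aggregation steps can be combined when bringing a fine-resolution layer onto a coarser grid. The sketch below averages non-overlapping blocks of a 2-D array while ignoring pixels flagged as unusable; it assumes the two grids are already co-registered and that the coarse cell size is an integer multiple of the fine one. The function name `block_mean` and the toy array are hypothetical.

```python
import numpy as np


def block_mean(fine, factor, valid=None):
    """Average a 2-D array over non-overlapping factor x factor blocks.

    `valid` is an optional boolean mask (True = usable pixel). Masked or
    non-finite pixels are ignored; blocks with no usable pixel become NaN.
    """
    fine = np.asarray(fine, dtype=float)
    rows, cols = fine.shape
    if rows % factor or cols % factor:
        raise ValueError("array shape must be divisible by factor")
    if valid is None:
        valid = np.ones(fine.shape, dtype=bool)
    valid = np.asarray(valid, dtype=bool) & np.isfinite(fine)
    # Axes after reshape: (block row, row in block, block col, col in block).
    shape = (rows // factor, factor, cols // factor, factor)
    sums = np.where(valid, fine, 0.0).reshape(shape).sum(axis=(1, 3))
    counts = valid.reshape(shape).sum(axis=(1, 3))
    with np.errstate(invalid="ignore", divide="ignore"):
        return np.where(counts > 0, sums / counts, np.nan)


# Toy 4x4 "fine" layer with one cloud-flagged pixel, aggregated by 2.
fine = np.arange(16, dtype=float).reshape(4, 4)
valid = np.ones((4, 4), dtype=bool)
valid[0, 0] = False
coarse = block_mean(fine, 2, valid)
```

Real workflows would normally delegate reprojection and resampling to a raster library; this sketch only shows the arithmetic of a mask-aware block average.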
Feature Engineering and Aggregation
• Vegetation Indices: Compute indices that correlate with biomass, residue, or vegetative cover,
which influence nutrient dynamics. Common indices include NDVI, EVI (enhanced vegetation
index), NDWI (normalized difference water index), and soil-adjusted or angle-adjusted vegetation indices. For Sentinel-2,
Zhang et al. computed indices from B2, B3, B4, B8, B11, B12 in seasonal composites 2 . MODIS
MOD13Q1 already provides 16-day NDVI/EVI; these can complement Sentinel-derived indices
where clouds persist. Derived indices capture greenness and phenology (e.g. peak NDVI, time of
peak) which relate to crop uptake and soil exposure.
• SAR Features: From Sentinel-1’s VH and VV backscatter, derive polarization ratios (e.g. VH/VV) or
differences, which can enhance sensitivity to surface roughness and moisture. Compute
statistics (mean, std) of backscatter over seasons, or perform texture analyses on the SAR image
to capture structural variations. Prior studies show that combining SAR backscatter with optical
indices yields better soil property estimates than either alone 10 . For instance, Attarzadeh et al.
found the SAR+optical object-based approach markedly reduced RMSE in soil moisture retrieval
10 .
• Thermal and Moisture Metrics: Land-surface temperature (MOD11A1) and evapotranspiration
(MOD16A2) capture environmental stress and water balance. Aggregate daily LST into seasonal
means and extremes around the growing season 6 . Summing or averaging the 8-day
evapotranspiration (AET/PET) over the crop cycle yields a proxy for water use and potential
nutrient flushing. Including such agro-climate summaries helps account for variable nutrient
mineralization and mobilization conditions.
• Static and Ancillary Features: Include topography, soil texture, climate normals and other
static predictors. Digital Elevation Model (DEM) derivatives (elevation, slope, aspect) often
correlate with soil formation processes. Soil texture or parent material (e.g. from SoilGrids) are
strong controls on nutrient retention 12 . Bioclimatic variables from WorldClim (temperature,
precipitation means/extremes) can serve as proxies for long-term weather patterns 13 . These
features fill spatial gaps where RS data are absent and reduce extrapolation risk by providing
location-specific context. As Zhang et al. demonstrated, combining climate, terrain, soil, and
multi-sensor RS yielded the most important predictors for SOC 1 14 .
• Temporal Derivatives and Anomalies: Besides raw values, compute change metrics (e.g.
difference between pre- and post-season indices) to capture management effects (tillage,
harvest). Phenological metrics (start/end of season, peak greenness date) can be derived via
time-series analysis (e.g. fitting double logistic curves or using the time-series of NDVI). Such
dynamic features may relate to nutrient uptake or residue cover. If multiple years of imagery are
available, include year-to-year variations to capture unusual conditions (drought year vs wet
year).
• Feature Selection and Dimensionality Reduction: With dozens or hundreds of potential
features, use selection techniques to avoid overfitting. Tree-based models inherently rank
variable importance, but explicit methods like Recursive Feature Elimination (RFE) with SVM or
variable selection in BART can prune redundant predictors 15 . Lachgar et al. used SVM-based RFE
and BART variable selection to narrow ~42 predictors to the 4–6 most influential in each zone
15 . Such selection, or principal component analysis, may improve model parsimony and
generalization.
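Two of the features above reduce to simple band arithmetic, sketched here under stated assumptions: NDVI is computed as (NIR - Red) / (NIR + Red) from reflectance values, and the VH/VV polarization ratio is expressed in decibels, where a ratio of linear backscatter becomes a difference. The function names and sample values are hypothetical.

```python
import numpy as np


def ndvi(nir, red):
    """Normalized difference vegetation index; NaN where NIR + Red is zero."""
    nir = np.asarray(nir, dtype=float)
    red = np.asarray(red, dtype=float)
    denom = nir + red
    with np.errstate(invalid="ignore", divide="ignore"):
        return np.where(denom != 0, (nir - red) / denom, np.nan)


def cross_pol_ratio_db(vh_linear, vv_linear):
    """VH/VV ratio in dB from linear-scale (sigma-naught) backscatter."""
    vh_db = 10.0 * np.log10(np.asarray(vh_linear, dtype=float))
    vv_db = 10.0 * np.log10(np.asarray(vv_linear, dtype=float))
    return vh_db - vv_db


# Hypothetical seasonal-composite reflectances for two pixels.
nir = np.array([0.45, 0.30])
red = np.array([0.05, 0.20])
pixel_ndvi = ndvi(nir, red)            # approximately [0.8, 0.2]

# Hypothetical linear backscatter for one pixel.
ratio_db = cross_pol_ratio_db(0.01, 0.1)  # approximately -10 dB
```

Change metrics such as the difference between pre- and post-season indices follow directly by subtracting two such arrays computed from different composites.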
Modeling Techniques and Best Practices
• Model Choice: Nonlinear regression models are standard in soil mapping. Random Forest (RF)
and gradient boosting (e.g. XGBoost, LightGBM) often yield strong performance with large
feature sets. These ensemble trees handle nonlinearity and interactions without much tuning.
Zhang et al. used RF and XGBoost (as well as neural nets) for SOC and found them effective 16 .
Bayesian Additive Regression Trees (BART) have gained attention for providing uncertainty
estimates; in one study BART outperformed SVM and ordinary kriging in predicting soil
phosphorus, yielding lower RMSE (∼1.9 ppm vs 5–7 ppm) 17 . Support Vector Regression (SVR)
and Neural Networks (NNs) (MLP) can also model complex relations but may require more
tuning or regularization. Table 2 summarizes common options:
| Algorithm | Notes | Example/Ref. |
|---|---|---|
| Random Forest (RF) | Ensemble of decision trees; robust to overfitting; outputs feature importance. | Widely used in digital soil mapping. |
| Gradient Boosting (XGBoost, etc) | Sequential tree boosting; often achieves top accuracy; needs hyperparameter tuning. | Used by Zhang et al. for SOC prediction 16 . |
| Bayesian Additive Regression Trees (BART) | Tree ensemble with Bayesian framework; provides credible intervals. | Outperformed SVM and kriging for soil P (RMSE ≈1.9 ppm) 17 . |
| Support Vector Regression (SVR) | Kernel-based regression; effective for small- to mid-size datasets. | Competitively used for soil TN 18 . |
| Artificial Neural Network (ANN) | Flexible function approximator; can capture deep nonlinearities; benefits from large data and tuning. | One branch in SOC study (A3-ANN) 16 . |
| Kriging / Regression Kriging | Geostatistical prediction exploiting spatial autocorrelation; can incorporate covariates (regression kriging). | Lachgar et al. applied ordinary kriging as baseline 19 . |
| Ensemble/Stacking | Combines multiple models (e.g. RF+GBDT) to leverage strengths of each. | Often improves accuracy if tuned properly. |
| Feature Selection (RFE, PCA) | Reduces dimensions to avoid overfitting. | Lachgar et al. reduced ~42 predictors to ~6 using RFE/BART 15 . |
Citation: Ensemble trees (RF, XGBoost) are routinely effective for soil property modeling, especially with
high-dimensional covariates 16 . Incorporating geostatistics (kriging) can handle residual spatial
structure, though pure kriging often underperforms ML when many predictors are available 17 .
Feature-selection methods like RFE or BART’s variable selection help focus on the most relevant
predictors 15 .
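Before running a model-based selector such as RFE or BART variable selection, a cheaper screening step is to drop near-duplicate covariates. The sketch below is a greedy correlation filter, which is a simpler technique than either method used by Lachgar et al. and is shown only as an assumption-laden illustration; the threshold of 0.9, the function name, and the synthetic features are hypothetical.

```python
import numpy as np


def drop_collinear(X, names, threshold=0.9):
    """Greedily keep features whose absolute Pearson correlation with every
    already-kept feature is at or below `threshold`.

    Columns are visited left to right, so earlier columns win over later
    near-duplicates.
    """
    corr = np.abs(np.corrcoef(np.asarray(X, dtype=float), rowvar=False))
    kept = []
    for j in range(corr.shape[0]):
        if all(corr[j, k] <= threshold for k in kept):
            kept.append(j)
    return [names[j] for j in kept]


# Synthetic covariates: EVI is built as a near-duplicate of NDVI.
rng = np.random.default_rng(0)
ndvi_mean = rng.normal(size=200)
evi_mean = 0.8 * ndvi_mean + rng.normal(scale=0.01, size=200)
slope = rng.normal(size=200)
X = np.column_stack([ndvi_mean, evi_mean, slope])

selected = drop_collinear(X, ["ndvi_mean", "evi_mean", "slope"])
```

Because the filter ignores the target variable, it only removes redundancy; ranking the remaining predictors by relevance still needs a model-based method.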
• Hyperparameter Tuning: Use cross-validation (CV) to tune model parameters (number of trees,
depth, learning rate, etc.). Given spatial data, one should use spatially explicit CV (e.g. block CV or
leave-one-area-out) to avoid inflated performance. For example, in Lachgar et al. fields were split
into east/west zones for validation 20 (mimicking holding out new fields). Hyperparameter grids
or Bayesian optimization can optimize performance while avoiding overfitting.
• Data Fusion in Modeling: Some approaches fuse data within the model. For example, one could
stack ML with geostatistics (regression kriging), or use multi-output/multi-task learning if
nutrients are correlated. Deep learning methods (e.g. LSTM, CNN) can in principle ingest raw
time series or imagery, but require large training sets and are less common in soil mapping
currently. Transfer learning (pretraining on one region, then fine-tuning on another) is an advanced option if test regions
differ greatly, but the available soil samples are likely too few to train CNNs.
• Uncertainty Estimation: For decision-making (e.g. nutrient management), quantify prediction
uncertainty. BART inherently gives posterior intervals. Alternatively, use quantile regression
forests, prediction intervals from ensemble spread, or Monte Carlo dropout in NNs.
Geostatistical cokriging naturally provides error variance 4 . Mapping uncertainty helps identify
where extrapolation is risky.
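One of the options above, prediction intervals from ensemble spread, can be sketched as follows. The code assumes the per-member predictions (for example, from individual trees or from models trained on bootstrap resamples) are already available as an array; it also computes interval coverage on held-out samples. Function names and numbers are hypothetical, and spread across members reflects model disagreement only, so such intervals may be narrower than the true predictive uncertainty.

```python
import numpy as np


def ensemble_interval(member_preds, lower=5.0, upper=95.0):
    """Mean prediction and empirical percentile bounds across ensemble members.

    member_preds has shape (n_members, n_samples).
    """
    preds = np.asarray(member_preds, dtype=float)
    mean = preds.mean(axis=0)
    lo = np.percentile(preds, lower, axis=0)
    hi = np.percentile(preds, upper, axis=0)
    return mean, lo, hi


def interval_coverage(y_true, lo, hi):
    """Fraction of observed values that fall inside [lo, hi]."""
    y = np.asarray(y_true, dtype=float)
    return float(np.mean((y >= lo) & (y <= hi)))


# Hypothetical predictions (ppm) from three members for two samples.
members = np.array([[10.0, 20.0],
                    [12.0, 22.0],
                    [14.0, 24.0]])
mean, lo, hi = ensemble_interval(members, lower=0.0, upper=100.0)
coverage = interval_coverage([11.0, 30.0], lo, hi)
```

Comparing the observed coverage with the nominal level (here the full min-max range of the members) on spatially held-out data is one way to check whether the intervals are usable for decision-making.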
Validation Strategy and Extrapolation Mitigation
• Spatial Cross-Validation: Because the test strip lies outside the training footprint, use a
validation scheme that mimics this spatial gap. Rather than random split, implement block or
cluster CV (e.g. leave-one-farm/zone-out). Lachgar et al. clustered fields by region to train/test
20 , which is analogous to leaving out a spatial block. This yields more realistic error estimates
than point-wise CV. Validation metrics should include RMSE (the metric used for final evaluation) and R², but
also assess bias and coverage of uncertainty intervals.
• Feature Space Comparison: Before modeling, compare training vs test feature distributions. If
certain covariate ranges in the test area are not seen in training, predictions will be poor. Where
possible, expand training to include similar conditions (e.g. nearby regions or older samples).
Domain adaptation techniques (reweighting samples, adding location-specific terms) can
partially address shift.
• Geographic Encoding: Including coordinates (lat/lon) or spatial eigenvectors in the model can
help capture large-scale trends. For example, regression kriging uses sample
locations to model residual spatial structure. However, coordinates used as predictors can let a flexible model memorize
the locations of known samples, which may degrade predictions outside the training footprint.
• Ancillary Data for Extrapolation: Since the test strip has a different “domain,” ensure the model
has input features that distinguish it. For instance, if climate or soil type differs, include those
static maps. If crop management differs, include irrigation or crop calendar data if available. In
absence of data, regularize the model to be less sensitive to fine-scale noise (e.g. limiting tree
depth).
• Geostatistical Enhancement: If spatial autocorrelation is strong, a hybrid approach can help:
first predict with ML, then krige the residuals using neighboring training points (regression-
kriging). Bezerra et al. mention using 3D geostatistics (coefficient of coregionalization) to fuse
multiple supports 5 . While complex, such multivariate geostatistics can reduce unexplained
spatial variance and give more coherent maps.
• Progressive Validation: If possible, perform a pilot prediction on the test strip (without true
labels) to check for anomalies such as out-of-range values or abrupt spatial discontinuities. Alternatively, cross-check with existing soil maps or field experts.
If implausible predictions are found, revisit feature selection or consider region-specific models.
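The block cross-validation recommended in this section can be sketched with a simple grid-based fold assignment. The code assumes sample coordinates are in a projected CRS (metres) and that square blocks are an acceptable stand-in for farms or zones; the block size, the round-robin fold rule, and the function names are hypothetical choices, and no buffer is left between training and test blocks.

```python
import numpy as np


def spatial_block_folds(x, y, block_size, n_folds):
    """Assign each sample to a CV fold based on the square grid block it
    falls in, so that all samples in one block share a fold."""
    bx = np.floor(np.asarray(x, dtype=float) / block_size).astype(int)
    by = np.floor(np.asarray(y, dtype=float) / block_size).astype(int)
    blocks = np.stack([bx, by], axis=1)
    _, block_id = np.unique(blocks, axis=0, return_inverse=True)
    # ravel() guards against NumPy versions that return a 2-D inverse here.
    return block_id.ravel() % n_folds


def fold_splits(folds):
    """Yield (train_idx, test_idx) index arrays, one pair per fold."""
    for k in np.unique(folds):
        yield np.flatnonzero(folds != k), np.flatnonzero(folds == k)


# Five hypothetical samples along a transect, 10 m blocks, two folds.
x = [0.0, 5.0, 12.0, 18.0, 25.0]
y = [0.0, 0.0, 0.0, 0.0, 0.0]
folds = spatial_block_folds(x, y, block_size=10.0, n_folds=2)
splits = list(fold_splits(folds))
```

Samples 0 and 1 share a block and therefore a fold, which is the property that prevents near-duplicate neighbours from appearing on both sides of a split. Where farm or zone identifiers exist, they can replace the grid block as the grouping key.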
Summary Workflow
1. Compile and Preprocess Data: Gather all RS layers and static maps. Reproject to common CRS.
Apply corrections (atmospheric, cloud mask, BRDF). Resample to a common grid (or define multi-
scale support).
2. Link to Ground Samples: Match each soil-sample PID to nearest RS pixels. Record the sample
date.
3. Temporal Feature Engineering: Around each sample date, compute summaries: e.g. for each
sensor, derive seasonal composites or statistics as appropriate (medians, means, max NDVI, LST,
etc.) 2 21 . For multi-year context, include similar summaries for prior years (capturing legacy
effects) 8 6 .
4. Spatial/Context Features: Extract elevation, slope, soil texture, long-term climate normals at
each sample location.
5. Feature Selection: Optionally reduce feature set via importance ranking or correlation analysis
to remove collinear or uninformative variables. Methods like RFE can narrow down predictors
15 .
6. Model Training: Use ML regression (RF, XGBoost, BART, etc.) to fit nutrients. Tune
hyperparameters with spatially-aware CV (block CV or leave-one-area-out) to approximate test
conditions 20 .
7. Model Evaluation: Evaluate with hold-out data (or pseudo-strips) using RMSE and R². Check
residual spatial patterns; if strong, consider regression-kriging of residuals.
8. Prediction on Test Strip: Apply the finalized model to test-area covariates, generating nutrient
maps. Compute prediction uncertainty.
9. Assessment of Extrapolation: Map covariate ranges to ensure test samples lie within training
feature space. Highlight any out-of-range predictions.
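Step 9 can be approximated with a per-feature range check, sketched below. It flags test values that fall outside the range (or chosen percentiles) of the training data for each covariate. This is a univariate check under simplifying assumptions: it does not detect unusual combinations of covariates that are individually in range. The function name and data are hypothetical.

```python
import numpy as np


def out_of_range_mask(train_X, test_X, lower_q=0.0, upper_q=100.0):
    """Boolean array of shape (n_test, n_features): True where a test value
    lies outside the training [lower_q, upper_q] percentile range."""
    train_X = np.asarray(train_X, dtype=float)
    test_X = np.asarray(test_X, dtype=float)
    lo = np.percentile(train_X, lower_q, axis=0)
    hi = np.percentile(train_X, upper_q, axis=0)
    return (test_X < lo) | (test_X > hi)


# Hypothetical covariates: column 0 = NDVI, column 1 = elevation (m).
train = np.array([[0.2, 100.0],
                  [0.6, 300.0],
                  [0.4, 200.0]])
test = np.array([[0.5, 250.0],   # inside the training range
                 [0.7, 150.0],   # NDVI above the training maximum
                 [0.3, 50.0]])   # elevation below the training minimum
mask = out_of_range_mask(train, test)
extrapolated = mask.any(axis=1)  # samples with at least one flagged feature
```

Flagged samples are candidates for wider uncertainty bounds or manual review rather than automatic rejection.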
By systematically fusing multi-sensor RS data (optical, SAR, thermal, evapotranspiration) with soil and
environmental covariates, and by summarizing these data over relevant seasons/years 2 8 , the
model captures both current crop conditions and background “soil signatures.” In the cited studies, comparable approaches
reduced RMSE for soil phosphorus 17 and soil moisture 10 ; validating on held-out spatial blocks
indicates whether such gains carry over to new areas. Tables 1–2 above summarize key methods for alignment and modeling that
can be tailored to the African maize context.
Sources: Methods here are drawn from remote sensing and digital soil mapping literature (e.g. Zhang
et al. 2024 2 8 on temporal composites; Attarzadeh et al. 2018 10 on SAR-optical fusion; Lachgar et
al. 2024 17 on ML models; Bezerra et al. 2025 22 4 on temporal misalignment and cokriging). These
provide empirically-backed strategies for integrating and modeling the diverse datasets described.
1 2 3 6 7 8 12 13 14 16 21 Accurate Quantification of 0–30 cm Soil Organic Carbon in Croplands over the Continental United States Using Machine Learning
https://www.mdpi.com/2072-4292/16/12/2217
4 5 22 Fusion of Remotely Sensed Data with Monitoring Well Measurements for Groundwater Level Management
https://www.mdpi.com/2624-7402/7/1/14
9 18 Spatial Prediction of Total Nitrogen in Soil Surface Layer Based on Machine Learning
https://www.mdpi.com/2071-1050/14/19/11998
10 Synergetic Use of Sentinel-1 and Sentinel-2 Data for Soil Moisture Mapping at Plot Scale
https://www.mdpi.com/2072-4292/10/8/1285
11 Spatiotemporal Fusion of Multisource Remote Sensing Data: Literature Survey, Taxonomy, Principles, Applications, and Future Directions
https://www.mdpi.com/2072-4292/10/4/527
15 17 19 20 Implementation of Proximal and Remote Soil Sensing, Data Fusion and Machine Learning to Improve Phosphorus Spatial Prediction for Farms in Ontario, Canada
https://www.mdpi.com/2073-4395/14/4/693