Sparse Reduced-rank Regression Methods for Spatially Misaligned Data with Application to Spatial Transcriptomics
Abstract
Understanding the spatiotemporal dynamics of disease progression in relation to transcriptomic profiles provides key insights into complex conditions such as Alzheimer’s disease. To enable such investigations, STARmap PLUS technology offers joint profiling of high-resolution spatial transcriptomics and protein detection within the same tissue section. Motivated by data from Zeng et al. (2023), we develop a novel kernel-weighted regression framework that models plaque size as a collective effect of the spatial transcriptomics of neighboring cells, automatically integrating across cell types and tissue samples from different disease states. To further strengthen interpretability and efficiency, we incorporate a sparse low-rank factorization that enables gene selection while borrowing strength across genes, cell types, and time points. The proposed approach is implemented in a fully automated manner with data-driven specification of key model components. Through simulation studies, we demonstrate the robustness of the proposed method and its superiority across a range of specification scenarios. Applied to Alzheimer’s data, the proposed framework uncovers biologically meaningful associations, highlighting its potential for advancing the understanding of disease mechanisms.
Keywords: Cellular microenvironment; Kernel-weighted regression; Sparse low-rank factorization; STARmap PLUS
1 Introduction
Alzheimer’s disease (AD) comprises interacting pathologies that vary across space, cell types, and time (De Strooper and Karran, 2016; Jack Jr et al., 2018, 2024). Amyloid-β plaques and hyperphosphorylated tau accumulate within a dynamic microenvironment shaped by microglia, astrocytes, oligodendrocyte-lineage cells, and neurons (Long and Holtzman, 2019; Hampel et al., 2021). Spatially resolved studies in mouse and human reveal reproducible plaque-adjacent niches and layer-specific molecular patterns, linking microglial/astrocytic remodeling and tau-aligned neuronal changes to local pathology (Chen et al., 2020, 2022; Zeng et al., 2023; Mallach et al., 2024; Miyoshi et al., 2024). Motivated by these spatially organized niches, we adopt a lesion-centric formulation: we treat plaque centers as outcome locations and quantify how nearby cellular gene expression relates to plaque size (mean radius).
Traditional transcriptomic designs struggle to capture this heterogeneity. Bulk RNA-seq averages across mixed cell types and loses both cellular heterogeneity and the spatial context of the cells, while single-cell RNA-seq restores cellular resolution but still discards in-tissue organization. Spatially resolved transcriptomics (SRT) maps gene expression to spatial coordinates via two broad classes: (i) barcoded capture arrays (Spatial Transcriptomics/Visium) that provide transcriptome-wide profiles at multi-cell spot resolution (Ståhl et al., 2016; Vickovic et al., 2019; Maynard et al., 2021); and (ii) imaging-based assays (e.g., MERFISH (Chen et al., 2015), seqFISH (Shah et al., 2016), STARmap (Wang et al., 2018)) that achieve single-cell or subcellular resolution with targeted gene panels.
Among imaging-based SRT, STARmap PLUS (Zeng et al., 2023) co-registers targeted RNAs with amyloid-β and p-tau immunostains within the same tissue sample, yielding single-cell expression aligned to plaque segmentations and centers, and enabling direct linkage between molecular readouts and neuropathology in situ. Each tissue section contains many plaques and thousands of cells with known coordinates; plaque size varies markedly within a section, and this variability is embedded in distinct cellular neighborhoods defined by which cells lie nearby and what genes they express. This provides exactly the data structure needed for a plaque-centric analysis.
Beyond Alzheimer’s disease, co-registering histopathology with SRT enables tumor-region delineation, quantification of tissue architecture and burden, and pathology-informed disease modeling (Ni et al., 2022; Arora et al., 2023; Bassiouni et al., 2023; Xun et al., 2023; Jin et al., 2024). Related statistical questions also arise in environmental applications (Alexeeff et al., 2016). This breadth underscores the generality and translational relevance of our framework for modeling disease-related spatial outcomes from spatial molecular measurements.
The primary statistical challenges are: (1) spatial misalignment between plaque locations and SRT measurement locations, (2) the high dimensionality of SRT data, (3) effect heterogeneity across collection times and cell types, i.e., regression coefficients linking plaque size to neighborhood features vary by time point and cell type, with gene-level coefficients exhibiting shared latent structure, (4) the need for a spatially structured characterization of the collective influence of different neighboring cells on plaque size, and (5) between-sample heterogeneity in plaque-adjacent microenvironments (e.g., shifts in local cell-type composition and activation states).
For regressing spatially varying outcomes on predictors from nearby locations, spatial measurement error models have been proposed, such as two-stage correction approaches (Szpiro and Paciorek, 2013) and spatial simulation extrapolation methods (Alexeeff et al., 2016). Local/kernel regressions and geographically weighted regression (GWR) model local heterogeneity by weighting nearby observations more heavily; multiscale GWR further allows each predictor to have its own spatial scale parameter. However, they typically treat constructed predictors as error-free and do not scale or pool information effectively in high-dimensional, structured SRT settings (Nadaraya, 1964; Fotheringham et al., 2017; Fan, 2018; Oshan et al., 2019). Change-of-support and downscaler models propagate aggregation/registration uncertainty, yet classical formulations remain low-dimensional and do not accommodate high-dimensional features or latent commonalities across groups (Gotway and Young, 2002; Berrocal et al., 2010). Together, these approaches address parts of the problem but do not provide a unified, high-dimensional regression that learns predictor-specific spatial scales while quantifying the collective impact of neighboring cells. This gap motivates our framework.
In an asynchronous longitudinal regression setting, Cao et al. (2015) and Li et al. (2022, 2023) proposed a kernel-weighted approach to handle mismatched observation times between the response and covariate within a given subject. We follow a similar direction by introducing a kernel function to borrow information from all possible pairs of outcomes and predictors in the misaligned spatial setting. This particularly helps to characterize the collective effect of different cells from varying cell types on the plaque size. An isotropic proximity kernel with sample-specific bandwidths aggregates local signals, accommodating section-to-section enrichment and sampling differences without overfitting directionality. Gene effects are parameterized by a low-rank coefficient tensor over gene × cell type × time, together with gene-mode regularization to capture shared structure and heterogeneity across samples and time points under high dimensionality. To achieve interpretable gene-level discovery, we apply a LASSO penalty only on the gene mode of this factorization: genes with weak signal are shrunk toward zero, while the associated cell-type and time patterns for the retained genes remain flexible. All tuning parameters (e.g., kernel bandwidths, rank, and penalty levels) are selected in a data-driven manner; details are provided in Section 3.1. The resulting framework yields interpretable component-level profiles and signed effects, where the sign of the estimated regression coefficient indicates direction (positive = larger plaque size; negative = smaller) across cell types and time points.
Applied to STARmap PLUS AD sections, the model identifies 13-month multigene expression patterns dominated by a Microglia–Oligodendrocyte axis: a microglial Apoe/Tyrobp module and an oligodendrocyte-lineage (OPC-like) pattern are positively associated with plaque size, whereas an astrocytic transport-related signature is negatively associated with plaque size (see Section 5).
We first describe the STARmap PLUS data and exploratory analyses motivating a plaque-centric estimand and isotropic neighborhood kernel. We then develop the kernel-weighted regression (with sample-specific bandwidths and low-rank, gene-mode selection), assess performance in spatially realistic simulations, analyze AD sections, and close with limitations and extensions.
2 Data and Exploratory Analysis
2.1 Data
In this study, we use single-cell spatial transcriptomics data integrated with amyloid-β plaque histopathology from a mouse model of Alzheimer’s disease (AD). Specifically, data were obtained from the brains of TauPS2APP transgenic mice at two critical stages of disease progression: 8 months (8 mo) and 13 months (13 mo) of age, providing a temporal snapshot of AD pathology and cellular heterogeneity. Each of the four samples analyzed includes comprehensive gene expression data at single-cell resolution from all identified cells, alongside spatial localization of extracellular amyloid-β plaques.
The SRT data employed the STARmap PLUS method (Zeng et al., 2023), which simultaneously captures spatially resolved single-cell expression and protein-level histopathological markers within the same tissue section at subcellular resolution. In this study, STARmap PLUS profiled 2,766 genes while simultaneously labeling extracellular amyloid-β plaques (X-34 dye) and hyperphosphorylated tau (AT8 immunostaining), enabling precise co-registration of cell bodies, mRNA, and pathology. Although both amyloid-β and tau signals are available, the present study focuses exclusively on amyloid-β plaques. Voxel size was nm with fields of view per sample. The method preserves morphology and allows joint RNA-protein mapping at subcellular scales, unlike chip-based whole-transcriptome approaches with coarser pixels ( pixel size) (Chen et al., 2020).
In the original dataset, STARmap PLUS yielded a high-resolution spatial atlas across cortex and hippocampus and identified major cell types over k cells. Our analysis uses the diseased subset to focus specifically on plaque-proximal biology. Two features make this dataset uniquely informative for plaque-centric analyses: (i) exact spatial location of cells, gene expression, and plaque pathology within the same section, and (ii) clear distance-dependent patterns in cellular composition around plaques.
We obtain the plaque size data from Zeng et al. (2023), who first identified the plaque centers and then computed the mean radius of each plaque. Fig 1 shows co-registered spatial maps of two mouse brain tissue sections collected at (a) 8 months and (b) 13 months. Gray points denote the spatial locations of all cells, and amyloid plaques are shown as red filled circles with circle radius equal to the plaque size.
2.2 Exploring the Spatial Variability
To characterize plaque-adjacent microenvironments, we re-examined plaque-centric organization using concentric rings around plaques (0–30, 30–70, 70–150 µm) in Figs. 2 and 3. Motivated by the concentric-ring analyses of Zeng et al. (2023), who defined distances with respect to plaque boundaries, we use the Euclidean distances from the cell centroids to the nearest plaque center in our analysis to define the plaque microenvironment.
We focus on three cell types, Microglia, Astrocytes, and Oligodendrocyte-lineage cells, because both our exploratory analyses (Figs. 2 and 3) and prior work by Zeng et al. (2023) revealed robust, plaque-centric organization for these types with sufficient counts for stable stratification. Microglia are enriched at plaque centers, with a “core-shell” architecture (as defined by Zeng et al. (2023) with respect to the nearest concentric ring) closely contacting amyloid-β plaques (0–30 µm from the plaque center), while Astrocytes and Oligodendrocytes are enriched in surrounding shells (30–150 µm from the plaque center). Across sections and between 8 and 13 months, enrichment magnitudes generally increase, while beyond 150 µm no systematic enrichment is detected.
Guided by these spatial cell enrichments, we next construct bulk-like plaque-level summaries of near-plaque gene expression, stratified by plaque size, to assess descriptively whether near-plaque expression varies with plaque size (Fig 4). The nearest plaque for each cell is identified by sorting the Euclidean distances from the cell centroid to the plaque centers. A cell is considered near-plaque if its distance to the nearest plaque center is within 150 µm. For visualization only, plaque size is dichotomized at the global median radius: Small plaques have radii below the median and Large plaques have radii above it. For each gene, heatmap values in Fig 4 are computed via two-stage averaging. First, we compute a plaque-level (bulk-like) mean for each plaque as the average gene expression over all its near-plaque cells, yielding one value per (plaque, gene). Second, we average the plaque-level means within each combination of time (8 mo vs 13 mo) and plaque size (Small vs Large). This yields one value per time, size, and gene, which forms the heatmap entries.
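The two-stage averaging described above can be sketched as follows. This is a minimal illustration for a single tissue section, not the authors' code: the function names, the array layout, and the choice to assign ties at the median to the Small group are assumptions made here.

```python
import numpy as np

def plaque_level_means(cell_xy, expr, plaque_xy, cutoff=150.0):
    """Stage 1: mean expression over near-plaque cells, one row per plaque."""
    # Pairwise cell-to-plaque-center distances, shape (n_cells, n_plaques).
    dist = np.linalg.norm(cell_xy[:, None, :] - plaque_xy[None, :, :], axis=2)
    nearest = dist.argmin(axis=1)        # index of the nearest plaque per cell
    near = dist.min(axis=1) <= cutoff    # near-plaque indicator per cell
    means = np.full((plaque_xy.shape[0], expr.shape[1]), np.nan)
    for p in range(plaque_xy.shape[0]):
        members = near & (nearest == p)
        if members.any():                # plaques with no near cells stay NaN
            means[p] = expr[members].mean(axis=0)
    return means

def group_means(plaque_means, time_label, radius):
    """Stage 2: average plaque-level means within each (time, size) group."""
    # Assumption: ties at the median are assigned to "Small".
    size = np.where(radius > np.median(radius), "Large", "Small")
    out = {}
    for t in np.unique(time_label):
        for s in ("Small", "Large"):
            rows = plaque_means[(time_label == t) & (size == s)]
            if rows.size:
                out[(int(t), s)] = np.nanmean(rows, axis=0)
    return out
```

To pool several sections, one would call `plaque_level_means` per section, stack the results, and pass the stacked radii to `group_means` so that the median is global.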
Fig 4 shows: (i) the difference-in-differences for each gene’s expression, showing how the 8→13 mo change near Large plaques compares with the change near Small plaques (positive bars = larger increase or smaller decrease near Large plaques), (ii) a heatmap of the per-size change from 8 mo to 13 mo, shown separately for Small and Large, and (iii–iv) heatmaps of expression at 8 mo and 13 mo (columns = Small vs Large). Row centering (within time) is applied only to the expression heatmaps (iii–iv); the bar plot and the change heatmap show raw contrasts. Across the 30 most varying genes over time, these summaries reveal size-associated differences in near-plaque expression and motivate our regression problem below, where plaque radius is regressed on the local gene expression levels.
3 Model
We propose a novel kernel-weighted regression framework to analyze the effect of spatial transcriptomic profiles on amyloid-β plaque deposition. For clarity, we first introduce the notation used throughout the article.
Let $S$ denote the total number of tissue samples. For each sample $s = 1, \ldots, S$, let $n_s$ denote the number of outcome locations (e.g., plaques) and $\{u_{si}\}_{i=1}^{n_s}$ be their spatial coordinates. At each of these locations, we observe spatially indexed scalar outcomes, such as plaque sizes, denoted by $y_s(u_{si})$. Let $m_s$ denote the number of cells in the $s$-th sample and $\{v_{sj}\}_{j=1}^{m_s}$ be their spatial coordinates. The spatial transcriptomics data, recording the expression levels of $G$ genes, are available at individual cell locations and are denoted by $x_s(v_{sj}) \in \mathbb{R}^{G}$. Furthermore, the samples are obtained at different stages of disease progression, where $t_s \in \{1, \ldots, T\}$ represents the collection time of the $s$-th sample. Note that all tissue samples are independent, as they come from different mice. Each cell further belongs to one of $C$ cell types. For clarity, let $c_s(v_{sj}) \in \{1, \ldots, C\}$ represent the cell type associated with the cell at location $v_{sj}$ in the $s$-th sample.
Our goal is to estimate the effect of gene expression levels on plaque sizes, measured at different sample collection times. Specifically, with $G$ genes, $C$ cell types, and $T$ time points, we aim to estimate the coefficient array $\mathcal{B} \in \mathbb{R}^{G \times C \times T}$, whose $(g, c, t)$-th value $\beta_{g,c,t}$ quantifies the effect of the $g$-th gene’s expression (among the $G$ genes) in the $c$-th cell type on plaque sizes collected at the $t$-th time point. However, to run this analysis, we must take into account the following four characteristics in our proposed model:
1) The spatial locations of the cells and plaques do not match in general.
2) Each plaque is surrounded by cells with different cell types.
3) The gene-specific regression effects for different cell types may share some commonalities.
4) Samples collected at different time points are independent, but gene-specific effects may share commonalities through underlying latent factors that capture shared biological or regulatory dependencies.
In order to resolve these issues, we utilize the inherent spatial dependence in the data. Specifically, aligning with the realistic scenarios arising from integrating single-cell spatial transcriptomics data with amyloid-beta plaque data in Alzheimer’s disease, spatial proximity informs similarities in gene expression levels and biological response patterns, using sample-specific, density-adaptive neighborhoods.
Thus, to characterize the relationship between the spatially mismatched outcomes and gene-expression predictors, we first propose the following kernel-weighted objective function,
| (1) |
where $c_s(v)$ stands for the cell type of the cell present at location $v$, $t_s$ is the collection time for sample $s$, and $K_{h_s}(\cdot)$ are nonnegative kernel weights with a possibly sample-specific bandwidth $h_s$. For the kernel, we consider the Epanechnikov kernel, which assigns weights based on the Euclidean distance between a plaque location and a cell location. This compactly supported kernel determines the contribution of each observation to the local fit. The Epanechnikov kernel is a widely used choice due to its optimality in minimizing asymptotic mean integrated squared error. The unscaled kernel is defined as $K(u) = \tfrac{3}{4}(1 - u^2)_+$, where $(\cdot)_+$ denotes the positive part. Given a bandwidth parameter $h > 0$, the kernel is scaled by evaluating it at the distance divided by $h$. This formulation ensures that observations farther than $h$ units apart receive zero weight, while closer observations have smoothly decreasing influence as their distance approaches $h$. The coefficient array contains one unknown parameter for every (gene, cell type, time point) combination. Direct estimation of these parameters is computationally infeasible and fails to exploit the inherent dependencies among genes, cell types, and time. Moreover, sparsity-inducing structures are required to identify important genes. Thus, we propose a sparse reduced-rank factorization of the regression coefficient to identify important genes while borrowing information across different cell types and time.
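As a concrete illustration of the kernel weights described above, the following sketch computes Epanechnikov weights for every plaque-cell pair in one sample. The function names are hypothetical, and the division by `h` is one common normalisation that is assumed here rather than taken from the paper; it does not affect which pairs receive zero weight.

```python
import numpy as np

def epanechnikov(u):
    """Unscaled Epanechnikov kernel K(u) = 0.75 * (1 - u^2)_+."""
    u = np.asarray(u, dtype=float)
    return 0.75 * np.clip(1.0 - u ** 2, 0.0, None)

def kernel_weights(plaque_xy, cell_xy, h):
    """Weights for all plaque-cell pairs, shape (n_plaques, n_cells)."""
    # Euclidean distance between every plaque center and every cell.
    dist = np.linalg.norm(plaque_xy[:, None, :] - cell_xy[None, :, :], axis=2)
    # Assumption: 1/h normalisation; pairs farther apart than h get weight 0.
    return epanechnikov(dist / h) / h
```

With `h = 2`, a cell at distance 0 receives weight 0.375, a cell at distance 1 receives 0.28125, and cells at distance 2 or more receive exactly zero.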
Low-rank factorization of the regression coefficient:
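In order to judiciously estimate the regression coefficients, we utilize a reduced-rank factorization that leverages shared structure across cell types, time, and genes. To this end, we first arrange them as a tensor $\mathcal{B}$ with three dimensions, Genes × Cell types × Times, where the $(g, c, t)$-th entry is $\beta_{g,c,t}$, for $g = 1, \ldots, G$, $c = 1, \ldots, C$, and $t = 1, \ldots, T$. One may then apply tensor-factorization methods, such as CANDECOMP/PARAFAC (CP; Carroll and Chang, 1970; Harshman, 1970; Kiers, 2000), or more generally the Tucker decomposition (Tucker, 1966). In this work, we employ a CP decomposition of $\mathcal{B}$ with $R$ components, each formed as a scalar weight $\eta_r$ times the outer product of a gene factor $\mathbf{a}_r \in \mathbb{R}^{G}$, a cell-type factor $\mathbf{b}_r \in \mathbb{R}^{C}$, and a time factor $\mathbf{d}_r \in \mathbb{R}^{T}$:
\[ \mathcal{B} = \sum_{r=1}^{R} \eta_r \, \mathbf{a}_r \circ \mathbf{b}_r \circ \mathbf{d}_r, \qquad \|\mathbf{a}_r\|_2 = \|\mathbf{b}_r\|_2 = \|\mathbf{d}_r\|_2 = 1, \]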
where $\|\cdot\|_2$ stands for the $\ell_2$ norm, defined as $\|\mathbf{z}\|_2 = \big(\sum_i z_i^2\big)^{1/2}$ for a vector $\mathbf{z}$. Restricting the gene, cell-type, and time factors to be normalized individually ensures that the scalar weight captures the scale of each rank. This further helps in rank selection and in introducing sparsity, as discussed later. The CP decomposition reduces parameter dimensionality while enabling information sharing across genes, cell types, and time. Specifically, we represent the coefficient tensor as a sum of $R$ rank-one components, where $r = 1, \ldots, R$ indexes the component (“rank”). Note that under the CP decomposition, the same rank appears across all modes by construction, since $R$ denotes the number of rank-one components. For greater flexibility, a Tucker factorization with mode-specific ranks may be considered. However, for sparse tensors, CP decompositions are usually adequate.
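To make the dimensionality reduction concrete, the sketch below rebuilds a genes × cell types × times tensor from weighted rank-one components and compares parameter counts. The function and argument names are illustrative assumptions, and the CP count ignores the unit-norm constraints, so it slightly overstates the number of free parameters.

```python
import numpy as np

def cp_to_tensor(weights, gene_f, cell_f, time_f):
    """Sum of R weighted rank-one terms.

    weights: (R,), gene_f: (G, R), cell_f: (C, R), time_f: (T, R);
    column r of each factor matrix belongs to component r.
    """
    return np.einsum("r,gr,cr,tr->gct", weights, gene_f, cell_f, time_f)

def n_parameters(G, C, T, R):
    """(unstructured count, CP count) for a G x C x T tensor of rank R."""
    return G * C * T, R * (G + C + T + 1)
```

For instance, with 182 genes, 3 cell types, 2 time points, and 4 components, `n_parameters(182, 3, 2, 4)` gives 1092 unstructured coefficients versus 752 under the CP form; the saving grows quickly as the number of cell types or time points increases.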
In the gene direction, we place a LASSO penalty on the gene factor of each rank component, inducing sparsity in the gene mode of the coefficient tensor. The resulting objective function is:
| (2) |
where $\|\cdot\|_1$ stands for the $\ell_1$ norm, defined as $\|\mathbf{z}\|_1 = \sum_i |z_i|$ for a vector $\mathbf{z}$. This penalty yields a sparse solution and thereby selects the important genes.
3.1 Estimation
The overall implementation is fully automated with efficient data-driven tuning. We update all the parameters using a blocked coordinate descent algorithm. To do so, we first partition the full parameter set into blocks, one per rank component, with the $r$-th block collecting the weight and the gene, cell-type, and time factors of component $r$. In each iteration, we cycle through the blocks and update the components of each block sequentially. The blocked coordinate descent updates are given below. After each update, we normalize each factor by its $\ell_2$ norm and adjust the weight accordingly.
We update by holding all others fixed, and use the leave-one-coordinate-out residual
Define the sufficient statistics
Let $\mathcal{S}(z, \lambda) = \operatorname{sign}(z)\,(|z| - \lambda)_+$ be the soft-thresholding operator, where $(x)_+ = \max(x, 0)$. Then the closed-form update is
| (3) |
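with the convention if . For fixed values of the remaining parameters, the objective function reduces to a one-dimensional quadratic in the updated coordinate plus an $\ell_1$ penalty, whose minimizer is given by the soft-thresholding operator above.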
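The role of soft-thresholding in a penalized coordinate update can be illustrated with a generic one-dimensional problem. The sketch below is not the paper's update (3): the curvature `a` and linear term `b` are placeholders for whatever sufficient statistics the model produces, and returning zero when `a == 0` is an assumption made here.

```python
import numpy as np

def soft_threshold(z, lam):
    """S(z, lam) = sign(z) * max(|z| - lam, 0)."""
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

def lasso_coordinate_update(a, b, lam):
    """Minimize 0.5 * a * theta**2 - b * theta + lam * |theta| over theta.

    a >= 0 is the curvature and b the linear term of the quadratic.
    """
    if a == 0:
        return 0.0  # assumed convention when the coordinate has no curvature
    return float(soft_threshold(b, lam) / a)
```

For example, with `a = 2`, `b = 3`, and `lam = 1`, the objective on the positive side is theta^2 - 2*theta, which is minimized at theta = 1, matching `soft_threshold(3, 1) / 2`.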
For updating the entry in the cell-type loadings, we set . This block is unpenalized weighted least squares:
| (4) |
Similarly for the time loadings, we set . Then
| (5) |
To update entry , we set . Then
| (6) |
Computational efficiency and stability:
In our implementation, we keep track of the residuals and update them incrementally. When a coordinate changes by , set ; analogous updates are used for , and . This reduces the per-coordinate cost from for recomputation to , yielding substantial speed-ups without any approximation.
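The incremental residual bookkeeping can be sketched on a generic linear model, used here only as a stand-in for the kernel-weighted model: when one coefficient changes, the residual is adjusted by the change times the corresponding column instead of being recomputed. The function name and the plain design matrix `X` are assumptions for illustration.

```python
import numpy as np

def set_coordinate(resid, X, theta, j, new_value):
    """Set theta[j] = new_value and keep resid = y - X @ theta in sync, in place."""
    delta = new_value - theta[j]
    resid -= delta * X[:, j]   # one column operation instead of a full X @ theta
    theta[j] = new_value
    return resid
```

After any sequence of such updates, `resid` agrees with a full recomputation of `y - X @ theta` up to floating-point error, so the speed-up involves no approximation.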
After updating the factors for each rank, we rescale them to unit $\ell_2$ norm and absorb the entire scale into the scalar weight. That is, for each rank we divide the gene, cell-type, and time factors by their respective norms and multiply the weight by the product of the three norms,
so that the contribution from the $r$-th rank (and thus the fitted values) is unchanged. This prevents factor columns from arbitrarily exploding or vanishing with compensating shrinkage or expansion in the weight, keeps the coordinate-descent curvature well conditioned, and makes the penalty on the gene factor act consistently rather than depending on arbitrary scaling of the factors.
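A minimal sketch of this rescaling step for a single component is given below. The function name is hypothetical, and returning a zero weight when any factor is identically zero is an assumption about how degenerate components might be handled.

```python
import numpy as np

def normalize_component(weight, a, b, d):
    """Rescale one CP component so each factor has unit l2 norm.

    The product of the three norms is absorbed into the weight, so
    weight * (a outer b outer d) is unchanged.
    """
    na, nb, nd = np.linalg.norm(a), np.linalg.norm(b), np.linalg.norm(d)
    scale = na * nb * nd
    if scale == 0:
        return 0.0, a, b, d   # assumed handling of a vanished component
    return weight * scale, a / na, b / nb, d / nd
```

Because only the split of scale between the weight and the factors changes, the fitted values are identical before and after the call.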
Stopping rules:
One may continue the above iterative process up to a fixed number of iterations. Instead, we continue until the following predefined convergence criterion is met:
| (7) |
The second criterion in Eq.(7) was also used in Anandkumar et al. (2015) and Sun et al. (2017) in the context of tensor-valued response regression.
Automatic CP-rank update:
In our approach, the selection of the CP rank is seamlessly integrated into the blocked coordinate descent algorithm, so the rank need not be specified or tuned manually during model fitting. Instead, the method begins by initializing the model with a sufficiently large CP rank, which we recommend setting to the product of the number of cell types ($C$) and time points ($T$), i.e., $R_0 = CT$. The general upper bound on the maximum CP rank of any three-way tensor $\mathcal{X} \in \mathbb{R}^{I \times J \times K}$ satisfies (Kruskal, 1989; Kolda and Bader, 2009)
| $\operatorname{rank}(\mathcal{X}) \le \min\{IJ,\, IK,\, JK\}$ | (8) |
A rank-dropping strategy is then applied during the first 500 iterations, in which the algorithm actively prunes unnecessary components. Subsequent updates then focus on stabilizing the estimates for the rank retained after pruning. Specifically, the CP rank is reduced from $R$ to $R - 1$ by discarding the $r$-th component if any of the following conditions is met:
| (9) |
By automatically discarding redundant or negligibly small components, the rank-dropping strategy allows the model to achieve both good estimation accuracy and computational efficiency by selecting a final rank that balances model complexity and goodness-of-fit automatically.
Lambda selection:
We first fit the method for different choices of the penalty parameter $\lambda$ and then select the optimal $\lambda$ based on a BIC-like criterion. To ensure computational efficiency, we begin with the penalty parameter that yields the all-zero solution and progressively decrease it, as done in path-solution algorithms (Efron et al., 2004; Friedman et al., 2010). The successive penalty parameters are then set recursively as $\lambda_{\ell+1} = 0.9\,\lambda_{\ell}$. Finally, for each penalty, we evaluate the following BIC-like criterion to make the final selection:
| (10) |
where and is the number of nonzero entries in , , and . Although one can choose a slower decay (a factor closer to 1) for a finer exploration of the path, the factor 0.9 works reasonably well in our experiments.
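The geometric penalty path can be sketched as below. The starting value `lam_max`, the path length `n_steps`, and the helper names are assumptions for illustration; the criterion values are taken as given, since any model-selection score evaluated along the path can be plugged in.

```python
import numpy as np

def lambda_path(lam_max, decay=0.9, n_steps=50):
    """Geometric sequence lam_max, decay * lam_max, decay**2 * lam_max, ..."""
    return lam_max * decay ** np.arange(n_steps)

def select_by_criterion(lams, criterion_values):
    """Return the penalty whose criterion value (e.g., a BIC-like score) is smallest."""
    return float(lams[int(np.argmin(criterion_values))])
```

In practice each fit along the path would typically be warm-started from the previous solution, which is what makes starting at the all-zero solution efficient.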
Bandwidth selection:
The kernel bandwidth is set based on an elbow rule, as discussed below. First, we set the candidate bandwidths as the medians of the distances from each plaque to its $k$-th closest neighboring cell for different choices of $k$. Let the median distance based on the $k$-th closest neighboring cell in the $s$-th sample be $h_s(k) = \operatorname{median}_{i}\, d_{si}^{(k)}$, where $d_{si}^{(1)} \le d_{si}^{(2)} \le \cdots$ stands for the sorted sequence of the distances of the different cells from the $i$-th plaque.
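The candidate bandwidth for a given neighborhood size can be computed as in the following sketch for one sample. The function name is hypothetical, and the brute-force distance matrix is used only for clarity; a spatial index would be preferable for large sections.

```python
import numpy as np

def knn_median_bandwidth(plaque_xy, cell_xy, k):
    """Median over plaques of the distance to the k-th closest cell."""
    # Distances from every plaque to every cell, shape (n_plaques, n_cells).
    dist = np.linalg.norm(plaque_xy[:, None, :] - cell_xy[None, :, :], axis=2)
    kth = np.sort(dist, axis=1)[:, k - 1]   # k-th smallest distance per plaque
    return float(np.median(kth))
```

Because the median is taken within each sample, the same `k` generally maps to different bandwidths in sections with different cell densities.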
Finally, we plot the normalized loss
as a function of the neighborhood size $k$ and select $k$ using the elbow (point-of-diminishing-returns) criterion, defined as the value of $k$ maximizing the perpendicular distance to the line connecting the endpoints of the loss curve (Satopaa et al., 2011). Even with the same $k$ for all samples, the sample-specific bandwidths may differ.
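The elbow criterion described above amounts to a point-to-line distance computation. The sketch below applies it to the curve exactly as supplied; the function name is an assumption, and because perpendicular distance depends on the scaling of the two axes, any rescaling of `x` or `y` should be done before the call.

```python
import numpy as np

def elbow_index(x, y):
    """Index of the point farthest from the line joining the curve's endpoints."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    dx, dy = x[-1] - x[0], y[-1] - y[0]
    # |cross(B - A, A - P)| / |B - A| for endpoints A, B and each curve point P.
    dist = np.abs(dx * (y[0] - y) - dy * (x[0] - x)) / np.hypot(dx, dy)
    return int(np.argmax(dist))
```

For a loss curve such as `y = [10, 4, 2, 1.5, 1]` over `x = [1, 2, 3, 4, 5]`, the farthest point from the chord is the second one, so the rule would pick the neighborhood size at index 1.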
3.2 Initialization
To initialize the weights and the gene, cell-type, and time factors, we apply a weighted ridge regression without the CP decomposition structure. For each (cell type, time point) pair, define the objective
and the estimator
Here, we put the $\ell_2$ penalty on the coefficient vector. A CP decomposition with the initial rank described above is then applied to the resulting coefficient estimates to obtain the initial model parameters, where , for .
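A weighted ridge fit of the kind used for initialization has a closed form, sketched below for one (cell type, time point) pair. The layout is an assumption: each row of `X` holds the expression of one cell in a plaque-cell pair, `y` the corresponding plaque outcome, and `w` the kernel weight of that pair; the ridge parameter `alpha` is a placeholder. The subsequent CP decomposition of the stacked estimates is not shown.

```python
import numpy as np

def weighted_ridge(X, y, w, alpha):
    """Minimize sum_i w_i * (y_i - x_i @ beta)**2 + alpha * ||beta||_2**2."""
    Xw = X * w[:, None]                      # rows of X scaled by their weights
    gram = X.T @ Xw + alpha * np.eye(X.shape[1])
    return np.linalg.solve(gram, Xw.T @ y)   # solves (X'WX + alpha I) beta = X'Wy
```

The ridge term keeps the linear system well conditioned even when there are more genes than effective plaque-cell pairs, which is why it is a convenient unstructured starting point.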
Step 1: Select a sequence of penalty parameters $\lambda$ and neighborhood sizes $k$. For each choice of $(\lambda, k)$, run the following:
Step 2: Select the optimal $\lambda$ from the pre-specified sequence for each choice of $k$ using the criterion in Eq. (10).
Step 3: Select the optimal $k$ by applying the elbow rule, and obtain the final estimate of the coefficient tensor.
4 Simulations
To evaluate our method’s ability to recover the important genes and quantify gene-plaque relationships, we design a simulation setting based on simulated spatial transcriptomic data generated with the SRTsim package (Zhu et al., 2023). Following the real data, the plaque sites are also set to be well separated in the simulation.
We compare our estimates with those from a LASSO-regularized linear regression model. Since fitting LASSO requires the data in a paired form as (outcome, predictors), we first pair each outcome with the gene expression of the cell nearest to the corresponding outcome location. Then, we fit the LASSO regression model to these pairs.
4.1 Data Generation
While generating the simulated data, we mimic our real data and consider two time points. Thus, for each replicated dataset, we generate two square-shaped SRT samples using SRTsim package, one for each time point. For each sample , let denote the set of spatial spots. For each sample , we simulate a square section with an expected spots and cell-type groups (SRTsim “number of groups”=3). SRTsim returns a gene-by-spot expression matrix , 2D coordinates , and group labels . Each spot has a unique location within the section.
For each sample , we designate a plaque set of size , chosen to be well separated in space and balanced across cell types (i.e., , where ). The remaining spots serve as the pool of neighboring predictor cells, reflecting the spatial misalignment in real data where outcomes are observed at plaques while predictors are measured at nearby SRT locations.
For each plaque, we generate the outcome using the linear model:
where is the regression coefficient for - cell-type at time . Let . Rather than independent noise, we add spatially correlated errors,
with . We estimate once from the real plaque by fitting an exponential variogram using geoR (variog/variofit; Ribeiro Jr and Diggle (2025)) and then hold fixed across samples and replicates. Although is generated from the expression at the same plaque location , in the downstream analysis we treat plaque-site expression as unobserved and remove from the predictor set, so are modeled using only neighboring non-plaque spots .
We set the total number of predictors at 50, with 5 out of 50 having non-zero effects, and assume a CP-based low-rank form with rank 4. In order to achieve this, we set a randomly selected 5 rows in to be non-zero and the rest all zero, with non-zero entries generated from The entries in and are generated from and respectively.
4.2 Estimation
We set the at following maximum rank characterization in Eq.(8), and vary bandwidth over choosing . We specify . We then apply the procedure described in Algorithm 1, which performs data-driven selection of these parameters and yields the final estimate.
To assess variable selection performance, we compute the true positive rate (TPR) and false positive rate (FPR) at threshold in Table 1 as
By thresholding the estimated coefficients at a sequence of cutoff values, we trace out an ROC curve in the TPR–FPR plane. We augment it with the extreme points (0,0) and (1,1) to ensure a full range of rate values, and compute the area under the curve (AUC) via the trapezoidal rule.
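The ROC construction described above can be sketched as follows. The inputs are assumptions: `scores` is a per-gene magnitude derived from the estimated coefficients (for example, the largest absolute effect of each gene), and `truth` marks the genes with non-zero true effects. A gene is selected when its score is at least the cutoff.

```python
import numpy as np

def roc_auc(scores, truth):
    """AUC from thresholding scores, with (0, 0) and (1, 1) appended."""
    scores = np.abs(np.asarray(scores, dtype=float))
    truth = np.asarray(truth, dtype=bool)
    fpr, tpr = [0.0], [0.0]
    for cut in np.sort(np.unique(scores))[::-1]:   # strictest cutoff first
        selected = scores >= cut
        tpr.append((selected & truth).sum() / truth.sum())
        fpr.append((selected & ~truth).sum() / (~truth).sum())
    fpr.append(1.0)
    tpr.append(1.0)
    fpr, tpr = np.array(fpr), np.array(tpr)
    # Trapezoidal rule over the (FPR, TPR) points.
    return float(np.sum(np.diff(fpr) * (tpr[1:] + tpr[:-1]) / 2.0))
```

Note that an all-zero estimate produces a single cutoff at which every gene is selected, so the augmented curve is the diagonal and the AUC is 0.5.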
To quantify overall estimation accuracy, we calculate the mean squared error (MSE) of the coefficient estimates:
Across 30 replicates over plaque numbers (50, 100, 200) and error variances (1, 5, 10, 100, 200), the proposed method delivers better discrimination than the paired LASSO in 14 of the 15 settings in Table 1: AUC = 0.495–0.650 (mean 0.581) versus 0.484–0.546 (mean 0.505), with lower MSE in every setting. The MSE gains persist under high noise, and the AUC of the proposed method is higher with 100 or 200 plaques than with 50, indicating improved recovery of the underlying cell-type-specific gene effects. As shown in Table 1, the estimates from our proposed method are better both in estimation accuracy and in selecting the genes with non-zero effects, whereas the paired LASSO often collapses to all-zero solutions.
| Number of plaques | Error variance | Proposed MSE | Paired MSE | Proposed AUC | Paired AUC |
| 50 | 1 | 6.333 | 7.086 | 0.574 | 0.485 |
| 50 | 5 | 6.367 | 7.014 | 0.563 | 0.485 |
| 50 | 10 | 6.392 | 6.985 | 0.544 | 0.484 |
| 50 | 100 | 6.891 | 8.029 | 0.495 | 0.501 |
| 50 | 200 | 7.359 | 9.169 | 0.510 | 0.507 |
| 100 | 1 | 6.256 | 7.455 | 0.647 | 0.514 |
| 100 | 5 | 6.253 | 7.422 | 0.629 | 0.524 |
| 100 | 10 | 6.275 | 7.380 | 0.650 | 0.525 |
| 100 | 100 | 6.403 | 8.330 | 0.565 | 0.539 |
| 100 | 200 | 6.586 | 9.256 | 0.568 | 0.546 |
| 200 | 1 | 6.278 | 6.706 | 0.626 | 0.484 |
| 200 | 5 | 6.235 | 6.717 | 0.607 | 0.489 |
| 200 | 10 | 6.273 | 6.705 | 0.606 | 0.494 |
| 200 | 100 | 6.319 | 6.815 | 0.565 | 0.501 |
| 200 | 200 | 6.486 | 8.356 | 0.568 | 0.498 |
5 A Plaque Size Analysis
In this section, we perform the integrative spatial transcriptomic analysis to examine the relationship between gene expression profiles and amyloid-β (Aβ) plaque pathology in Alzheimer’s disease (AD), focusing on two different stages of disease progression and three specific cell types. We analyze four SRT tissue sections from STARmap PLUS, with two sections at each time point (8 and 13 months). Specifically, sections correspond to the two 8-month replicates, and sections correspond to the two 13-month replicates. The cell counts are , and plaque counts are . We use the normalized expression provided by Zeng et al. (2023) to place all genes on a common scale across sections and time points so that estimated coefficients are comparable across genes and over time.
We set the at , and . The final selected values are always found to be smaller than these pre-specified upper bounds. We consider the discrete grid for the number of neighbors , while setting the bandwidth for the - sample; the bandwidth selection procedure is described in Algorithm 1.
Prior to fitting the plaque-size regressions, we apply an expression filter independent of plaque morphology to exclude transcripts with insufficient detection for stable estimation. Specifically, within each of the three target cell types and for each tissue sample collected at 8 and 13 months, we retain genes detected (nonzero) in of near-plaque cells (within m of the plaque center). We then augment the retained set with (i) 64 marker genes from Zeng et al. (2023) and (ii) plaque-induced genes (PIGs) from Chen et al. (2020) (32 of the 57 PIGs are present in our dataset). This procedure yields a final gene panel of 182 transcripts for modeling plaque morphology.
From our proposed model, we estimate a 3-dimensional coefficient tensor (gene × cell type × time point), with a CP structure to relate plaque size to local, cell-type-resolved expression across time. The final rank is 4 with penalty . For the bandwidth selection, the elbow rule gives us the choice of for all four samples. Accordingly, we set the sample-specific bandwidth for the kernel function as , where denotes the median distance to the $k$-th nearest neighboring cell in sample $s$ (defined in Sec 3.1). This yields m. Here, each coefficient $\beta_{g,c,t}$ is associated with the normalized expression of gene $g$, holding the other variables in the model fixed, for cell type $c$ and time $t$.
For each cell type and time , we summarize the overall strength of association between gene expression in cell type and the plaque size at time by the norm of all gene effects. We summarize the average direction of association by i.e., the mean gene effect across all genes for cell type at time . At 8 months, is modest (Astrocyte: 2.30; Oligodendrocyte: 6.20; Microglia: 3.04), but it increases sharply by 13 months (Astrocyte: 19.40; Oligodendrocyte: 39.74; Microglia: 21.81), indicating substantially stronger plaque–expression associations over time. The signed means are near zero at 8 months (Astrocyte: , Oligodendrocyte ), while Microglia shows a small positive mean (), meaning that higher microglial gene expression tends to be associated with larger plaques on average at 8 months. By 13 months, the average effects are predominantly positive for Oligodendrocyte (0.0379) and especially Microglia (0.1536), while Astrocyte remains near zero (0.0017). Overall, these results point to strong, increasing plaque-proximal gene-expression effects in oligodendrocyte-lineage cells and microglia over time, with microglia showing the clearest positive directional signal at 13 months.
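The two summaries above can be computed directly from a CP-structured coefficient tensor, as in the following sketch. It assumes the standard CP form, in which each entry is a weighted sum over components of the product of gene, cell-type, and time loadings, and it assumes a Euclidean norm for the magnitude summary, since the norm type is not reproduced in this text. The function names and toy factors are hypothetical.

```python
import numpy as np

def cp_to_tensor(weights, A, B, C):
    """Rebuild coef[g, c, t] = sum_r weights[r] * A[g, r] * B[c, r] * C[t, r]."""
    return np.einsum("r,gr,cr,tr->gct", weights, A, B, C)

def summarize(coef):
    """Per (cell type, time): Euclidean norm and mean of the gene effects."""
    magnitude = np.sqrt((coef ** 2).sum(axis=0))   # assumed L2 norm over genes
    signed_mean = coef.mean(axis=0)                # mean over genes
    return magnitude, signed_mean

# Toy rank-1 example: 2 genes, 2 cell types, 2 time points (illustrative only).
weights = np.array([2.0])
A = np.array([[3.0], [4.0]])      # gene factor
Bf = np.array([[1.0], [-1.0]])    # cell-type factor
Cf = np.array([[0.0], [1.0]])     # time factor (all mass at the second time point)

coef = cp_to_tensor(weights, A, Bf, Cf)
magnitude, signed_mean = summarize(coef)
```

In this toy example both cell types have the same magnitude at the second time point but opposite signed means, which illustrates why the two summaries are reported together: the norm captures strength regardless of sign, while the mean captures net direction.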
Following the CP structure, we pursue a component-level interpretation of the results. CP decomposes the coefficient tensor into four rank-1 components that summarize the gene–cell-type–time patterns in the association with plaque size (Table 2). The first three components have substantially larger weights than the fourth. For component with factors (genes), (cell types), (times) and weight , we orient it so that its top-loading genes have positive entries in the gene factor, and then summarize its overall direction using the signed-average product
We refer to 13 months as the late disease stage. As shown in Table 2, each component is dominated by the late time point, with the 13-month time loading having magnitude close to 1, indicating that the estimated associations concentrate their mass at 13 months (with direction determined by the joint sign across modes). Among them, Component 3 shows a clearly positive direction (Net direction ) with largest loadings on Oligodendrocytes and Microglia at 13 months; top genes include Trem2, Tmsb4x, Tyrobp, and Plp1, indicating a late microglia–oligodendrocyte signature in which higher expression is associated with larger plaques. In contrast, Components 1 and 2 carry negative temporal loadings at 13 months and split their cell-type emphasis: one tilts toward Oligodendrocytes (Oligodendrocyte ↑, Microglia ↓) and the other toward Microglia (Microglia ↑, Astrocyte ↓). Despite substantial overlap in top genes (e.g., Trem2, C1qa, Aplp1), their opposite cell-type signs imply opposing directions of association across components at the same time point, consistent with heterogeneity between microglia-enriched expression patterns and oligodendrocyte/myelin-related pathways. Finally, Component 4 is uniformly negative (Net direction ), dominated by Oligodendrocyte and Microglia loadings at 13 months with genes such as Tmsb4x, Plp1, C1qa, and Cst3, suggesting a late-stage pattern in which higher expression accompanies smaller plaques. In summary, the estimated associations concentrate at 13 months and along a Microglia–Oligodendrocyte axis, but with mixed directions across components: Component 3 aligns with larger plaques, whereas Components 1, 2, and 4 align with smaller plaques, suggesting distinct plaque-proximal expression patterns.
These multigene expression patterns from the CP components align well with established biology: microglial activation modules (e.g., Trem2, Tyrobp, C1q complex) rise near plaques (Hong et al., 2016; Keren-Shaul et al., 2017; Krasemann et al., 2017), while oligodendrocyte/myelin genes (e.g., Plp1, Mbp) reflect white-matter processes implicated in AD progression (McKenzie et al., 2017; Nasrabady et al., 2018). CP combines these signals into a small number of interpretable, signed components that are localized by cell type and disease stage, enabling component-level summaries rather than gene-by-gene effects. For instance, a component dominated by microglia at 13 months shows a positive association with plaque size, whereas a component dominated by oligodendrocytes at 13 months shows a negative association.
| Component | Weight | NetDir | Top cells | Top times | Top genes |
|---|---|---|---|---|---|
| 1 | | | Oligodendrocyte (), Microglia () | 13 mo (), 8 mo () | Trem2(), Tyrobp(), C1qa(), Aplp1(), Mbp() |
| 2 | | | Microglia (), Astrocyte () | 13 mo (), 8 mo () | Trem2(), C1qa(), Aspa(), Aplp1(), Tyrobp() |
| 3 | | | Oligodendrocyte (), Microglia () | 13 mo (), 8 mo () | Trem2(), Tmsb4x(), Tyrobp(), Plp1(), Aplp1() |
| 4 | | | Oligodendrocyte (), Microglia () | 13 mo (), 8 mo () | Tmsb4x(), Plp1(), C1qa(), Cst3() |
Gene-level examples consistent with prior literature (associational). To summarize temporal changes in estimated regression coefficients (associational gene effects), we write for gene in cell type , where (8 months) and (13 months). Astrocytes: transport-related markers show increasingly negative associations with plaque size (Aqp4 , Slc13a3 ), while alarmin/stress signals become more positively associated (Il33 , S100a6 ), consistent with reactive astrocyte frameworks and IL-33–microglia crosstalk (Escartin et al., 2021; Vainchtein et al., 2018). Oligodendrocyte lineage: larger plaques are associated with stronger precursor-like signal (Gpr17 ) and reduced mature oligodendrocyte identity (Sox10 , Aspa ). These patterns are consistent with reported links between oligodendrocyte function and iron/transferrin biology (Todorich et al., 2009), and OPC–vasculature coupling provides spatial context (Tsai et al., 2016). Microglia: an early Trem2-associated effect attenuates and reverses (Trem2 ), while a later Apoe/Tyrobp-centered signature becomes strongly positively associated with plaque size (Apoe ; Tyrobp ), mirroring reported DAM-like shifts and the TREM2–APOE pathway (Keren-Shaul et al., 2017; Krasemann et al., 2017; Deczkowska et al., 2018).
Novel or less-established associations requiring validation (hypothesis-generating). Oligodendrocyte “logistics”: strong, late positive coefficients for resource intake and protein production (Trf , Pabpc1 , Caskin1 ) suggest a resource-mobilizing oligodendrocyte niche as plaques enlarge; in contrast, Flt1 is strongly negative (), pointing to a vascular/adhesion signature aligned with smaller plaques. Microglial cytoskeleton and lysosome: Rhoc shows a robust negative association (), consistent with more contractile morphology near smaller plaques, while cathepsins diverge (Ctsl vs Ctss ), indicating targeted lysosomal remodeling rather than uniform activation. Astrocytes: persistence and amplification of transport↓/alarmin↑ patterns across time underscore a simple, testable hypothesis: supporting basic astrocytic transport may align with smaller plaques.
6 Conclusion
In this paper, we introduce a novel kernel-weighted method to model how plaque-proximal single-cell spatial transcriptomic expression patterns relate to plaque size. The proposed framework is designed to capture both cell-type-specific and time-varying effects, thereby accommodating heterogeneity across genes, cell populations, and sample collection times. In addition, the method is fully automated through data-driven specification of the tuning parameters, such as rank, penalty, and bandwidth. We further adopt a low-rank model for the regression coefficients, which naturally captures the shared structure and inherent dependencies among genes, cell types, and time points.
Our analysis reveals several plaque-size-associated expression patterns that are strongest at 13 months, mainly in microglia (brain immune cells) and oligodendrocytes (myelin-related cells), indicating that plaque-size associations reflect multiple transcriptional signatures acting in parallel rather than a single dominant signature. At a late disease stage, the microglial pattern is characterized by lipid-handling and complement-related genes, consistent with a plaque-proximal activation signature (Hong et al., 2016; Keren-Shaul et al., 2017; Krasemann et al., 2017); oligodendrocyte-lineage cells show a shift toward less mature (precursor-like) signatures, together with increased iron-homeostasis and protein-synthesis transcripts; and astrocyte signals are comparatively weak in net plaque-size tracking, with transport-related genes showing attenuated associations. These findings suggest testable hypotheses for future validation: reducing the late-stage microglial-dominant lipid/complement activity, supporting astrocyte water/solute transport, and steering oligodendrocytes toward a mature maintenance state, in each case to evaluate whether plaque growth can be limited.
For future work, we will analyze how gene expression changes in relation to plaque size. This will let us map which genes and multigene expression patterns rise or fall as plaques enlarge and compare these patterns across disease stages and brain regions. These insights may have important translational implications for understanding disease mechanisms and guiding hypothesis-driven experimental studies. Methodologically, there are opportunities to consider a more flexible Tucker factorization for , as well as effect characterizations beyond linearity.
Data availability
We use publicly available data from Zeng et al. (2023) for our analysis.
Acknowledgement
The authors would like to thank Dr. Dongyuan Wu for helpful discussions and suggestions regarding the data. During the preparation of this work, the authors used ChatGPT to assist with writing. All content was subsequently reviewed and edited by the authors, who take full responsibility for the publication.
References
- Spatial measurement error and correction by spatial SIMEX in linear regression models when using predicted air pollution exposures. Biostatistics 17 (2), pp. 377–389.
- Guaranteed Non-Orthogonal Tensor Decomposition via Alternating Rank-1 Updates. arXiv:1402.5180.
- Spatial transcriptomics reveals distinct and conserved tumor core and edge architectures that predict survival and targeted therapy response. Nature Communications 14 (1), pp. 5029.
- Spatial transcriptomic analysis of a diverse patient cohort reveals a conserved architecture in triple-negative breast cancer. Cancer Research 83 (1), pp. 34–48.
- A spatio-temporal downscaler for output from numerical models. Journal of Agricultural, Biological, and Environmental Statistics 15 (2), pp. 176–197.
- Regression Analysis of Sparse Asynchronous Longitudinal Data. Journal of the Royal Statistical Society Series B: Statistical Methodology 77 (4), pp. 755–776. ISSN 1369-7412, 1467-9868.
- Analysis of Individual Differences in Multidimensional Scaling Via an N-way Generalization of “Eckart-Young” Decomposition. Psychometrika 35 (3), pp. 283–319. ISSN 0033-3123, 1860-0980.
- Spatially resolved, highly multiplexed RNA profiling in single cells. Science 348 (6233), pp. aaa6090.
- Spatially resolved transcriptomics reveals genes associated with the vulnerability of middle temporal gyrus in Alzheimer’s disease. Acta Neuropathologica Communications 10 (1), pp. 188.
- Spatial Transcriptomics and In Situ Sequencing to Study Alzheimer’s Disease. Cell 182 (4), pp. 976–991.e19. ISSN 00928674.
- The cellular phase of Alzheimer’s disease. Cell 164 (4), pp. 603–615.
- Disease-associated microglia: a universal immune sensor of neurodegeneration. Cell 173 (5), pp. 1073–1081.
- Least angle regression.
- Reactive astrocyte nomenclature, definitions, and future directions. Nature Neuroscience 24, pp. 312–325.
- Local polynomial modelling and its applications: monographs on statistics and applied probability 66. Routledge.
- Multiscale geographically weighted regression (MGWR). Annals of the American Association of Geographers 107 (6), pp. 1247–1265.
- Regularization paths for generalized linear models via coordinate descent. Journal of Statistical Software 33 (1), pp. 1–22.
- Combining incompatible spatial data. Journal of the American Statistical Association 97 (458), pp. 632–648.
- The amyloid-β pathway in Alzheimer’s disease. Molecular Psychiatry 26 (10), pp. 5481–5503.
- Foundations of the PARAFAC procedure: Models and conditions for an "explanatory" multi-model factor analysis.
- Complement and microglia mediate early synapse loss in Alzheimer mouse models. Science 352 (6286), pp. 712–716.
- Revised criteria for diagnosis and staging of Alzheimer’s disease: Alzheimer’s Association Workgroup. Alzheimer’s & Dementia 20 (8), pp. 5143–5169.
- NIA-AA research framework: toward a biological definition of Alzheimer’s disease. Alzheimer’s & Dementia 14 (4), pp. 535–562.
- Advances in spatial transcriptomics and its applications in cancer research. Molecular Cancer 23 (1), pp. 129.
- A unique microglia type associated with restricting development of Alzheimer’s disease. Cell 169 (7), pp. 1276–1290.e17.
- Towards a standardized notation and terminology in multiway analysis. Journal of Chemometrics 14 (3), pp. 105–122. ISSN 0886-9383, 1099-128X.
- Tensor Decompositions and Applications. SIAM Review 51 (3), pp. 455–500. ISSN 0036-1445, 1095-7200.
- The TREM2-APOE pathway drives the transcriptional phenotype of dysfunctional microglia in neurodegenerative diseases. Immunity 47 (4), pp. 566–581.
- Rank, decomposition, and uniqueness for 3-way and n-way arrays. In Multiway Data Analysis, pp. 7–18.
- Regression Analysis of Asynchronous Longitudinal Functional and Scalar Data. Journal of the American Statistical Association 117 (539), pp. 1228–1242. ISSN 0162-1459, 1537-274X.
- Asynchronous Functional Linear Regression Models for Longitudinal Data in Reproducing Kernel Hilbert Space. Biometrics 79 (3), pp. 1880–1895. ISSN 0006-341X, 1541-0420.
- Alzheimer disease: an update on pathobiology and treatment strategies. Cell 179 (2), pp. 312–339.
- Microglia-astrocyte crosstalk in the amyloid plaque niche of an Alzheimer’s disease mouse model, as revealed by spatial transcriptomics. Cell Reports 43 (6).
- Transcriptome-scale spatial gene expression in the human dorsolateral prefrontal cortex. Nature Neuroscience 24 (3), pp. 425–436.
- Multiscale network modeling of oligodendrocytes reveals molecular components of myelin dysregulation in Alzheimer’s disease. Molecular Neurodegeneration 12 (1), pp. 82.
- Spatial and single-nucleus transcriptomic analysis of genetic and sporadic forms of Alzheimer’s disease. Nature Genetics 56 (12), pp. 2704–2717.
- On estimating regression. Theory of Probability & Its Applications 9 (1), pp. 141–142.
- White matter changes in Alzheimer’s disease: a focus on myelin and oligodendrocytes. Acta Neuropathologica Communications 6 (1), pp. 22.
- SpotClean adjusts for spot swapping in spatial transcriptomics data. Nature Communications 13 (1), pp. 2971.
- mgwr: a Python implementation of multiscale geographically weighted regression for investigating process spatial heterogeneity and scale. ISPRS International Journal of Geo-Information 8 (6), pp. 269.
- GeoR: analysis of geostatistical data. R package version 1.9-6.
- Finding a "kneedle" in a haystack: detecting knee points in system behavior. In 2011 31st International Conference on Distributed Computing Systems Workshops, pp. 166–171.
- In situ transcription profiling of single cells reveals spatial organization of cells in the mouse hippocampus. Neuron 92 (2), pp. 342–357.
- Visualization and analysis of gene expression in tissue sections by spatial transcriptomics. Science 353 (6294), pp. 78–82.
- Provable Sparse Tensor Decomposition. Journal of the Royal Statistical Society Series B: Statistical Methodology 79 (3), pp. 899–916. ISSN 1369-7412, 1467-9868.
- Measurement error in two-stage analyses, with application to air pollution epidemiology. Environmetrics 24 (8), pp. 501–517.
- Oligodendrocytes and myelination: the role of iron. Glia 57 (5), pp. 467–478.
- Oligodendrocyte precursors migrate along vasculature in the developing nervous system. Science 351 (6271), pp. 379–384.
- Some Mathematical Notes on Three-Mode Factor Analysis. Psychometrika 31 (3), pp. 279–311. ISSN 0033-3123, 1860-0980.
- Astrocyte-derived interleukin-33 promotes microglial synapse engulfment and neural circuit development. Science 359 (6381), pp. 1269–1273.
- High-definition spatial transcriptomics for in situ tissue profiling. Nature Methods 16 (10), pp. 987–990.
- Three-dimensional intact-tissue sequencing of single-cell transcriptional states. Science 361 (6400), pp. eaat5691.
- Reconstruction of the tumor spatial microenvironment along the malignant-boundary-nonmalignant axis. Nature Communications 14 (1), pp. 933.
- Integrative in situ mapping of single-cell transcriptional states and tissue histopathology in a mouse model of Alzheimer’s disease. Nature Neuroscience. ISSN 1097-6256, 1546-1726.
- SRTsim: spatial pattern preserving simulations for spatially resolved transcriptomics. Genome Biology 24 (1), pp. 39.