\NAT@set@cites

An Effective Gram Matrix Characterizes Generalization
in Deep Networks

Rubing Yang Pratik Chaudhari
University of Pennsylvania
Email: [email protected] [email protected]

Abstract

We derive a differential equation that governs the evolution of the generalization gap when a deep network is trained by gradient descent. This differential equation is controlled by two quantities, a contraction factor that brings together trajectories corresponding to slightly different datasets, and a perturbation factor that accounts for them training on different datasets. We analyze this differential equation to compute an “effective Gram matrix” that characterizes the generalization gap after training in terms of the alignment between this Gram matrix and a certain initial “residual”. Empirical evaluations ¹¹1Code at https://github.com/grasp-lyrl/effective-gram-matrix.git on image classification datasets indicate that this analysis can predict the test loss accurately. Further, at any point during training, the residual predominantly lies in the subspace of the effective Gram matrix with the smallest eigenvalues. This indicates that the training process is benign, i.e., it does not lead to significant deterioration of the generalization gap (which is zero at initialization). The alignment between the effective Gram matrix and the residual is different for different datasets and architectures. The match/mismatch of the data and the architecture is primarily responsible for good/bad generalization.

1 Introduction

Generalization is the ability of a model to apply patterns learned from training data to new, unseen data. Deep neural networks are interesting in this regard because, despite having many parameters and a complex loss landscape, they can still generalize well. This challenges the traditional statistical wisdom, e.g., the bias-variance trade-off, which suggests that highly flexible models should overfit and perform poorly on test data. Deep networks however consistently perform well on unseen data, raising fundamental questions about the principles that govern their generalization. A large body of work has sought to tackle this question and there are numerous perspectives on the relationship between training data, test data, and the model class for deep networks in the literature today. While this work provides valuable insights, each of the existing lines of attack have their limitations.

Probably-approximately correct (PAC) frameworks

The PAC learning framework (Valiant, 1984) provides generalization bounds for models trained on independently and identically distributed (i.i.d.) data, using measures such as Vapnik-Chervonenkis (VC) dimension or Rademacher complexity to characterize the hypothesis space. The PAC-Bayes framework (McAllester, 1999) extends these ideas by deriving generalization error bounds for randomized estimators. However, both frameworks are limited in their ability to explain the generalization behavior of modern deep neural networks. Despite their rich hypothesis class and extremely large VC dimensions, deep neural networks consistently achieve remarkable generalization, a phenomenon that defies the worst-case assumptions inherent in these classical frameworks.

Simplified models of deep networks

To move beyond worst-case analyses, a direct examination of the exact solutions of deep neural networks is an appealing approach. However, due to the inherent complexity of these networks, such analyses are often intractable. Instead, various solvable models from statistics and physics have been employed to partially characterize deep neural network behavior, offering valuable theoretical insights. Linear regression, for instance, has been widely used to explore phenomena like benign over-fitting (Bartlett et al., 2020) and double descent (Hastie et al., 2022; Belkin et al., 2020). To investigate the effects of depth, deep linear networks have served as a useful abstraction for studying multi-layer dynamics (Laurent and von Brecht, 2018). One of the most prominent frameworks in this area is the Neural Tangent Kernel (NTK) approach (Jacot et al., 2018), which models the training dynamics of deep neural networks in the infinite-width regime under a kernel-based approximation. Mallinar et al. (2022); Belkin et al. (2018) characterized different regimes of kernel regression and analyzed its resemblance with deep learning. The NTK method has enabled significant results on convergence (Du et al., 2019; Li and Liang, 2018) and generalization (Arora et al., 2019; Jacot et al., 2020). Bowman and Montúfar (2022) analyzed the divergence of finite-width neural networks with NTK regime in different eigenspaces. Similarly, mean-field analysis (Chizat and Bach, 2018; Mei et al., 2019) has been used to study the evolution of neuron distribution in the infinite-width limit. These models often rely on assumptions such as convexity in the loss landscape and constraining the dynamics to a region around initialization. This precludes feature learning—a critical aspect of modern deep neural networks.

Non-worst-case generalization bounds

Infinite-width assumptions are impractical for real-world scenarios, prompting research into deriving bounds for general neural networks by making mild assumptions about the training process and data. For example, Bartlett et al. (2017); Neyshabur et al. (2018) analyzed the complexity of the reachable hypothesis class and proposed weight-dependent generalization bounds that restrict the hypothesis space based on the weights’ distance to a reference point. Dziugaite and Roy (2017); Yang et al. (2022) established generalization bounds for stochastic algorithms using properties of the trained minima, however, these bounds are derived in a post-hoc manner, based on the trained solution. Algorithm-specific approaches, such as sensitivity analyses on the effects of perturbations of the dataset, provide insights into algorithmic stability and its impact on generalization (Bousquet and Elisseeff, 2002; Hardt et al., 2016; Xu and Mannor, 2012; Chu and Raginsky, 2023). Kawaguchi et al. (2022) gives a generalization bound from data-dependent robustness analysis. Xu and Raginsky (2017); Mou et al. (2018) explored the stability of stochastic algorithms using information-theoretic approaches, which has led to further discussions (Negrea et al., 2019; Neu et al., 2021; Lugosi and Neu, 2022) that offer generalization bounds that depend on the training trajectory. By exploring the conditional mutual information (Hafez-Kolahi et al., 2020; Steinke and Zakynthinou, 2020), one can get tighter generalization bounds. Additionally, by assuming specific properties of the training loss landscape, studies such as (Kozachkov et al., 2023) and (Lugosi and Neu, 2022) provided generalization guarantees. These methods often rely on assumptions that are uniformly applied across the entire hypothesis space, which can be problematic. Task-specific analysis of neural network training are provided by (Ramesh et al., 2024; Mao et al., 2024). Chuang et al. (2021) gives margin-based generalization bound normalized by optimal transport cost, deploying the properties of data while ignoring the training process.

1.1 Contributions

We analyze how the generalization gap accumulates along the training trajectory. We derive a differential equation describing the evolution of the averaged loss difference, controlled by the contraction factor and the perturbation factor. This equation tells us how perturbation of dataset affects the output of the predictor during training. We define an “effective Gram matrix” for neural network training, that characterizes the accumulation of generalization gap in different subspaces. Using this effective Gram matrix, we derive a complexity measure that faithfully characterizes the generalization gap in general networks. This analysis allows us to get a data-dependent estimate of the generalization gap. Time-varying contraction and perturbation factors along the training trajectory allow us to avoid making uniform assumptions about the loss function. We next describe the contributions of the paper.

•

In Section 3.2, we derive the differential equation for the evolution of the averaged loss difference $\bar{\Delta}_{n}(t)$ . This equation depends upon a certain “contraction factor” $\bar{c}_{n}$ and a “perturbation factor” $\bar{\epsilon}_{n}$ .
•

In Section 3.3 and Section 3.4, we derive the effective Gram matrix $K_{n}$ , and give a complexity measure in terms of the quadratic form $\vec{r}_{n}(0)^{\top}K_{n}\vec{r}_{n}$ , that faithfully characterize the generalization ability of neural networks. This analysis holds for arbitrary networks and loss functions.
•

In Section 4, we calculate numerical approximations of $K_{n}$ and projections of the residuals onto different eigenspaces of $K_{n}$ . We show that simpler tasks and better model architectures benefit the training process due to an effective Gram matrix with smaller eigenvalues and better alignment with the initial residual, resulting in better generalization at the end of training.

2 Preliminaries

Let $[n]$ denote the set of integers $\{1,...,n\}$ . We use the notation $a\cdot b$ to denote the inner product of vectors $a$ , $b$ . For a function $h$ , we write $h(w)|^{b}_{a}\equiv h(b)-h(a)$ and, sometimes, $h(w)|_{a}\equiv h(a)$ . We use $\absolutevalue{\cdot},\norm{\cdot}_{2},\norm{\cdot}_{F}$ for the absolute value of a scalar, $\ell_{2}$ -norm of a vector or a matrix, and the Frobenius norm of a matrix, respectively. We use the notation $g(t)=\Theta(h(t))$ when there exists constants $c_{0},c_{1},t_{0}>0$ such that $c_{0}\leq g(t)/h(t)\leq c_{1}$ for $t\geq t_{0}$ . We omit the subscript $n$ indicating the size of dataset, and $t$ indicating the time, for all quantities defined in this paper when no ambiguity arises.

Dataset

Let $\mathcal{Z}=\mathcal{X}\times\mathcal{Y}$ be the sample space where $\mathcal{X}$ and $\mathcal{Y}$ are input and output spaces, respectively. Consider a dataset $S_{n}=\{z_{i}=(x_{i},y_{i})\}_{i\in[n]}$ of size $n$ (each $z_{i}\in\mathcal{Z})$ drawn i.i.d. from a distribution $D$ . Let $D^{n}$ denote the distribution of the dataset, i.e., $S_{n}\sim D^{n}$ . Let $S_{n}^{-i}$ denote a modified dataset obtained by removing the $i$ -th datum, i.e., $S_{n}^{-i}=\{z_{1},...,z_{i-1},z_{i+1},...,z_{n}\}$ .

Predictor and the loss function

We consider the predictor $f:\mathcal{W}\times\mathcal{X}\to\mathcal{Y}$ where $\mathcal{W}$ is the weight space. Consider a loss function $\ell:\mathcal{Y}\times\mathcal{Y}\to\mathbb{R}^{+}$ . As an example, for the cross-entropy loss on a $C$ -class classification problem, $\mathcal{Y}=\mathbb{R}^{C}$ , and the loss $\ell(f(w,x),y)=-\sum_{j=1}^{C}y^{j}\log p^{j}$ , where $p^{j}=\exp(f^{j}(w,x))/\left(\sum_{j=1}^{C}\exp(f^{j}(w,x))\right)$ with $y^{j}$ and $f^{j}$ denoting the $j$ -th element of $y$ and $f(w,x)$ , respectively. We use the notation $\ell(w,z)\equiv\ell(f(w,x),y)$ as a shorthand. Let $\bar{\ell}(w,S_{n})=(1/n)\sum_{i=1}^{n}\ell(w,z_{i})$ be the average loss over the dataset $S_{n}$ .

Gradient flow

Let $w_{n}(t)$ and $w_{n}^{-i}(t)$ denote solutions corresponding to the gradient flows

\displaystyle\derivative{w}{t}=-\nabla\ell(w,S_{n}),\quad\derivative{w}{t}=-% \nabla\ell(w,S_{n}^{-i}),

(1)

respectively. Unless otherwise specified, we assume that $w_{n}(t)$ and $w_{n}^{-i}(t)$ are initialized at the same point for all $i\in[n]$ . As a precursor, in Section 3, we will chose $w_{n}(0)$ and $w_{n}^{-i}(0)$ to be initializations of neural networks. In Section 4, we will also sometimes initialize $w_{n}(0)$ and $w_{n}^{-i}(0)$ to be the weights of neural networks that are not fully trained, in this case, $w_{n}(0)$ and $w_{n}^{-i}(0)$ are not necessarily the same.

Generalization gap

We will consider a few different measures of performance of a predictor trained with gradient flow. Given a predictor $f$ , a loss function $\ell$ , and an initialization $w_{n}(0)$ , the generalization loss and the train loss of gradient flow trained on $S_{n}$ at time $t$ are

	$\displaystyle R(S_{n},t)$	$\displaystyle=\mathbb{E}_{z}[\ell(w_{n}(t),z)],$
	$\displaystyle R_{\text{train}}(S_{n},t)$	$\displaystyle=\bar{\ell}(w_{n}(t),S_{n}).$

Our main quantity of interest is the generalization gap, defined as their difference

\displaystyle\delta R(S_{n},t)=R(S_{n},t)-R_{\text{train}}(S_{n},t).

The expected values of these quantities will be useful to us in Section 3. They are the expected generalization loss $\operatorname*{\mathbb{E}}_{S_{n}}\left[R(S_{n},t)\right]$ , expected train loss $\operatorname*{\mathbb{E}}_{S_{n}}\left[R_{\text{train}}(S_{n},t)\right]$ and the expected generalization gap $\operatorname*{\mathbb{E}}_{S_{n}}\left[\delta R(S_{n},t)\right]$ . The notations $\operatorname*{\mathbb{E}}_{S_{n}}$ and $\operatorname*{\mathbb{E}}_{z}$ denote expectations with respect to the random draw of dataset $S_{n}$ and the sample $z$ , from distributions $D^{n}$ and $D$ , respectively. We sometimes omit the subscript $S_{n}$ and $z$ in the following sections.

2.1 Contraction theory

This section introduces some preliminary material on contraction theory (Lohmiller and Slotine, 1998, 2000), which provides a way to analyze solutions of slightly different dynamical systems. Contraction theory rewrites Lyapunov theory (Isidori, 1995; Marino and Tomei, 1995) using a quadratic Lyapunov function, defined by a Riemannian contraction metric and its uniform positive definite matrix, characterizing the necessary and sufficient conditions for exponential convergence of the multiple trajectories to each other and the stability of these trajectories to perturbations of the dynamics. Consider a nonlinear dynamical system

\derivative{\xi}{t}=h(\xi,t).

(2)

The following theorem gives guarantees of the exponential convergence of trajectories with different initializations.

Theorem 1 (Theorem 2.1 from Tsukamoto et al. (2021)).

If there exists a uniformly positive definite matrix $M(\xi,t)\succ 0$ for all $\xi,t$ , such that the following condition holds for some $\alpha>0$ ,

\forall\xi,t:\quad\dot{M}+M\nabla_{\xi}h+\nabla_{\xi}h^{\top}M\preceq-2\alpha M,

(3)

then all trajectories of Eq. 2 converge to a single trajectory under the metric induced by $M$ exponentially fast regardless of their initial conditions, i.e. for all trajectories $\xi$ , $\xi^{\prime}$ of Eq. 2, $d(\xi(t),\xi^{\prime}(t))_{M}\leq d(\xi(0),\xi^{\prime}(0))_{M}e^{-\alpha t}$ , where $d(\cdot,\cdot)_{M}$ denotes the distance under the metric induced by $M$ . Dynamical system Eq. 2 satisfying Eq. 3 is said to be “contracting”, under the “contraction metric” induced by $M$ . The factor $\alpha$ is defined to be the “contraction factor”.

Using Theorem 1, we can also analyze trajectories of a perturbed dynamical system

\derivative{\xi}{t}=h(x,t)+b(x,t).

(4)

Let $\xi_{0}(t)$ , $\xi_{1}(t)$ be solutions of Eq. 2 and Eq. 4, respectively. The next theorem shows that for a contracting system, the solution of the perturbed system does not differ too much from that of the original system, under certain conditions.

Theorem 2 (Theorem 2.3 from Tsukamoto et al. (2021)).

Assume that the dynamical system Eq. 2 is contracting under $M$ with factor $\alpha$ . If $\overline{b}=\sup_{x,t}\norm{b(x,t)}$ and there exist constants $\underline{m},\overline{m}>0$ such that $\underline{m}I\preceq M(x,t)\preceq\overline{m}I$ for all $x,t$ , then we have

	$\displaystyle d(\xi_{1}(t),\xi_{0}(t))$	$\displaystyle\leq\frac{d(\xi_{1}(0),\xi_{0}(0))}{\sqrt{\underline{m}}}e^{-% \alpha t}+\frac{\overline{b}}{\alpha}\sqrt{\frac{\overline{m}}{\underline{m}}}% \left(1-e^{-\alpha t}\right),$
	$\displaystyle d(\xi_{1}(t),\xi_{0}(t))_{M}$	$\displaystyle\leq d(\xi_{1}(0),\xi_{0}(0))_{M}e^{-\alpha t}+\frac{\overline{b}% \sqrt{\overline{m}}}{\alpha}\left(1-e^{-\alpha t}\right),$

where $d(\cdot,\cdot)$ , $d(\cdot,\cdot)_{M}$ denote the distance under Euclidean metric and metric induced by $M$ respectively.

In short, for contracting systems, for large times $t$ , the bound of the distance between the solution of the original dynamic and the perturbed dynamic is determined by the perturbation of the system, the contraction factor, and the eigenvalues of the metric. In this paper, we will be interested in using these ideas to understand the difference between two trajectories evaluated on certain loss functions that are fitted using slightly different datasets.

Kozachkov et al. (2023) gave a bound on generalization gap using Theorem 2 by analyzing the difference of gradient flow trajectories trained on datasets with one replaced sample under the assumption that the dynamic is contracting uniformly on the state space with factor $\alpha$ . In Lemma 4, we will define another notion of contraction that does not require a uniform $\alpha$ , or the uniform boundedness of $b$ . This will enable a more refined analysis of the generalization gap.

3 Methods

We first show that the generalization gap can be approximated by the “averaged loss difference” $\bar{\Delta}_{n}(t)$ defined in Section 3.1. We will compute in Section 3.2 how $\bar{\Delta}_{n}(t)$ evolves over time, and show that its dynamics arises from a contraction factor $\bar{c}_{n}$ and a perturbation factor $\bar{\epsilon}_{n}$ . In Section 3.3, we show that a certain “residual” $\vec{r}_{n}(t)$ (precisely, the derivative of the loss with respect to the predictor for each datum) largely controls the perturbation factor. Through the evolution of $\bar{\Delta}_{n}(t)$ and $\vec{r}_{n}(t)$ , we can define an “effective Gram matrix” $K_{n}$ and a complexity measure in terms of a quadratic form $\vec{r}_{n}^{\top}K_{n}\vec{r}_{n}$ that characterizes the generalization gap at the end of training (Section 3.4). Proofs of all theorems and lemmas are deferred to Appendix A.

3.1 Approximation of the generalization gap

We first define two quantities pertaining to the difference between perturbed trajectories. Define the pointwise loss difference to be the difference of trajectories $w_{n}^{-i}(t)$ and $w_{n}(t)$ evaluated in terms of the loss $\ell(w,z_{i})$ ,

\Delta_{n}^{-i}(t)=\ell(w_{n}^{-i}(t),z_{i})-\ell(w_{n}(t),z_{i}),

and the averaged loss difference is defined to be

\bar{\Delta}_{n}(t)=\frac{1}{n}\sum_{i=1}^{n}\Delta_{n}^{-i}(t).

(5)

Note that in the averaged loss difference, we use the Leave-One-Out-Cross-Validation (LOOCV) loss as an estimate of the generalization loss. The following lemma shows how the expected generalization gap can be approximated by $\operatorname*{\mathbb{E}}\left[\bar{\Delta}_{n}\right]$ .

Lemma 3.

Assume that the expected generalization loss $\operatorname*{\mathbb{E}}\left[R(S_{n},t)\right]$ is non-increasing in $n$ , the expected training loss $\operatorname*{\mathbb{E}}\left[R_{\text{train}}(S_{n},t)\right]$ is non-decreasing in $n$ and the expected generalization gap $\operatorname*{\mathbb{E}}\left[\delta R(S_{n},t)\right]$ is non-negative for all $n,t$ . Then

\displaystyle\operatorname*{\mathbb{E}}\left[\delta R(S_{n},t)\right]\leq% \mathbb{E}\left[\bar{\Delta}_{n}(t)\right]\leq\operatorname*{\mathbb{E}}\left[% \delta R(S_{n-1},t)\right].

If we also have $\operatorname*{\mathbb{E}}[\delta R(S_{n},t)]/\operatorname*{\mathbb{E}}[% \delta R(S_{n-1},t)]\to 1$ as $n\to\infty$ , then,

\operatorname*{\mathbb{E}}\left[\delta R(S_{n},t)\right]=\operatorname*{% \mathbb{E}}\left[\bar{\Delta}_{n}(t)\right]+o\left(\operatorname*{\mathbb{E}}% \left[\delta R(S_{n},t)\right]\right).

The concentration of $\bar{\Delta}_{n}(t)$ to $\operatorname*{\mathbb{E}}\left[\bar{\Delta}_{n}(t)\right]$ can also be guaranteed if algorithm stability is assumed (see Lemma 22). Hence, the expected generalization gap $\operatorname*{\mathbb{E}}_{S_{n}}\left[\delta R(S_{n},t)\right]$ can be well approximated by the averaged loss difference $\bar{\Delta}_{n}(t)$ under certain conditions. See Section 3.1, Table S.1 for numerical results of generalization gap and averaged loss difference. We will next study the evolution of $\bar{\Delta}_{n}(t)$ .

3.2 Evolution of the averaged loss difference

The pointwise loss difference $\Delta_{n}^{-i}$ describes the difference of two trajectories with slightly perturbed drifts. By deriving differential equations for the evolution of $\Delta_{n}^{-i}$ and $\bar{\Delta}_{n}$ , we analyze the contraction and perturbation of the trajectories in a way that is non-uniform in both time and space, distinguishing it from classical contraction theory. We first give the following lemma for $\Delta_{n}^{-i}$ .

Lemma 4.

For loss functions $\ell(w,z)$ that is differentiable in $w$ for all $z$ ,

\displaystyle\derivative{\Delta^{-i}_{n}(t)}{t}=-c_{n}^{-i}(t)\Delta_{n}^{-i}(% t)+\epsilon_{n}^{-i}(t),

where the pointwise contraction factor $c_{n}^{-i}(t)$ is given by

\displaystyle c_{n}^{-i}(t)=\frac{\nabla\ell(w,z_{i})\cdot\nabla\bar{\ell}(w,S% _{n}^{-i})\big{|}^{w_{n}^{-i}(t)}_{w_{n}(t)}}{\Delta_{n}^{-i}(t)},

and the pointwise perturbation factor $\epsilon_{n}^{-i}(t)$ is given by

\displaystyle\epsilon_{n}^{-i}(t)=\nabla\ell(w,z_{i})\cdot\left(\nabla\bar{% \ell}(w,S_{n})-\nabla\bar{\ell}(w,S_{n}^{-i})\right)\bigg{|}_{w_{n}(t)}.

We should note that the lemma can be extended to any piecewise differentiable loss if we define the the gradient $\nabla\ell(w,z)$ at the non-differentiable point to be any constant vector with bounded norm. This covers all of the commonly used architectures and activation functions. To give some intuition of the lemma, the contraction factor $c_{n}^{-i}(t)$ represents a force that pulls two trajectories with the same drift but different values at time $t$ closer together under the loss function $\ell(w,z_{i})$ , while the perturbation factor quantifies the differences between the two trajectories at time $t$ induced by the gradient divergence $\nabla\bar{\ell}(w,S_{n})-\nabla\bar{\ell}(w,S_{n}^{-i})$ .

Remark 5 (Deviations from classical contraction theory).

In classical contraction theory, $\alpha$ and $\overline{b}$ provide a uniform contraction rate and perturbation magnitude over time and trajectories (Theorems 1 and 2). In comparison, $c_{n}^{-i}(t)$ and $\epsilon_{n}^{-i}(t)$ in Lemma 4 are derived directly from the evolution of $\bar{\Delta}_{n}^{-i}(t)$ , which describes only the contraction and perturbation of $w_{n}(t)$ and $w_{n}^{-i}(t)$ , and vary with time. This non-uniformity allows for a more refined analysis of gradient flow for neural networks. Indeed, the energy landscape may not be uniformly good in the entire weight space, but it could be benign along most of the training trajectory. Our development in this paper from here on will therefore diverge significantly from the generalization bounds derived under uniform assumptions on the energy landscape (Kozachkov et al., 2023; Charles and Papailiopoulos, 2018).

By taking the average over the numerator and denominator of $c_{n}^{-i}(t)$ , and averaging over $\epsilon_{n}^{-i}(t)$ in Lemma 4, we obtain the following equation for the averaged loss difference $\bar{\Delta}_{n}(t)$ :

\displaystyle\derivative{\bar{\Delta}_{n}(t)}{t}=-\bar{c}_{n}(t)\bar{\Delta}_{% n}(t)+\bar{\epsilon}_{n}(t).

(6)

The solution of this differential equation can be written in the integral form as

\displaystyle\bar{\Delta}_{n}(t)=\int_{0}^{t}\bar{\epsilon}_{n}(s)\exp\left(% \int_{s}^{t}-\bar{c}_{n}(u)\differential{u}\right)\differential{s}

(7)

with the assumption that $w_{n}(0)=w_{n}^{-i}(0)$ for all $i$ . Here $\bar{c}_{n}(t)$ is defined to be the averaged contraction factor

\displaystyle\bar{c}_{n}(t)=\frac{\frac{1}{n}\sum_{i=1}^{n}\nabla\ell(w,z_{i})% \cdot\nabla\bar{\ell}(w,S_{n}^{-i})\big{|}_{w_{n}(t)}^{w_{n}^{-i}(t)}}{\bar{% \Delta}_{n}(t)},

(8)

and $\bar{\epsilon}_{n}(t)$ is the averaged perturbation factor

\displaystyle\bar{\epsilon}_{n}(t)=\frac{\tr\hat{\Sigma}_{n}(t)}{n-1},\quad% \hat{\Sigma}_{n}(t)=\operatorname*{\text{Cov}}_{z\sim\text{Unif}(S_{n})}\nabla% \ell(w_{n}(t),z),

(9)

where $\hat{\Sigma}_{n}(t)$ represents the covariance matrix of $\nabla\ell(w_{n}(t),z)$ for $z$ sampled uniformly from the dataset $S_{n}$ . We should note that $\bar{\epsilon}_{n}(t)$ is a statistic that depends only on the training samples, while $\bar{c}_{n}(t)$ depends on both training samples and the held-out test samples. Note that by taking the expectation over $\bar{\epsilon}_{n}$ , and the numerator and denominator of $\bar{c}_{n}$ , we get the evolution of $\operatorname*{\mathbb{E}}\left[\bar{\Delta}_{n}(t)\right]$ , which represents the generalization gap better. See Section A.3 for details.

Remark 6 (Classical contraction theory with uniform bounds on contraction and perturbation).

With uniform guarantees $\bar{\epsilon}_{n}(t)\leq\epsilon^{*}$ and $\bar{c}_{n}(t)\geq c^{*}$ for all $t$ for some positive $\epsilon^{*},c^{*}$ , we can solve Eq. 7 to see that

\bar{\Delta}_{n}(t)\leq\frac{\epsilon^{*}}{c^{*}}\left(1-\exp(-c^{*}t)\right),

which derives similar bound as in Theorem 2.

Remark 7 (Comparing trajectories in terms of their loss vs. weight space difference).

Richards and Kuzborskij (2021); Akbari et al. (2021) analyze the difference of algorithm in the weight space when one sample is replaced, and derive generalization bound using the Lipchitz assumption of the loss function. However, in most cases, a uniform Lipchitz constant is far from good for most part of the weight space for deep networks. In such cases, the difference in the weights does not provide a tight estimate of the difference of the predictions—and this is the key reason for loose generalization bounds from this kind of analysis. By comparing the difference of $w_{n}^{-i}(t)$ and $w_{n}(t)$ directly in terms of the loss $\ell(w,z_{i})$ in Lemma 4, instead of the weight space (with or without a modified Euclidean metric) as used in Theorem 1, in this paper, we can achieve a tighter estimate of the evolution of the generalization gap.

Remark 8 (Relationship to information theoretic generalization bounds).

In Negrea et al. (2019); Neu et al. (2021); Banerjee et al. (2022), the authors derive generalization bounds controlled by the sum of trace of the gradient covariance along the training trajectory, $\sum_{t}\tr\hat{\Sigma}_{n}(t)$ . Intuitively, this summation tells us about the size of the tube of trajectories in loss space that arises from training on different datasets. The worse the estimate of this tube, the worse the generalization bound. Our expression in Eq. 7 provides a more general and tighter formulation, where the damping factor $\exp\left(\int_{s}^{t}-\bar{c}_{n}(u)du\right)$ corrects for the size of this tube. A positive contraction factor leads to quicker shrinking of the two trajectories that are being trained on slightly different datasets. Furthermore, our analysis applies to deterministic algorithms, unlike previous works on information-theoretic bounds (Xu and Raginsky, 2017; Mou et al., 2018; Futami and Fujisawa, 2023), which holds only for randomized algorithms because the proof relies on the non-expansiveness of the Kullback-Leibler divergence of non-singular distributions.

Remark 9 (Some intuition on the contraction factor).

If we expand the contraction factor in Lemma 4 using the first-order Taylor expansion in both its numerator and denominator, and by approximating $\ell(w,S_{(m)})$ and $\bar{\ell}(w,S_{n}^{(m)})$ by the loss on the full dataset $\bar{\ell}(w,S_{n})$ , we see that

(10)

This is a general version of Rayleigh quotient $x^{\top}Ay/x^{\top}y$ , where $x\equiv\nabla\bar{\ell}(w_{n}(t),S_{n})$ , $y\equiv\mathbb{E}_{(m)}\left[w_{n}^{-(m)}(t)-w_{n}(t)\right]$ , and $A\equiv\nabla^{2}\bar{\ell}(w_{n}(t),S_{n})$ . Intuitively, positive contraction implies that the Hessian does not change the cosine angle of the gradient and the averaged difference of trajectories. Fig. 1 compares the true contraction factor and the full-gradient approximation Eq. 10. We can see the approximated contraction factor is positive (which indicates contractive dynamics) and that it is also close to the true contraction factor. This suggests that the Hessian of the training loss is positive definite along the directions of gradient and the averaged difference of trajectories, for most of the training time. See Section 4 for the batch version of the contraction factor $\bar{c}_{n}(t)$ , and Section A.8 for the detailed calculations of Eq. 10.

Refer to caption — Figure 1: The contraction factor calculated through its analytical expression in Eq. 8 (orange) compared to its approximation using Eq. 10 (blue) for FC trained on MNIST with two selected classes, $n=1000$ , $m=100$ .

3.3 Evolution of the residual and perturbation

In Section 3.2, we have shown that the evolution of $\bar{\Delta}_{n}$ is controlled by the averaged perturbation $\bar{\epsilon}_{n}$ and the averaged contraction $\bar{c}_{n}$ , and that $\bar{\epsilon}_{n}$ is closely related to the trace of the covariance of gradients. We will now introduce the notion of a “residual”. We further show how it relates to the perturbation factor, from which we derive the evolution of $\bar{\epsilon}_{n}(t)$ .

Let $r(w,z)=\derivative{\ell(w,z)}{f(w,z)}\in\mathcal{Y}$ denote the gradient of the loss function with respect to the predictor $f$ . Let $r_{i}(t)\equiv r(w_{n}(t),z_{i})$ denote predictor gradient on $z_{i}$ , evaluated on weight $w_{n}(t)$ . We define the residual on dataset $S_{n}$ at time $t$ to be

\vec{r}_{n}(t)=\frac{1}{\sqrt{n}}[r_{1}(t),\dots,r_{n}(t)]^{\top}\in\mathcal{Y% }^{n}.

(11)

The residual is the collection of loss-predictor gradients on the dataset $S_{n}$ . It effectively describes the quality of the weights at time $t$ and indicates the direction of the training progresses in the predictor space. Intuitively, it represents the part of the “task” that remains to be fitted at time $t$ . As a special case, if we consider the squared loss $\ell(y,y^{\prime})=\frac{1}{2}(y-y^{\prime})^{2}$ for $y,y^{\prime}\in\mathbb{R}$ , the residual is the normalized displacement vector from the predictor to the target, i.e., $\vec{r}_{n}(t)=\frac{1}{\sqrt{n}}\left(\vec{f}-\vec{y}\right)$ , where $\vec{y}\equiv[y_{1},\dots,y_{n}]^{\top}$ , and $\vec{f}\equiv[f(w_{n}(t),x_{1}),\dots,f(w_{n}(t),x_{n})]^{\top}$ . If we initialize the predictor such that $f(w_{n}(0),x_{i})=0$ for all $i\in[n]$ , then the $\ell_{2}$ -norm of the residual $\norm{\vec{r}_{n}(0)}_{2}$ is the largest at initialization and vanishes at interpolation following the gradient flow Eq. 1. The definition of residual generalizes the displacement vector in the squared loss case, and can be applied to any loss function with global minimum 0. The factor $1/\sqrt{n}$ will be justified in Remark 20.

The evolution of $\vec{r}_{n}(t)$ is governed by the following equation derived from gradient flow in Eq. 1.

	$\displaystyle\derivative{\vec{r}_{n}(t)}{t}$	$\displaystyle=-\frac{1}{n}P_{n}(t)\vec{r}_{n}(t),$		(12)
	$\displaystyle P_{n}(t)$	$\displaystyle=\left[\nabla r(w_{n}(t),z_{i})^{\top}\nabla f(w_{n}(t),x_{j})% \right]_{i,j\in[n]}.$		(12)

This is a linear time-varying ordinary differential equation. In general, its solution can be written as

\vec{r}_{n}(t)=\Omega_{n}(t_{0},t)\ \vec{r}_{n}(t_{0}),

(13)

where $\Omega_{n}(t_{0},t)$ is called the propagator. The numerical approximation of $\Omega_{n}(t_{0},t)$ will be discussed in Section B.4.

Our next goal will be to show that the averaged perturbation factor $\bar{\epsilon}_{n}$ is controlled by the residual. We will do so using the following lemma.

Lemma 10.

The trace of the gradient covariance $\hat{\Sigma}_{n}(t)$ can be decomposed in terms of two matrices $M_{n},H_{n}\in\mathcal{Y}^{n}\times\mathcal{Y}^{n}$ as

\displaystyle\tr\hat{\Sigma}_{n}(t)=\vec{r}_{n}(t)^{\top}\left(M_{n}(t)-\frac{% H_{n}(t)}{n}\right)\vec{r}_{n}(t),

(14)

	$\displaystyle M_{n}(t)$	$\displaystyle=\scalebox{1.0}{\mbox{$\displaystyle\operatorname{\text{diag}}% \left(\nabla f(w,x_{1})^{\top}\nabla f(w,x_{1}),\dots,\nabla f(w,x_{n})^{\top}% \nabla f(w,x_{n})\right)\bigg{\|}_{w_{n}(t)}$}},$		(15)
	$\displaystyle H_{n}(t)$	$\displaystyle=\left[\nabla f(w_{n}(t),x_{i})^{\top}\nabla f(w_{n}(t),x_{j})% \right]_{i,j\in[n]}.$		(15)

Remark 11.

Let us emphasize that all three quantities, $P_{n}(t)$ , $M_{n}(t)$ and $H_{n}(t)$ are elements of $\mathcal{Y}^{n}\times\mathcal{Y}^{n}$ . For regression problems, we might have $\mathcal{Y}\subseteq\mathbb{R}$ in which case they are simply matrices in $\mathbb{R}^{n\times n}$ . For classification problems with $C$ categories, $\mathcal{Y}\subset\mathbb{R}^{C}$ , and therefore these three quantities are four-dimensional tensors. But we can interpret them as elements of $\mathbb{R}^{nC\times nC}$ . This amounts to vectorizing the tensor as a matrix. Just like a matrix may be reshaped into a vector, we have reshaped a tensor into a matrix.

Equation Eq. 13 and the above lemma together give

\tr\hat{\Sigma}_{n}(t)=\vec{r}_{n}(0)^{\top}\Omega_{n}(t)^{\top}\left(M_{n}(t)% -\frac{H_{n}(t)}{n}\right)\Omega_{n}(t)\vec{r}_{n}(0),

(16)

where we denote $\Omega(0,t)$ as $\Omega(t)$ for short. In Eq. 14, the term $M_{n}-H_{n}/n$ pertains to the covariance of the predictor on the training dataset $S_{n}$ . We can see that the residual $\vec{r}_{n}(t)$ controls the magnitude of gradients $\nabla\ell(w,z_{i})$ for $i\in[n]$ hence that of the covariance. For networks that train quickly, the residual norm $\norm{\vec{r}_{n}(t)}_{2}$ vanishes quickly, leading to a smaller accumulation of the perturbation term $\bar{\epsilon}_{n}$ above, and hence a smaller generalization gap — this explains the folklore theorem “networks that generalize well also train quickly”.

Remark 12 (Relationship to weight norm based bound).

Arora et al. (2019) also study the residual dynamics to obtain an estimate of the norm of eventual weights that can be reached by gradient descent. The authors derive a weight-norm-based bound that uses the results of Bartlett et al. (2017) on Rademacher complexity. Similar ideas are adpoted in Allen-Zhu et al. (2019); Cao and Gu (2019), analyzing SGD and online learning of fully connected neural nets respectively. Liu et al. (2022) derives weight norm bound under uniform-LGI conditions for general optimization problems. In contrast, we use the evolution of the residual in Eq. 12 to calculate the trace of the gradient covariance. This is directly related to the difference of the loss of networks trained on perturbed datasets—as opposed to the difference in their weights. Our analysis is conducted directly in the prediction space and provides a more direct and refined characterization of generalization for general neural networks.

3.4 Effective Gram matrix for neural networks

We next derive an expression of averaged loss difference $\bar{\Delta}_{n}(t)$ in terms of a certain quadratic form of the initial residual and an “effective Gram matrix”, by analyzing the evolution of $\bar{\Delta}_{n}(t)$ and $\vec{r}_{n}(t)$ during training. The following theorem combines the solution of $\bar{\Delta}_{n}(t)$ in Eq. 7 and $\vec{r}_{n}(t)$ in Eq. 13, along with the decomposition of $\hat{\Sigma}_{n}(t)$ in Eq. 14.

Theorem 13.

Assume that the evolution of $w_{n}(t)$ and $w_{n}^{-i}(t)$ follows Eq. 1 and the loss function $\ell(w,z)$ is smooth in $w$ for every $z\in\mathcal{Z}$ . We have

\displaystyle\bar{\Delta}_{n}(t)=\vec{r}_{n}(0)^{\top}K_{n}(0,t)\vec{r}_{n}(0),

(17)

where

\displaystyle K_{n}(0,t)=\frac{\int_{0}^{t}\Omega_{n}(s)^{\top}\left(M_{n}(s)-% \frac{H_{n}(s)}{n}\right)\Omega_{n}(s)\exp\left(-\int_{s}^{t}\bar{c}_{n}(u)% \differential u\right)\differential s}{n-1}

(18)

is positive semi-definite. Let

K_{n}\triangleq\lim_{t\to\infty}K_{n}(0,t)

when the limit exists, then we have

\displaystyle\bar{\Delta}_{n}(\infty)\triangleq\lim_{t\to\infty}\bar{\Delta}_{% n}(t)=\vec{r}_{n}(0)^{\top}K_{n}\vec{r}_{n}(0).

We call $K_{n}$ the effective Gram matrix of a neural network.

We call $K_{n}(0,t)$ the “effective Gram matrix” because it is a weighted average of Gram matrices ²²2In linear algebra, the Gram matrix of a set of vectors $v_{1},\dots,v_{n}$ is given by $V^{\top}V$ , where $v_{1},\dots,v_{n}$ are columns of matrix $V$ . of the form $V^{\top}V$ , where $V=\sqrt{\frac{M(s)-H(s)/n}{n-1}}\Omega_{n}(s)$ . We next show the conditions that guarantee the existence of $\lim_{t\to\infty}K_{n}(0,t)$ .

Lemma 14.

Let $m(t)=\frac{1}{n-1}\norm{M_{n}(t)-\frac{H_{n}(t)}{n}}_{2}$ and $\omega(t)=\exp\left(-\frac{2}{n}\int_{0}^{t}\lambda_{\min}(s)\differential s\right)$ . Let $\lambda_{\max}(t)$ and $\lambda_{\min}(t)$ be the largest and smallest eigenvalues of $(P_{n}(t)+P_{n}(t)^{\top})/2$ respectively. If

(i)

$\lim_{t\to\infty}\int_{0}^{t}\omega(s)\ m(s)\differential s$ exists,
(ii)

there exists a constant $B>0$ such that $\absolutevalue{\omega(t)m(t)}\leq B$ for all $t$ , and
(iii)

the contraction factor $\bar{c}_{n}(t)\geq 0$ for all $t\geq 0$ ,

then $\lim_{t\to\infty}K_{n}(0,t)$ exists in $\ell_{2}$ -norm.

Remark 15.

Sometimes the effective Gram matrix calculated from the propagator derived from $P_{n}(t)$ does not converge. But in this case, we can create a perturbed version of $P_{n}(t)$ with a controlled $\lambda_{\min}(t)$ such that the conditions of Lemma 14 are satisfied. This guarantees the convergence of $\lim_{t\to\infty}K_{n}(0,t)$ while preserving the trajectory of $\vec{r}_{n}(t)$ given $\vec{r}_{0}(t)$ . For example, in Section 3.5 we construct $P_{n}^{\varepsilon}(t)$ as a perturbed version of $P_{n}(t)$ .

We should note that to analyze generalization gap via the relation between the residual and the effective Gram matrix meaningfully, $K_{n}(0,t)$ should corresponding to a trajectory that fits the data by time $t$ . This is true only when $\bar{\ell}(w_{n}(t),S_{n})=0$ , which by Eq. 12 also implies that $\bar{\ell}(w_{n}(t^{\prime}),S_{n})=0$ for all $t^{\prime}\geq t$ . Hence we only consider $K_{n}=\lim_{t\to\infty}K_{n}(0,t)$ in the interpolating regime where $\lim_{t\to\infty}\bar{\ell}(w_{n}(t),S_{n})=0$ , instead of a finite time $K_{n}(0,t)$ . This idea is also reflected in Arora et al. (2019), where the authors consider the NTK regime for infinite time, in which case, the training data is fitted perfectly.

Remark 16 (Data and architecture dependent generalization bound).

The quadratic form $\vec{r}_{n}(0)^{\top}K_{n}\vec{r}_{n}(0)$ in Theorem 13 gives a data and architecture dependent measure of complexity that characterizes the generalization gap of general deep neural networks. We will also see in the experimental section this faithfully captures the true generalization gap. Eigenvalues of $K_{n}$ represent the relative contribution to the generalization gap accumulated in the different subspaces during training. If the initial residual (roughly, the distance to the target) predominantly projects onto the subspace of $K_{n}$ with small eigenvalues, the training process is benign, resulting in a small eventual generalization gap (as showed in Section 4.2). This is therefore one of the key quantities that we will track in numerical experiments on different architectures and datasets in Section 4.

Remark 17 (Generalization gap of kernel machines).

The generalization loss in kernel ridge regression (Rakhlin and Liang, 2020; Mallinar et al., 2022) can be expressed in terms of quantities that resemble ours, namely, the alignment of the residuals with the Gram matrix $r(0)^{\top}Kr(0)$ (Arora et al., 2019; Jacot et al., 2020). Our effective Gram matrix generalizes this type of complexity measure to arbitrary deep neural networks and loss functions, going beyond two-layer neural networks with infinite neurons and squared loss. However, unlike kernel ridge regression, where the Gram matrix is derived from a fixed kernel that directly recovers the target function, the effective Gram matrix $K_{n}$ in our setting varies for different datasets and training regimes and does not necessarily coincide with any fixed kernel.

3.5 An example calculation of the (effective) Gram matrix for linear regression

Assume that the sample space is supported on two points with orthonormal inputs, i.e., $\mathcal{Z}=\{(x_{1},y_{1}),(x_{2},y_{2})\}$ , with orthogonal inputs $x_{1}^{\top}x_{2}=0$ each with unit norm, $\norm{x_{1}}_{2}=\norm{x_{2}}_{2}=1$ . We choose the predictor to be $f(w,x)=w^{\top}x$ , and the loss function to be $\ell(y^{\prime},y)=(y^{\prime}-y)^{2}/2$ . We therefore have $\ell(w,y)=(w^{\top}x-y)^{2}/2$ . Consider the dataset $S_{n}$ with $n$ even, where $z_{i}=(x_{1},y_{1})$ when $i\leq n/2$ and $z_{i}=(x_{2},y_{2})$ when $i>n/2$ . Assume that $w_{n}(0)=w^{-i}_{n}(0)=\vec{0}$ for all $i$ which ensures that the initial residual is simply the vector of ground-truth targets $\vec{r}_{n}(0)=y$ . The averaged contraction factor in Eq. 8 is

\displaystyle\bar{c}_{n}(t)=\bar{c}:=\frac{n-2}{2(n-1)}.

and we have from Eq. 12 and Eq. 15 that

\displaystyle M_{n}(t)=I_{n},\quad H_{n}(t)=P_{n}(t)=\operatorname{\text{diag}% }\left(\vec{1}\vec{1}^{\top},\vec{1}\vec{1}^{\top}\right),

with $\vec{1}=[1,\dots,1]\in\mathbb{R}^{n/2}$ . Note that $P_{n}(t)$ is not full rank when $n>2$ . By Lemma 14, the convergence of $\lim_{t\to\infty}K_{n}(0,t)$ is largely controlled by the smallest eigenvalue of $P_{n}(t)$ , which cannot be too small. Hence, to ensure convergence, we define a modified version of $P_{n}(t)$ with small perturbation $\varepsilon(t)$ on its singular subspace, i.e., $P_{n}^{\varepsilon}(t)=U\Lambda^{\varepsilon}U^{\top}$ , with $\Lambda^{\varepsilon}=\operatorname{\text{diag}}\left(n/2,n/2,n\varepsilon(t)/% 2,\dots,n\varepsilon(t)/2\right)$ , $U=[u_{1},\dots u_{n}]$ , where

\displaystyle u_{1}=\sqrt{\frac{2}{n}}\ [1,\dots,1,0,\dots,0],\quad u_{2}=% \sqrt{\frac{2}{n}}\ [0,\dots,0,1,\dots,1].

In this case, when $\varepsilon(t)\equiv 0$ , we have $P_{n}^{\varepsilon}(t)\equiv P_{n}(t)$ . The dynamics $\differential\vec{r}_{n}(t)\differential t=-P^{\varepsilon}_{n}(t)\vec{r}_{n}(% t)/n$ gives the same trajectory of $\vec{r}_{n}(t)$ as Eq. 12, since $\vec{r}_{n}(0)=[y_{1},\dots,y_{1},y_{2},\dots,y_{2}]\in\text{span}(u_{1},u_{2})$ . By setting $\varepsilon(t)=\bar{\varepsilon}\left(1_{[0,1]}(t)+1_{[1,\infty]}(t)/t^{2}\right)$ with $\bar{\varepsilon}\ll 1$ , the effective Gram matrix $K_{n}(0,t)$ can be calculated from Eq. 18 as

\displaystyle K_{n}(0,t)=U\Lambda^{K}(t)U^{\top}

where $\Lambda^{K}(t)=[\lambda^{K}_{1}(t),\dots,\lambda^{K}_{n}(t)]$ , and

\displaystyle\lambda^{K}_{1}(t)=\lambda^{K}_{2}(t)=\Theta(\exp(-\bar{c}t)),% \quad\lambda^{K}_{3}(t)=\dots=\lambda^{K}_{n}(t)=\Theta(1),

indicating that the initial residual $\vec{r}_{n}(0)$ lies in the subspace of $K_{n}=\lim_{t\to\infty}K_{n}(t)$ with zero eigenvalue.

Note that since $\mathcal{Z}$ is supported on only two points, gradient flow $w_{n}(t)$ trained on a dataset containing both these samples generalizes and achieves zero loss for any data distribution $D$ supported on $\mathcal{Z}$ . This coincides with the calculation above, where the averaged loss difference $\bar{\Delta}_{n}(t)$ as predicted by the quadratic form $y^{\top}K_{n}y$ in Theorem 13, approaches zero as $t\to\infty$ . The calculation holds regardless of what fraction of data in $S_{n}$ comes from either of the two points (so long as both are present). Our theorem correctly predicts that the generalization gap goes to zero. Now if we take an expectation, we have

\mathbb{E}[y^{\top}K_{n}y]=\mathbb{E}[\bar{\Delta}_{n}(t)]=\Theta(2^{-n})

because the dataset $S_{n}$ is supported on only one of the samples with probability $2^{-(n-1)}$ . Theorem 13 is therefore providing a tight prediction of the generalization gap.

The solution $w_{n}(t)$ lies in $\text{span}(x_{1},x_{2})$ . When trained on the dataset $S_{n}^{-i}$ , the progress on the direction $x_{\lceil 2i/n\rceil}$ is slightly less than the other direction, which introduces the non-zero averaged loss difference $\bar{\Delta}_{n}(t)$ during training. We should also note that the calculation of the contraction and perturbation factors depends heavily on the sample-wise loss gradient $\nabla\ell(w,z_{i})$ being supported on $\{x_{1},x_{2}\}$ . The clustering of per-sample gradients happens also in the training of neural networks, as shown in Fort and Ganguli (2019). See Section A.7 for details of the above calculation.

Remark 18 (Comparison with Arora et al. (2019)).

Let us use the technique of Arora et al. (2019) for our example. We can bound the generalization gap in terms of the norm of the eventual weights. The Gram matrix of the linear regression described above is $H^{\varepsilon}=P_{n}^{\varepsilon}(t)$ with $\varepsilon(t)=\bar{\varepsilon}$ for some constant $\bar{\varepsilon}\ll 1$ (we choose this perturbed version to guarantee the positive definiteness while not affecting the evolution of the residual). The norm of weights can be bounded by $\sqrt{y^{\top}(H^{\varepsilon})^{-1}y}$ , which gives a generalization bound for 1-Lipschitz loss,

\sqrt{\frac{2y^{\top}(H^{\varepsilon})^{-1}y}{n}}=\sqrt{\frac{2(y_{1}^{2}+y_{2% }^{2})}{n}}.

This is far looser than the actual generalization error, which is $\Theta(2^{-n})$ for 1-Lipschitz loss from the calculation above. The key point to emphasize here is that by characterizing the evolution of the point-wise loss difference using the contraction factor, we can work directly in the prediction space instead of working in the weight space. This is the reason why our estimate of the generalization gap is more accurate.

4 Experimental Validation

Datasets

For experimental validation of our theoretical development, we use a number of different datasets and experimental settings.

•

MNIST (LeCun et al., 1998) and CIFAR-10 (Krizhevsky, 2009) classification datasets, both with 10 categories.
•

Synthetic datasets labeled Syn- $(a,b)$ are created by modifying the labeling function of the MNIST dataset as follows. We first project MNIST images onto the subspace of the empirical second moment matrix corresponding to the $a$ -th to $b$ -th eigenvalue, sorted from the largest to the smallest. We then relabel the MNIST inputs using a fully-connected teacher with random weights, applied to the projected images.
•

Synthetic datasets labeled Gaussian- $\alpha$ are created using Gaussian data from different covariance matrices. Inputs data in Gaussian- $\alpha$ are sampled from the multivariate Gaussian distributions with covariance matrix $A$ , where the $i$ -th eigenvalue is $\exp(-\alpha i)$ . We then project these inputs onto the subspace of the covariance matrix corresponding to the 10 largest eigenvalues and label the original inputs using a fully connected teacher with random weights, applied to the projected inputs.

The rationale for creating these synthetic datasets will become clear as we discuss the experiments, but in short, we seek to create datasets where the signal-to-noise ratio can be controlled. The smaller the value of $a$ and the larger the value of $\alpha$ , the larger the signal-to-noise ratio in these synthetic data.

Architectures

We will train FC (fully connected neural networks), LeNet-5 (LeCun et al., 1998) (a network with two convolutional layers and one fully connected layer), and WRN-4-4 (Zagoruyko and Komodakis, 2016) (wide residual network with 4 layers and a widening factor of 4) using (non-stochastic) gradient descent with different numbers of samples drawn from the datasets described above. We use gradient descent as the Euler approximation of gradient flow in our theory. See Appendix B for more details.

Constructing perturbed datasets

The theory in this paper was written when the modified dataset $S_{n}^{-i}$ with $n-1$ samples is created by omitting the $i$ -th sample. For numerical stability and efficiency of the approximation, in the experiments, we create datasets by omitting a batch of $m$ samples. Let $(m)$ denote a subset of $[n]$ with size $m$ . Let $S_{(m)}=\{z_{i}=(x_{i},y_{i})_{i\in(m)}\}$ . We will conduct experiments using modified datasets $S_{n}^{-(m)}=S_{n}\setminus S_{(m)}$ obtained by removing $S_{(m)}$ from $S_{n}$ . We therefore consider the weight trajectory $w_{n}^{-(m)}(t)$ of the differential equation $\dot{w}=-\nabla\bar{\ell}(w,S^{-(m)}_{n})$ . The averaged loss difference is modified in the usual fashion $\bar{\Delta}_{n}(t)=\operatorname*{\mathbb{E}}_{(m)}\left[\Delta_{n}^{-(m)}\right]$ with the batch-wise loss difference

\Delta_{n}^{-(m)}(t)=\ell(w_{n}^{-(m)}(t),S_{(m)})-\ell(w_{n}(t),S_{(m)}).

Note that $\operatorname*{\mathbb{E}}_{(m)}$ denotes the expectation taken over the uniform distribution on all possible choices of $(m)$ in $[n]$ . The formulae for the averaged contraction and perturbation factors $\bar{c}_{n}(t)$ and $\bar{\epsilon}_{n}(t)$ in this setting are shown in Section B.1.

Calculating quantities that pertain to the generalization gap

We are interested in calculating the effective Gram matrix $K_{n}$ for different configurations of neural network training. To do this, we approximate the gradient flow Eq. 1 by gradient descent with different learning rate. We calculate the averaged contraction factor $\bar{c}_{n}(t)$ , averaged perturbation factor $\bar{\epsilon}_{n}(t)$ , decomposition of the trace of the gradient covariance $M_{n}(t)$ and $H_{n}(t)$ using Eq. 8, Eq. 9 and Eq. 15 (or its alternatives for omitting $m$ samples setting as in Appendix B) respectively. The propagator $\Omega_{n}(t)$ is approximated by product methods as described in Section B.4, where we compared it with the Magnus expansion approximation. The integrals for $\bar{\Delta}_{n}(t)$ in Eq. 6 and $K_{n}$ in Eq. 18 are approximated by the trapezoidal method. We will use the statistics in Table 1 to characterize the relation of the initial residual $\vec{r}_{n}(0)$ and the effective Gram matrix $K_{n}$ . The rationale for defining these quantities comes from Theorem 13 which shows that the eventual generalization gap after training is a quadratic form that depends upon the effective Gram matrix and the initial residual. We are interested in understanding how different subspaces of the effective Gram matrix contribute to this quadratic form.

Notation	Definition
$E(K),\sigma(K)$	The eigenspace and eigenspectrum of a symmetric matrix $K$ with eigenvalue decomposition
	where $K=E(K)\operatorname{\text{diag}}(\sigma(K))E(K)^{\top}$ , $\sigma(K)$ is the vector of eigenspectrum in ascending order.
$\bar{\sigma}(K)$	The mean of the eigenspectrum of a symmetric matrix $K$ , $\bar{\sigma}(K)=\sum_{i}\sigma(K)_{i}/n$ .
$U_{k},U_{1:k}$	The $k$ -th column of $U$ , and the first $k$ columns of $U$ .
$P(r,U)$	Normalized projection of a vector $r$ onto the space $U$ (with orthonormal columns).
	the $k$ -th element $P(r,U)_{k}=\absolutevalue{r^{\top}U_{k}}/\norm{r}_{2}$ .
$M(r,U)$	“Explained magnitude” of a vector $r$ in the space $U_{1:k}$ (with orthonormal columns)
	the $k$ -th element $M(r,U)_{k}=\norm{r^{\top}U_{1:k}}^{2}_{2}/\norm{r}_{2}^{2}$ .
$M(K)$	“Explained magnitude” of a symmetric matrix $K$ in its eigenspace $E(K)$
	the $k$ -th element $M(K)_{k}=\sum_{i=1}^{k}\sigma(K)_{i}/\sum_{i=1}^{n}\sigma(K)_{i}$ for $K\in\mathbb{R}^{n\times n}$ .
$R(\text{Idx})$	“Relative index” of the index vector $\text{Idx}=[1,2,\dots,l]$ , where $R(\text{Idx})=[1/n,2/n,\dots,1]$ .

Table 1: Statistics characterizing the initial residual and effective Gram matrix.

4.1 Theorem 13 leads to a good approximation of the generalization gap

Consider Fig. 2. Observe that the true generalization gap $\delta R(S_{n},t)$ , averaged loss difference $\bar{\Delta}_{n}(t)$ , and the gap $\operatorname*{\mathbb{E}}_{(m)}\left[\delta R(S_{n}^{-(m)},t)\right]$ (denoted by $\delta\bar{R}(\cdot)$ in the plot) are all close to each other throughout training. This indicates that the generalization gap can be well approximated by $\bar{\Delta}_{n}(t)$ .

We calculate two numerical approximations of $\bar{\Delta}_{n}(t)$ : the quantity $\bar{\Delta}_{n}(c,\epsilon,t)$ computed with the true perturbation factor from Eq. 9, and $\bar{\Delta}_{n}(c,\hat{\epsilon},t)$ with an approximate perturbation factor derived from Eq. 16 with the propagator given by the product approximation Eq. 21. First note that $\bar{\Delta}_{n}(c,\epsilon,t)$ is close to $\bar{\Delta}_{n}(t)$ , which indicates that the gradient descent approximation of Eq. 1 and the trapezoidal approximation of Eq. 7 are good. Second, the similarity of $\bar{\Delta}_{n}(c,\hat{\epsilon},t)$ and $\bar{\Delta}_{n}(c,\epsilon,t)$ indicates that the product approximation in Eq. 21 is working well. The results on CIFAR-10 using a convolutional network are largely similar, with slightly less accurate estimates of the generalization gap.

Note that $\bar{\Delta}_{n}(c,\hat{\epsilon},t)=\vec{r}_{n}(0)^{\top}K_{n}(0,t)\vec{r}_{n% }(0)$ for numerically approximated effective Gram matrix $K_{n}(0,t)$ . Hence, the good approximation of the generalization gap by $\bar{\Delta}_{n}(c,\hat{\epsilon},t)$ indicates that the numerically approximated effective Gram matrix $K_{n}$ (which is calculated in the following sections) is a good quantity to use for understanding generalization. We should note that $K_{n}$ refers to the numerical approximations in the following subsections. In Table S.1, we provide a complete list of results of generalization gap approximation for all the experiments.

Paper	Architecture	Dataset	# samples	Training method	Bound on Test Data	Actual Value	Relative inaccuracy
Arora et al. (2019)	FC	MNIST-2	10,000	GD (second layer fixed)	0.05 ( $\ell 1$ loss)	$<$ 0.01 ( $\ell 1$ loss)	$>$ 4
Dziugaite and Roy (2017)	FC	MNIST-2	55,000	SGD	0.161 (error)	0.018 (error)	7.9
Wang and Ma (2022)	FC	MNIST-2	55,000	SGD	0.25 (CE loss)
Ours	FC	MNIST-10	1,100	GD	0.47 (CE loss)	0.45 (CE loss)	0.05
Ours	LENET-5	MNIST-10	1,100	GD	0.24 (CE loss)	0.20 (CE loss)	0.18
Negrea et al. (2019)	CNN	MNIST-10	55,000	SGLD	0.25 (CE loss)	0.02 (error)
Mou et al. (2018)	CNN	MNIST-10	55,000	SGLD	1.25 (CE loss)	0.02 (error)
Ours	FC	CIFAR-2	1,100	GD	0.34 (CE loss)	0.41 (CE loss)	0.17
Arora et al. (2019)	FC	CIFAR-2	10,000	GD (second layer fixed)	0.6 ( $\ell 1$ loss)	0.45 ( $\ell 1$ loss)	0.33
Ours	LENET-5	CIFAR-2	1,100	GD	0.46 (CE loss)	0.49 (CE loss)	0.06
Ours	WRN-4-4	CIFAR-2	1,100	GD	0.111 (CE loss)	0.107 (CE loss)	0.04

Table 2: Comparison with previous results in terms of the relative accuracy of the estimate of the generalization error. See Appendix B for the details of the datasets and architectures. CE loss indicates cross-entropy loss. (S)GD indicates (stochastic) gradient descent, SGLD is stochastic gradient Langevin dynamics. “Bound” in this table refers to the numerical value of the generalization bound. “Actual Value” is test loss or error on held-out test data. “Relative inaccuracy” equal “

|

Bound-Actual Value

|

/ Actual Value”. This characterizes the quality of these estimates. Different papers make different assumptions, apply to quite different models of neural networks, loss functions, training methods, and use different techniques. One must therefore be careful while interpreting this table. Note that, to be consistent with the calculations, all our experiments are conducted with gradient descent, not stochastic gradient descent. Practically, this means that in order to get the network to fit the training data well enough, we need to use small sample sizes.

Table 2 compares previous results of generalization bounds. The small relative inaccuracy of our methods shows good quality of our approximations.

Remark 19.

While we have tabulated the results above, we should emphasize the following three points for interpreting these results. First, it is not meaningful to compare different theories in Table 2 to find a superior theory. One upper bound being better than another numerically says little about the quantity they both bound. If one simply wanted to predict the generalization gap well, one would be content with using just cross-validation, see (Kawaguchi et al., 2018, Section 4). However, this does not mean that there is no need to do any theory. The goal of work on generalization is to understand what properties of data, architectures and training lead to good generalization. Each work is answering a different facet of this question. This is why there is a big diversity of assumptions, techniques and conclusions. Second, one cannot compare these methods against each other since they have different assumptions on the architecture and training method. Hence, in Table 2, we compare the relative accuracy, which is $|$ approximation - actual $|$ / actual. This is a reasonable way to compare these approaches. And our approach indeed does very well. Third, our theory is for gradient flow. Thus, our implementation uses gradient descent, not stochastic gradient descent. In practice we cannot get a small training error for these datasets with gradient descent. And this is why we use fewer samples.

4.2 Initial residual lies primarily in the subspace of effective Gram matrix with small eigenvalues

Fig. 3 (left) shows that for MNIST, the initial residual $\vec{r}_{n}(0)$ lies primarily in the subspace of $K_{n}$ with small eigenvalues, while for the random task, the initial residual put more weights into subspace with larger eigenvalues, where the projection is not negligible even for the head eigenvalues. In Fig. 3 (right), the tail subspace of $K_{n}$ with less than 3% of the eigenvalues recovers 98% of $\vec{r}_{n}(0)$ . This shows that if the task that we need to fit is simple, in the sense that the initial residual predominantly lies in the tail subspace of $K_{n}$ , then the eventual generalization gap is small (the generalization gaps for MNIST and random task are 0.47 and 3.27 respectively). This indicates a benign training process, i.e., the generalization loss accumulates slowly.

Remark 20 (Comparing the statistics for different numbers of samples).

The effective Gram matrix $K_{n}\in\mathcal{Y}^{n}\times\mathcal{Y}^{n}$ lies in a different space when neural networks are trained with different numbers of samples $n$ . Therefore, to compare quantities like $\sigma(K_{n})$ , $M(\vec{r}_{n},E(K_{n}))$ and $M(K_{n})$ for different $n$ and the same $\mathcal{Y}$ , we use a “relative index” as described in Table 1. We rescale the original index vector to have indices from zero to one. We should emphasize that by normalizing the residual by $\sqrt{n}$ in Eq. 11, the $\ell_{2}$ -norm of the initial residuals $\norm{\vec{r}_{n}(0)}_{2}$ is similar when $\mathcal{Y}$ is the same, even if $n$ is different. Note that the estimated generalization gap $\vec{r}_{n}(0)^{\top}K_{n}\vec{r}_{n}(0)$ is the average of the eigenvalues ${\sigma(K_{n})}_{i}$ , each weighted by the projected residual $P(\vec{r}_{n}(0),E(K_{n}))_{i}^{2}$ . We therefore also compute $\bar{\sigma}(K_{n})$ to understand the effect of $K_{n}$ . Table S.1 details the numerical values of these quantities for different datasets and architectures.

4.3 As training proceeds, the residual projects more into the principal subspace of the effective Gram matrix

We next consider the training process starting from different times $t_{0}$ instead of $t_{0}=0$ . Analogously to what we have done in Section 3, the increment of the averaged loss difference $\bar{\Delta}_{n}(\infty)-\bar{\Delta}_{n}(t_{0})$ from time $t_{0}$ to the end of training can be approximated by $\vec{r}_{n}(t_{0})^{\top}K_{n}(t_{0})\vec{r}_{n}(t_{0})$ . The effective Gram matrix $K_{n}(t_{0})$ for the training process starting from $t_{0}$ can be calculated using revised contraction and perturbation factors $\bar{c}_{n}$ and $\bar{\epsilon}_{n}$ . The detailed calculation is given in Section B.2.

From Fig. 4, as training proceeds, the residual $\vec{r}_{n}(t_{0})$ aligns more and more with the subspace of the effective Gram matrix $K_{n}(t_{0})$ with large eigenvalues. This is because in the initial phases of training, the residual is first fitted in the subspace with small eigenvalues, and this accumulates the generalization gap slowly. As training proceeds, to reduce the training loss, the network updates the residual to lie in less benign subspaces, those with larger eigenvalues.

4.4 Effective Gram matrix for different datasets

Fig. 6 compares the normalized projection of residual and eigenspectra of the effective Gram matrix for different synthetic datasets. From the classical analysis of linear regression, we know that data is more difficult to learn when labels are correlated with features corresponding to smaller proportions of eigenvalues of the input correlation matrix. In the relabeled MNIST datasets, Syn- $(a,b)$ with larger $a$ labels with less prominent features, and in the Gaussian datasets, Gaussian- $\alpha$ with smaller $\alpha$ puts less weight on the top eigenvalues as showed in Yang et al. (2022). In both cases, we manually created difficult tasks. Using experiments on synthetic datasets with different levels of difficulty, we see that for difficult tasks, the residual projects more onto the subspace corresponding to larger eigenvalues, and the effective Gram matrix $K_{n}$ has larger magnitude, which jointly lead to a larger predicted generalization gap by our theory. And indeed, the true generalization gap corroborates this trend.

Fig. 7 compares the training on MNIST and CIFAR-10. The initial residual of MNIST projects more in the eigenspace of the effective Gram matrix with small eigenvalues, and the eigenvalues of $K_{n}$ for the training of CIFAR are uniformly larger than that of MNIST. This shows that both good task-Gram matrix alignment and the small magnitude of the eigenvalues of $K_{n}$ are necessary for a “benign training process” and a good eventual generalization gap.

4.5 Effective Gram matrix for different architectures

Fig. 9 compares the normalized projection of residual and eigenvalues of the effective Gram matrix for MNIST and CIFAR when trained using different models (FC, LeNet-5 and WRN-4-4). The eigenspectrum $\sigma(K_{n})$ of FC is uniformly larger than that of LeNeT-5 trained with MNIST. Similarly, $\sigma(K_{n})$ of FC and LeNet-5 is larger than that of WRN-4-4 when trained with CIFAR. The large magnitude of the effective Gram matrix leads to a large generalization gap accumulation in all subspaces, resulting in worse generalization.

To demonstrate that our theory also applies to models other than neural networks, in Fig. 10, we fit ridgeless kernel regression (Rakhlin and Liang, 2020) with neural tangent kernel $K_{t_{\text{ker}}}(x,x^{\prime})=\nabla f(w_{n}(t),x)^{\top}\nabla f(w_{n}(t),% x^{\prime})$ using cross-entropy loss for different times $t_{\text{ker}}$ (we manually choose the time points $t_{\text{ker}}$ so that they spread out over the full training process). Note that the kernel here is the standard NTK, which is not related to our effective Gram matrix. The evolution of the predictor is

\displaystyle\derivative{f_{t}(x)}{t}=-\frac{1}{n}\sum_{i=1}^{n}K_{t_{\text{% ker}}}(x,x_{i})r_{t}(z_{i}),\quad r_{t}(z_{i})=\derivative{\ell(f_{t}(x),y)}{f% _{t}(x)}.

Using a fixed Jacobian at initialization leads to larger eigenvalues and more projection of the residual onto the stiff subspaces of the effective Gram matrix $K_{n}$ , i.e., a larger eventual generalization gap, as is widely known (Fort et al. (2020)).

4.6 Effective Gram matrix for different number of samples

Fig. 11 compares the training of datasets with different sizes. When $n$ becomes larger, the initial residual of MNIST projects more in the tail subspaces of the effective Gram matrix $K_{n}$ , and the tail eigenvalue of $K_{n}$ becomes smaller. This coincides with the smaller generalization gap as we train with more samples.

5 Conclusion

We identified key quantities in the training process that control the generalization gap, namely, a contraction factor that brings trajectories on different datasets together, and a perturbation factor that arises from the differences in the sample sets. The merit of our analysis is that it can succinctly and faithfully characterize the generalization gap—of general neural networks. The expression in Theorem 13 depends only on the initial residual $\vec{r}_{n}(0)$ and the effective kernel $K_{n}(0,t)$ . It is important to emphasize that this effective kernel is designed to understand the generalization gap, not the training dynamics. The existence and utility of this kernel indicates that we might be able to fruitfully think of deep networks in cohort with other models in a machine learning practitioner’s toolkit—perhaps they are not as anomalous as they appear to be.

Acknowledgment

This work was funded by grants provided by the National Science Foundation (IIS-2145164, CCF-2212519).

\c@NAT@ctr

References

Akbari et al. (2021) A. Akbari, M. Awais, M. Bashar, and J. Kittler. How does loss function affect generalization performance of deep learning? application to human age estimation. In Proceedings of the 38th International Conference on Machine Learning, volume 139 of Proceedings of Machine Learning Research, pages 141–151. PMLR, 2021.
Allen-Zhu et al. (2019) Z. Allen-Zhu, Y. Li, and Y. Liang. Learning and generalization in overparameterized neural networks, going beyond two layers. In Advances in Neural Information Processing Systems, volume 32, pages 6158–6169, 2019.
Arora et al. (2019) S. Arora, S. S. Du, W. Hu, Z. Li, and R. Wang. Fine-grained analysis of optimization and generalization for overparameterized two-layer neural networks. In Proceedings of the 36th International Conference on Machine Learning, 2019.
Ba et al. (2016) J. L. Ba, J. R. Kiros, and G. E. Hinton. Layer normalization. arXiv preprint arXiv:1607.06450, 2016.
Banerjee et al. (2022) A. Banerjee, T. Chen, X. Li, and Y. Zhou. Stability based generalization bounds for exponential family Langevin dynamics. In Proceedings of the 39th International Conference on Machine Learning, volume 162 of Proceedings of Machine Learning Research, pages 1412–1449. PMLR, 17–23 Jul 2022.
Bartlett et al. (2017) P. L. Bartlett, D. J. Foster, and M. Telgarsky. Spectrally-normalized margin bounds for neural networks. In Advances in Neural Information Processing Systems, volume 30, pages 6240–6249, 2017.
Bartlett et al. (2020) P. L. Bartlett, P. M. Long, G. Lugosi, and A. Tsigler. Benign overfitting in linear regression. Proceedings of the National Academy of Sciences, 117(48):30063–30070, 2020. doi: 10.1073/pnas.1907378117.
Belkin et al. (2018) M. Belkin, S. Ma, and S. Mandal. To understand deep learning we need to understand kernel learning. In Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pages 541–549. PMLR, 10–15 Jul 2018.
Belkin et al. (2020) M. Belkin, D. Hsu, and J. Xu. Two models of double descent for weak features. SIAM Journal on Mathematics of Data Science, 2(4):1167–1180, 2020. doi: 10.1137/20M1336072.
Bousquet and Elisseeff (2002) O. Bousquet and A. Elisseeff. Stability and generalization. Journal of Machine Learning Research, 2:499–526, 2002.
Bowman and Montúfar (2022) B. Bowman and G. Montúfar. Spectral bias outside the training set for deep networks in the kernel regime. In Advances in Neural Information Processing Systems, volume 35, 2022.
Cao and Gu (2019) Y. Cao and Q. Gu. Generalization bounds of stochastic gradient descent for wide and deep neural networks. In Advances in Neural Information Processing Systems, volume 32, pages 10836–10846, 2019.
Charles and Papailiopoulos (2018) Z. Charles and D. Papailiopoulos. Stability and generalization of learning algorithms that converge to global optima. In Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pages 745–754. PMLR, 10–15 Jul 2018.
Chizat and Bach (2018) L. Chizat and F. Bach. On the global convergence of gradient descent for over-parameterized models using optimal transport. In Advances in Neural Information Processing Systems, volume 31, pages 3036–3046. Curran Associates, Inc., 2018.
Chu and Raginsky (2023) Y. Chu and M. Raginsky. A unified framework for information-theoretic generalization bounds. In Advances in Neural Information Processing Systems, volume 36, pages 79260–79278, 2023.
Chuang et al. (2021) C.-Y. Chuang, Y. Mroueh, K. Greenewald, A. Torralba, and S. Jegelka. Measuring generalization with optimal transport. In Advances in Neural Information Processing Systems, volume 34, pages 3031–3044, 2021.
Du et al. (2019) S. S. Du, X. Zhai, B. Póczos, and A. Singh. Gradient descent provably optimizes over-parameterized neural networks. In International Conference on Learning Representations, 2019.
Dziugaite and Roy (2017) G. K. Dziugaite and D. M. Roy. Computing nonvacuous generalization bounds for deep (stochastic) neural networks with many more parameters than training data. In Proceedings of the 33rd Conference on Uncertainty in Artificial Intelligence, 2017.
Fort and Ganguli (2019) S. Fort and S. Ganguli. Emergent properties of the local geometry of neural loss landscapes. arXiv preprint arXiv:1910.05929, 2019.
Fort et al. (2020) S. Fort, G. K. Dziugaite, M. Paul, S. Kharaghani, D. M. Roy, and S. Ganguli. Deep learning versus kernel learning: An empirical study of loss landscape geometry and the time evolution of the neural tangent kernel. In Advances in Neural Information Processing Systems, volume 33, pages 5850–5861, 2020.
Futami and Fujisawa (2023) F. Futami and M. Fujisawa. Time-independent information-theoretic generalization bounds for sgld. arXiv preprint arXiv:2311.01046, 2023.
Hafez-Kolahi et al. (2020) H. Hafez-Kolahi, Z. Golgooni, S. Kasaei, and M. Soleymani. Conditioning and processing: Techniques to improve information-theoretic generalization bounds. In Advances in Neural Information Processing Systems, volume 33, pages 3492–3503. Curran Associates, Inc., 2020.
Hardt et al. (2016) M. Hardt, B. Recht, and Y. Singer. Train faster, generalize better: Stability of stochastic gradient descent. In Proceedings of The 33rd International Conference on Machine Learning, volume 48 of Proceedings of Machine Learning Research, pages 1225–1234, New York, NY, USA, June 20–22 2016. PMLR.
Hastie et al. (2022) T. Hastie, A. Montanari, S. Rosset, and R. J. Tibshirani. Surprises in high-dimensional ridgeless least squares interpolation. The Annals of Statistics, 50(2):949–986, 2022. doi: 10.1214/21-AOS2133.
Isidori (1995) A. Isidori. Nonlinear Control Systems. Communications and Control Engineering. Springer-Verlag, London, 3rd edition, 1995. doi: 10.1007/978-1-84628-615-5.
Jacot et al. (2018) A. Jacot, F. Gabriel, and C. Hongler. Neural tangent kernel: Convergence and generalization in neural networks. In Advances in Neural Information Processing Systems, volume 31, pages 8571–8580. Curran Associates, Inc., 2018.
Jacot et al. (2020) A. Jacot, B. Şimşek, F. Spadaro, C. Hongler, and F. Gabriel. Kernel alignment risk estimator: Risk prediction from training data. In Advances in Neural Information Processing Systems, volume 33, pages 15568–15578, 2020.
Kawaguchi et al. (2018) K. Kawaguchi, L. P. Kaelbling, and Y. Bengio. Generalization in deep learning. Technical report, Massachusetts Institute of Technology, 2018.
Kawaguchi et al. (2022) K. Kawaguchi, Z. Deng, K. Luh, and J. Huang. Robustness implies generalization via data-dependent generalization bounds. In Proceedings of the 39th International Conference on Machine Learning, volume 162 of Proceedings of Machine Learning Research, pages 10866–10894. PMLR, 17–23 Jul 2022.
Kozachkov et al. (2023) L. Kozachkov, P. M. Wensing, and J.-J. E. Slotine. Generalization as dynamical robustness—the role of riemannian contraction in supervised learning. Transactions on Machine Learning Research, 2023.
Krizhevsky (2009) A. Krizhevsky. Learning multiple layers of features from tiny images. Technical report, University of Toronto, 2009.
Laurent and von Brecht (2018) T. Laurent and J. von Brecht. Deep linear networks with arbitrary loss: All local minima are global. In Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pages 2902–2907. PMLR, 2018.
LeCun et al. (1998) Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
Li and Liang (2018) Y. Li and Y. Liang. Learning overparameterized neural networks via stochastic gradient descent on structured data. In Advances in Neural Information Processing Systems, volume 31, pages 8157–8166. Curran Associates, Inc., 2018.
Liu et al. (2022) F. Liu, H. Yang, S. Hayou, and Q. Li. From optimization dynamics to generalization bounds via lojasiewicz gradient inequality. Transactions on Machine Learning Research, 2022. Accepted; to appear.
Lohmiller and Slotine (1998) W. Lohmiller and J.-J. E. Slotine. On contraction analysis for non-linear systems. Automatica, 34(6):683–696, 1998. ISSN 0005-1098.
Lohmiller and Slotine (2000) W. Lohmiller and J.-J. E. Slotine. Nonlinear process control using contraction theory. AIChE Journal, 46(3):588–596, March 2000. doi: 10.1002/aic.690460316.
Lugosi and Neu (2022) G. Lugosi and G. Neu. Generalization bounds via convex analysis. In Proceedings of the 35th Conference on Learning Theory (COLT), volume 178 of Proceedings of Machine Learning Research, pages 3523–3544. PMLR, 2022.
Magnus (1954) W. Magnus. On the exponential solution of differential equations. Communications on Pure and Applied Mathematics, 7(4):649–673, 1954. doi: 10.1002/cpa.3160070404.
Mallinar et al. (2022) N. Mallinar, J. B. Simon, A. Abedsoltan, P. Pandit, M. Belkin, and P. Nakkiran. Benign, tempered, or catastrophic: A taxonomy of overfitting. In Advances in Neural Information Processing Systems, volume 35, pages 29912–29925. Curran Associates, Inc., 2022.
Mao et al. (2024) J. Mao, I. Griniasty, H. K. Teoh, R. Ramesh, R. Yang, M. K. Transtrum, J. P. Sethna, and P. Chaudhari. The training process of many deep networks explores the same low-dimensional manifold. Proceedings of the National Academy of Sciences, 121(12):e2310002121, 2024. doi: 10.1073/pnas.2310002121.
Marino and Tomei (1995) R. Marino and P. Tomei. Nonlinear Control Design: Geometric, Adaptive, and Robust. Prentice Hall, London, 1995. ISBN 978-0133426359.
McAllester (1999) D. A. McAllester. Pac-bayesian model averaging. In Proceedings of the Twelfth Annual Conference on Computational Learning Theory (COLT), pages 164–170. ACM, 1999. doi: 10.1145/307400.307435.
McDiarmid (1989) C. McDiarmid. On the method of bounded differences. In Surveys in Combinatorics, 1989 (Norwich, 1989), volume 141 of London Mathematical Society Lecture Note Series, pages 148–188. Cambridge University Press, 1989.
Mei and Montanari (2022) S. Mei and A. Montanari. The generalization error of random features regression: Precise asymptotics and the double descent curve. Communications on Pure and Applied Mathematics, 75(4):667–766, 2022. doi: 10.1002/cpa.22008.
Mei et al. (2019) S. Mei, T. Misiakiewicz, and A. Montanari. Mean-field theory of two-layers neural networks: dimension-free bounds and kernel limit. In Proceedings of the 32nd Conference on Learning Theory, volume 99 of Proceedings of Machine Learning Research, pages 2388–2464. PMLR, 2019.
Mou et al. (2018) W. Mou, L. Wang, X. Zhai, and K. Zheng. Generalization bounds of sgld for non-convex learning: Two theoretical viewpoints. In Proceedings of the 31st Conference On Learning Theory, volume 75 of Proceedings of Machine Learning Research, pages 605–638. PMLR, 06–09 Jul 2018.
Negrea et al. (2019) J. Negrea, M. Haghifam, G. K. Dziugaite, A. Khisti, and D. M. Roy. Information-theoretic generalization bounds for sgld via data-dependent estimates. In Advances in Neural Information Processing Systems, volume 32, 2019.
Neu et al. (2021) G. Neu, G. K. Dziugaite, M. Haghifam, and D. M. Roy. Information-theoretic generalization bounds for stochastic gradient descent. In Proceedings of the 34th Annual Conference on Learning Theory, volume 134 of Proceedings of Machine Learning Research, pages 3526–3546. PMLR, 2021.
Neyshabur et al. (2018) B. Neyshabur, S. Bhojanapalli, and N. Srebro. A pac-bayesian approach to spectrally-normalized margin bounds for neural networks. In International Conference on Learning Representations, 2018.
Rakhlin and Liang (2020) A. Rakhlin and T. Liang. Just interpolate: Kernel ’ridgeless’ regression can generalize. Annals of Statistics, 48(3):1329–1347, 2020. doi: 10.1214/19-AOS1849.
Ramesh et al. (2024) R. Ramesh, A. Bisulco, R. W. DiTullio, L. Wei, V. Balasubramanian, K. Daniilidis, and P. Chaudhari. Many perception tasks are highly redundant functions of their input data. arXiv preprint arXiv:2407.13841, 2024.
Richards and Kuzborskij (2021) D. Richards and I. Kuzborskij. Stability & generalisation of gradient descent for shallow neural networks without the neural tangent kernel. In Advances in Neural Information Processing Systems, volume 34, pages 21812–21823, 2021.
Steinke and Zakynthinou (2020) T. Steinke and L. Zakynthinou. Reasoning about generalization via conditional mutual information. In Proceedings of the 33rd Conference on Learning Theory, volume 125 of Proceedings of Machine Learning Research, pages 3437–3452. PMLR, 2020.
Tsukamoto et al. (2021) H. Tsukamoto, S.-J. Chung, and J.-J. Slotine. Contraction theory for nonlinear stability analysis and learning-based control: A tutorial overview. Annual Reviews in Control, 52:135–148, October 2021. doi: 10.1016/j.arcontrol.2021.10.001.
Valiant (1984) L. G. Valiant. A theory of the learnable. Communications of the ACM, 27(11):1134–1142, 1984. doi: 10.1145/1968.1972.
Wang and Ma (2022) M. Wang and C. Ma. Generalization error bounds for deep neural networks trained by sgd. arXiv preprint arXiv:2206.03299, 2022.
Xu and Raginsky (2017) A. Xu and M. Raginsky. Information-theoretic analysis of generalization capability of learning algorithms. In Advances in Neural Information Processing Systems, volume 30, pages 2524–2533, 2017.
Xu and Mannor (2012) H. Xu and S. Mannor. Robustness and generalization. Machine Learning, 86(3):391–423, 2012. doi: 10.1007/s10994-011-5268-1.
Yang et al. (2022) R. Yang, J. Mao, and P. Chaudhari. Does the data induce capacity control in deep learning? In Proceedings of the 39th International Conference on Machine Learning, volume 162 of Proceedings of Machine Learning Research, pages 25348–25368. PMLR, 2022.
Zagoruyko and Komodakis (2016) S. Zagoruyko and N. Komodakis. Wide residual networks. In Proceedings of the British Machine Vision Conference (BMVC). BMVA Press, 2016.

Appendix A Proofs and Calculations in Section 3

A.1 Proof of Lemma 3

By the definition of the averaged loss difference $\bar{\Delta}_{n}(t)$ ,

\displaystyle\bar{\Delta}_{n}(t)=\frac{1}{n}\left(\sum_{i=1}^{n}\ell(w_{n}^{-i% }(t),z_{i})\right)-\bar{\ell}(w_{n}(t),S_{n})

Taking the expectation on both sides, we have

\displaystyle\operatorname*{\mathbb{E}}\left[\bar{\Delta}_{n}(t)\right]=% \operatorname*{\mathbb{E}}\left[R(S_{n-1},t)\right]-\operatorname*{\mathbb{E}}% \left[R_{\text{train}}(S_{n},t)\right]

By assumption, we have

\displaystyle\operatorname*{\mathbb{E}}\left[R(S_{n},t)\right]\leq% \operatorname*{\mathbb{E}}\left[R(S_{n-1},t)\right],\quad\operatorname*{% \mathbb{E}}\left[R_{\text{train}}(S_{n-1},t)\right]\leq\operatorname*{\mathbb{% E}}\left[R_{\text{train}}(S_{n},t)\right].

Therefore,

\displaystyle\operatorname*{\mathbb{E}}\left[\delta R(S_{n},t)\right]\leq% \mathbb{E}\left[\bar{\Delta}_{n}(t)\right]\leq\operatorname*{\mathbb{E}}\left[% \delta R(S_{n-1},t)\right].

By the assumption that $\operatorname*{\mathbb{E}}[\delta R(S_{n},t)]/\operatorname*{\mathbb{E}}[% \delta R(S_{n-1},t)]\to 1$ as $n\to\infty$ , we have that

\displaystyle\frac{\operatorname*{\mathbb{E}}[\delta R(S_{n-1},t)]-% \operatorname*{\mathbb{E}}[\delta R(S_{n},t)]}{\operatorname*{\mathbb{E}}[% \delta R(S_{n},t)]}\to 0

as $n\to\infty$ . Hence,

\operatorname*{\mathbb{E}}\left[\delta R(S_{n},t)\right]=\operatorname*{% \mathbb{E}}\left[\bar{\Delta}_{n}(t)\right]+o\left(\operatorname*{\mathbb{E}}% \left[\delta R(S_{n},t)\right]\right).

The scenarios when the assumptions in Lemma 3 hold

•

The expected generalization loss $\operatorname*{\mathbb{E}}[R(S_{n},t)]$ is non-increasing in $n$ : This holds for ridgeless linear regression without label noise. When label noise is non-zero, the generalization loss decays monotonically when the number of samples is greater than the number of features. This result also holds for ridge regression when the ridge coefficient $\lambda$ decays with $n$ , but not too fast, i.e., $\lambda>\frac{\sigma^{2}}{\sigma^{2}+\norm{\theta^{*}}^{2}}\cdot\frac{1}{n}$ where $\sigma$ is the noise variance and $\theta^{*}$ is the true regressor. See Hastie et al. (2022) for reference. Similar results hold for kernel regression and random feature regression-based architectures Mei and Montanari (2022), which are both widely used models in the analysis of neural networks. For consistent estimators, the generalization loss converges to the Bayes risk asymptotically. Although this decrease need not be strictly monotonic. Estimators like Empirical Risk Minimizer (ERM), Structural Risk Minimization (SRM) are consistent under mild assumptions on the hypothesis class, e.g., having finite capacity.
•

The expected training loss $\operatorname*{\mathbb{E}}[R_{\text{train}}(S_{n},t)]$ is non-decreasing: This holds for ridgeless linear regression in general, and therefore for kernel regression and random feature-based models of neural networks. Note that this assumption can be modified slightly to be $\operatorname*{\mathbb{E}}[R_{\text{train}}(S_{n-1},t)]\leq\operatorname*{% \mathbb{E}}[R_{\text{train}}(S_{n},t)]+B/n$ . This new condition holds for empirical risk minimization (ERM) with bounded loss $|\ell(w,z)|\leq B$ in general. The resulting left-hand side of the inequality in Lemma 3 gets an additive term of $B/n$ correspondingly. The rest of our calculations stay as they are.
•

The expected generalization gap $\operatorname*{\mathbb{E}}[\delta R(S_{n},t)]$ is non-negative: This holds for empirical risk minimization (ERM), in general.

Concentration of $\bar{\Delta}_{n}(t)$ to $\mathbb{E}[\bar{\Delta}_{n}(t)]$

We first define the notion of stability for deterministic algorithm $\mathcal{A}$ that maps from space of datasets to weight space, i.e. $\mathcal{A}:\cup_{n=0}^{\infty}\mathcal{Z}^{n}\to\mathcal{W}$ .

Definition 21.

An algorithm $\mathcal{A}$ is uniformly $\varepsilon$ -stable if for all datasets $S$ , $S^{\prime}$ differing in at most one sample, we have

\displaystyle\sup_{z}\absolutevalue{\ell(\mathcal{A}(S),z)-\ell(\mathcal{A}(S^% {\prime}),z)}\leq\varepsilon

Now we define the set of algorithms $\Gamma$ that maps dataset $S$ to points on the gradient flow trajectory trained on $S$ at certain time points.

\displaystyle\Gamma=\bigg{\{}\mathcal{A}:\mathcal{A}(S)=w(t),t\geq 0,\text{$w$% satisfies}\derivative{w}{t}=-\nabla\bar{\ell}(w,S),w(0)\in\mathcal{W},S\in% \cup_{n\in\mathbb{N}}\mathcal{Z}^{n}\bigg{\}}

Lemma 22.

Assume that (1) $\absolutevalue{\ell(w,z)}\leq B$ for all $w\in\mathcal{W},z\in\mathcal{Z}$ , (2) $\forall\mathcal{A}\in\Gamma$ , $\mathcal{A}$ is $\varepsilon$ -stable, then for all $t>0$ , with probability $1-\delta$ ,

\displaystyle\absolutevalue{\bar{\Delta}_{n}(t)-\mathbb{E}[\bar{\Delta}_{n}(t)% ]}\leq(n\varepsilon+2B)\sqrt{\frac{2\log(2/\delta)}{n}}

Proof.

Let $\tilde{S}_{n}$ denote a modified dataset of $S_{n}$ by replacing the sample $z_{j}$ with a different sample $\tilde{z}_{j}$ . Let $\tilde{w}_{n}(t)$ , $\tilde{w}_{n}^{-i}(t)$ be the corresponding trajectories trained with $\tilde{S}_{n}$ and $\tilde{S}_{n}^{-i}$ (the removed- $i$ th sample version of $S_{n}$ ). Note that $\tilde{w}_{n}^{-j}(t)=w_{n}^{-j}(t)$ . Let $\bar{\Delta}(\tilde{S}_{n},t)$ and $\bar{\Delta}(S_{n},t)$ be the averaged loss difference calculated on $\tilde{S}_{n}$ and $S_{n}$ respectively. By assumptions (1) and (2), we have

	$\displaystyle\|\bar{\ell}(\tilde{w}_{n}(t),\tilde{S}_{n})-\bar{\ell}(w_{n}(t),S% _{n})\|$	$\displaystyle\leq\frac{(n-1)\varepsilon}{n}+\frac{2B}{n}\leq\varepsilon+\frac{% 2B}{n}$
	$\displaystyle\|\ell(\tilde{w}_{n}^{-i}(t),z_{i})-\ell(w_{n}^{-i}(t),z_{i})\|$	$\displaystyle\leq\varepsilon\quad\forall i\neq j$
	$\displaystyle\|\ell(\tilde{w}_{n}^{-j}(t),z_{j})-\ell(w_{n}^{-j}(t),z_{j})\|$	$\displaystyle\leq\frac{2B}{n}$

Hence we have

\displaystyle\absolutevalue{\bar{\Delta}(\tilde{S}_{n},t)-\bar{\Delta}(S_{n},t% )}\leq 2\varepsilon+\frac{4B}{n}

(19)

Inequality Eq. 19 gives the replace-one-sample difference of $\bar{\Delta}_{n}$ , hence by McDiarmid’s inequality (McDiarmid, 1989), we have the following concentration inequality,

\displaystyle\mathbb{P}_{S_{n}}\left[|\bar{\Delta}_{n}-\mathbb{E}[\bar{\Delta}% _{n}]|\geq a\right]\leq 2\exp\left(-\frac{2a^{2}}{n(2\varepsilon+4B/n)^{2}}\right)

Setting the right hand side to $\delta$ , we have with probability at least $1-\delta$ ,

\displaystyle\absolutevalue{\bar{\Delta}_{n}-\mathbb{E}[\bar{\Delta}_{n}]}\leq% (n\varepsilon+2B)\cdot\sqrt{\frac{2\log(2/\delta)}{n}}.

∎

Remark 23.

In general, the convergence of $\bar{\Delta}_{n}$ to $\mathbb{E}\left[\bar{\Delta}_{n}\right]$ can be guaranteed by different versions of algorithm stability (e.g. hypothesis stability, pointwise hypothesis stability and uniform stability (Bousquet and Elisseeff, 2002)). Charles and Papailiopoulos (2018) shows that the algorithm $\mathcal{A}$ is $C(L,\mu)/(n-1)$ -uniformly stable if $\ell(w,z)$ is $L$ -Lipchitz in $w$ and $\bar{\ell}(w,S)$ is $\mu$ -PL (Polyak Lojasiewicz) in $w$ , where $C(L,\mu)$ is a constant depending on $L$ and $\mu$ . Other versions of stability can also be guaranteed by PL and QG (quadratic growth) conditions as showed in Charles and Papailiopoulos (2018).

A.2 Proof of Lemma 4

By taking the derivative of the pointwise loss difference $\Delta_{n}^{-i}(t)$ , we have,

	$\displaystyle\frac{\differential\Delta_{n}^{-i}(t)}{\differential t}$	$\displaystyle=\frac{\differential\left(\ell(w_{n}^{-i}(t),z_{i})-\ell(w_{n}(t)% ,z_{i})\right)}{\differential t}$
		$\displaystyle=-\nabla\ell(w,z_{i})\cdot\nabla\bar{\ell}(w,S_{n}^{-i})\big{\|}_{% w_{n}^{-i}(t)}-\left(-\nabla\ell(w,z_{i})\cdot\nabla\bar{\ell}(w,S_{n})\big{\|}% _{w_{n}(t)}\right)$
		$\displaystyle=-\left(\nabla\ell(w,z_{i})\cdot\nabla\bar{\ell}(w,S_{n}^{-i})% \big{\|}_{w_{n}^{-i}(t)}-\nabla\ell(w,z_{i})\cdot\nabla\bar{\ell}(w,S_{n}^{-i})% \big{\|}_{w_{n}(t)}\right)$
		$\displaystyle+\left(-\nabla\ell(w,z_{i})\cdot\nabla\bar{\ell}(w,S_{n}^{-i})% \big{\|}_{w_{n}(t)}+\nabla\ell(w,z_{i})\cdot\nabla\bar{\ell}(w,S_{n})\big{\|}_{w% _{n}(t)}\right)$
		$\displaystyle=-\left(\nabla\ell(w,z_{i})\cdot\nabla\bar{\ell}(w,S_{n}^{-i})% \big{\|}_{w_{n}^{-i}(t)}-\nabla\ell(w,z_{i})\cdot\nabla\bar{\ell}(w,S_{n}^{-i})% \big{\|}_{w_{n}(t)}\right)$
		$\displaystyle+\nabla\ell(w,z_{i})\left(\nabla\bar{\ell}(w,S_{n})-\nabla\bar{% \ell}(w,S_{n}^{-i})\right)\big{\|}_{w_{n}(t)}$

Hence,

\displaystyle\derivative{\Delta^{-i}_{n}(t)}{t}=-c_{n}^{-i}(t)\Delta_{n}^{-i}(% t)+\epsilon_{n}^{-i}(t),

where

\displaystyle c_{n}^{-i}(t)=\frac{\nabla\ell(w,z_{i})\cdot\nabla\bar{\ell}(w,S% _{n}^{-i})\big{|}^{w_{n}^{-i}(t)}_{w_{n}(t)}}{\Delta_{n}^{-i}(t)},

and

\displaystyle\epsilon_{n}^{-i}(t)=\nabla\ell(w,z_{i})\cdot\left(\nabla\bar{% \ell}(w,S_{n})-\nabla\bar{\ell}(w,S_{n}^{-i})\right)\bigg{|}_{w_{n}(t)}.

A.3 Evolution of $\bar{\Delta}_{n}(t)$

The evolution of $\bar{\Delta}_{n}(t)$ can be derived through that of $\Delta_{n}^{-i}(t)$ .

	$\displaystyle\frac{\differential\bar{\Delta}_{n}(t)}{\differential t}$	$\displaystyle=\frac{1}{n}\sum_{i=1}^{n}\frac{\differential\Delta_{n}^{-i}(t)}{% \differential t}$
		$\displaystyle=-\frac{1}{n}\sum_{i=1}^{n}\left(c_{n}^{-i}(t)\Delta_{n}^{-i}(t)+% \epsilon_{n}^{-i}(t)\right)$
		$\displaystyle=-\frac{\frac{1}{n}\sum_{i=1}^{n}c_{n}^{-i}(t)\Delta_{n}^{-i}(t)}% {\bar{\Delta}_{n}(t)}\bar{\Delta}_{n}(t)+\frac{1}{n}\sum_{i=1}^{n}\epsilon_{n}% ^{-i}(t)$
		$\displaystyle=-\bar{c}_{n}(t)\bar{\Delta}_{n}(t)+\bar{\epsilon}_{n}(t).$

Here we have,

	$\displaystyle\bar{c}_{n}(t)$	$\displaystyle=\frac{\frac{1}{n}\sum_{i=1}^{n}c_{n}^{-i}(t)\Delta_{n}^{-i}(t)}{% \bar{\Delta}_{n}(t)}$
		$\displaystyle=\frac{\frac{1}{n}\sum_{i=1}^{n}\nabla\ell(w,z_{i})\cdot\nabla% \bar{\ell}(w,S_{n}^{-i})\big{\|}_{w_{n}(t)}^{w_{n}^{-i}(t)}}{\bar{\Delta}_{n}(t% )},$

and

	$\displaystyle\bar{\epsilon}_{n}(t)$	$\displaystyle=\frac{1}{n}\sum_{i=1}^{n}\epsilon_{n}^{-i}(t)$
		$\displaystyle=\frac{1}{n}\sum_{i=1}^{n}\nabla\ell_{i}^{\top}\left(\frac{1}{n}% \sum_{j=1}^{n}\nabla\ell_{j}-\frac{1}{n-1}\sum_{j\neq i}\nabla\ell_{j}\right)$
		$\displaystyle=\frac{1}{n}\sum_{i=1}^{n}\nabla\ell_{i}^{\top}\left(\frac{1}{n}% \nabla\ell_{i}-\frac{1}{n(n-1)}\sum_{j\neq i}\nabla\ell_{j}\right)$
		$\displaystyle=\frac{1}{n^{2}}\sum_{i=1}^{n}\nabla\ell_{i}^{\top}\nabla\ell_{i}% -\frac{1}{n^{2}(n-1)}\sum_{i\neq j}\nabla\ell_{i}^{\top}\nabla\ell_{j}.$

In the calculation above, we use $\nabla\ell_{i}$ , $\nabla\bar{\ell}$ as an abbreviation for $\nabla\ell(w_{n}(t),z_{i})$ , $\nabla\bar{\ell}(w_{n}(t),S_{n})$ respectively. Notice that we have the following decomposition of the gradient covariance matrix $\hat{\Sigma}(t)$ :

	$\displaystyle\hat{\Sigma}(t)$	$\displaystyle=\frac{1}{n}\sum_{i=1}^{n}\left(\nabla\ell_{i}-\nabla\bar{\ell}% \right)\left(\nabla\ell_{i}-\nabla\bar{\ell}\right)^{\top}$
		$\displaystyle=\frac{1}{n}\sum_{i=1}^{n}\nabla\ell_{i}\nabla\ell_{i}^{\top}-% \frac{1}{n^{2}}\left(\sum_{i=1}^{n}\nabla\ell_{i}\right)\left(\sum_{i=1}^{n}% \nabla\ell_{i}\right)^{\top}$
		$\displaystyle=\frac{n-1}{n^{2}}\sum_{i=1}^{n}\nabla\ell_{i}\nabla\ell_{i}^{% \top}-\frac{1}{n^{2}}\sum_{i\neq j}\nabla\ell_{i}\nabla\ell_{j}^{\top}$

Hence, we have

\displaystyle\bar{\epsilon}_{n}(t)=\frac{\tr\hat{\Sigma}(t)}{n-1},\quad\hat{% \Sigma}_{n}(t)=\operatorname*{\text{Cov}}_{z\sim\text{Unif}(S_{n})}\nabla\ell(% w_{n}(t),z),

where $\hat{\Sigma}_{n}(t)$ represents the covariance matrix of $\nabla\ell(w_{n}(t),z)$ for $z$ sampled uniformly from the dataset $S_{n}$ .

Evolution of $\operatorname*{\mathbb{E}}\left[\bar{\Delta}_{n}(t)\right]$ :

A modified version of $\bar{c}_{n}$ and $\bar{\epsilon}_{n}$ ,

\displaystyle\bar{c}_{n}=\frac{\operatorname*{\mathbb{E}}\left[\frac{1}{n}\sum% _{i=1}^{n}\nabla\ell(w,z_{i})\cdot\nabla\bar{\ell}(w,S_{n}^{-i})\big{|}_{w_{n}% (t)}^{w_{n}^{-i}(t)}\right]}{\operatorname*{\mathbb{E}}\left[\bar{\Delta}_{n}(% t)\right]},\quad\bar{\epsilon}_{n}=\frac{\operatorname*{\mathbb{E}}\left[\tr% \hat{\Sigma}(t)\right]}{n-1},

gives the evolution of $\mathbb{E}\left[\bar{\Delta}_{n}(t)\right]$ ,

\displaystyle\derivative{\mathbb{E}\left[\bar{\Delta}_{n}(t)\right]}{t}=-\bar{% c}_{n}(t)\mathbb{E}\left[\bar{\Delta}_{n}(t)\right]+\bar{\epsilon}_{n}(t).

A.4 Evolution of $\vec{r}_{n}(t)$

We derive the equation governing the evolution of $\vec{r}_{n}(t)$ by calculating its time derivative.

	$\displaystyle\frac{\differential\vec{r}_{n}(t)}{\differential t}$	$\displaystyle=\frac{1}{\sqrt{n}}\begin{bmatrix}\derivative{r(w_{n}(t),z_{1})}{% t}\\ ...\\ \derivative{r(w_{n}(t),z_{n})}{t}\end{bmatrix}=\frac{1}{\sqrt{n}}\begin{% bmatrix}\nabla r(w_{n}(t),z_{1})^{\top}\derivative{w_{n}(t)}{t}\\ ...\\ \nabla r(w_{n}(t),z_{n})^{\top}\derivative{w_{n}(t)}{t}\end{bmatrix}$
		$\displaystyle=\frac{1}{\sqrt{n}}\begin{bmatrix}-\frac{1}{n}\sum_{j=1}^{n}% \nabla r(w_{n}(t),z_{1})^{\top}\nabla f(w_{n}(t),x_{j})\cdot r(w_{n}(t),z_{j})% \\ ...\\ -\frac{1}{n}\sum_{j=1}^{n}\nabla r(w_{n}(t),z_{n})^{\top}\nabla f(w_{n}(t),x_{% j})\cdot r(w_{n}(t),z_{j})\end{bmatrix}=-\frac{1}{n}P_{n}(t)\vec{r}_{n}(t).$

Here the third equality follows from the evolution of $w_{n}(t)$ :

	$\displaystyle\derivative{w_{n}(t)}{t}$	$\displaystyle=-\nabla\ell(w_{n}(t),S_{n})$
		$\displaystyle=-\frac{1}{n}\sum_{i=1}^{n}\nabla f(w_{n}(t),x_{i})\derivative{% \ell(f(w_{n}(t),x_{i}),y_{i})}{f(w_{n}(t),x_{i})}$
		$\displaystyle=-\frac{1}{n}\sum_{i=1}^{n}\nabla f(w_{n}(t),x_{i})r(w_{n}(t),x_{% i}).$

Note that

	$\displaystyle\nabla r(w,z_{i}),\nabla f(w,z_{i})$	$\displaystyle\in\mathcal{W}\times\mathcal{Y}$
	$\displaystyle P_{n}(t)=\left[\nabla r(w_{n}(t),z_{i})^{\top}\nabla f(w_{n}(t),% x_{j})\right]_{i,j\in[n]}$	$\displaystyle\in\mathcal{Y}^{n}\times\mathcal{Y}^{n}.$

A.5 Proof of Lemma 10

We have the following decomposition of the gradient covariance $\hat{\Sigma}_{n}(t)$ :

	$\displaystyle\hat{\Sigma}_{n}(t)$	$\displaystyle=\frac{1}{n}\sum_{i=1}^{n}\left(\nabla\ell_{i}-\nabla\bar{\ell}% \right)\left(\nabla\ell_{i}-\nabla\bar{\ell}\right)^{\top}$
		$\displaystyle=\frac{1}{n}\sum_{i=1}^{n}\nabla\ell_{i}\nabla\ell_{i}^{\top}-% \nabla\bar{\ell}\nabla\bar{\ell}^{\top}$

where

	$\displaystyle\nabla\ell_{i}$	$\displaystyle=\nabla f(w_{n}(t),x_{i})r_{i}(t)$
	$\displaystyle\nabla\bar{\ell}$	$\displaystyle=\frac{1}{n}\sum_{i=1}^{n}\nabla f(w_{n}(t),x_{i})r_{i}(t).$

Hence, we have

	$\displaystyle\hat{\Sigma}_{n}(t)$	$\displaystyle=\vec{r}_{n}(t)^{\top}M_{n}(t)\vec{r}_{n}(t)-\frac{1}{n}\vec{r}_{% n}(t)^{\top}H_{n}(t)\vec{r}_{n}(t)$
		$\displaystyle=\vec{r}_{n}(t)^{\top}\left(M_{n}(t)-\frac{H_{n}(t)}{n}\right)% \vec{r}_{n}(t).$

where

	$\displaystyle M_{n}(t)$	$\displaystyle=\scalebox{1.0}{\mbox{$\displaystyle\operatorname{\text{diag}}% \left(\nabla f(w,x_{1})^{\top}\nabla f(w,x_{1}),\dots,\nabla f(w,x_{n})^{\top}% \nabla f(w,x_{n})\right)\bigg{\|}_{w_{n}(t)}$}},$
	$\displaystyle H_{n}(t)$	$\displaystyle=\left[\nabla f(w_{n}(t),x_{i})^{\top}\nabla f(w_{n}(t),x_{j})% \right]_{i,j\in[n]}.$

A.6 Proof of Theorem 13

By Eq. 13 and Lemma 10,

\tr\hat{\Sigma}_{n}(t)=\vec{r}_{n}(0)^{\top}\Omega_{n}(t)^{\top}\left(M_{n}(t)% -\frac{H_{n}(t)}{n}\right)\Omega_{n}(t)\vec{r}_{n}(0).

(20)

Combining with the solution of $\bar{\Delta}_{n}(t)$ Eq. 7, we have

\displaystyle\bar{\Delta}_{n}(t)=\vec{r}_{n}(0)^{\top}\left(\frac{\int_{0}^{t}% \Omega_{n}(s)^{\top}\left(M_{n}(s)-\frac{H_{n}(s)}{n}\right)\Omega_{n}(s)\exp% \left(-\int_{s}^{t}\bar{c}_{n}(u)\differential u\right)\differential s}{n-1}% \right)\vec{r}_{n}(0)

Hence, we have

\displaystyle\bar{\Delta}_{n}(t)=\vec{r}_{n}(0)^{\top}K_{n}(0,t)\vec{r}_{n}(0)

where

\displaystyle K_{n}(0,t)=\frac{\int_{0}^{t}\Omega_{n}(s)^{\top}\left(M_{n}(s)-% \frac{H_{n}(s)}{n}\right)\Omega_{n}(s)\exp\left(-\int_{s}^{t}\bar{c}_{n}(u)% \differential u\right)\differential s}{n-1}.

Now we prove the positive semi-definiteness (PSD) of $K_{n}(0,t)$ by showing that $M_{n}(s)-H_{n}(s)/n$ is PSD. For any vector $r\in\mathcal{Y}^{n}$ , rewrite $r$ as $r=[r_{1},\dots,r_{n}]$ , where $r_{i}\in\mathcal{Y}$ for all $i\in[n]$ . Then $r^{\top}(M_{n}-H_{n}/n)r$ is the trace of covariance of the set of vectors $\{\nabla f(w,x_{i})r_{i}\}_{i\in[n]}$ , hence non-negative, which implies that $M_{n}(s)-H_{n}(s)/n$ is PSD for all $s$ . Hence, the matrix $K_{n}(0,t)$ is PSD as an integral of PSD matrices.

Proof of Lemma 14.

Let $\sigma_{\max}(t)$ be the largest singular value of the propagator $\Omega_{n}(t)$ and $u(t)$ and $v(t)$ be its corresponding left and right singular vectors respectively, i.e.,

u(t)^{\top}\Omega(t)v(t)=\sigma_{\max}(t),

where $u(t)^{\top}u(t)=v(t)^{\top}v(t)=1$ . We first give a bound on $\sigma_{\max}(t)$ through its evolution.

	$\displaystyle\derivative{\sigma^{2}_{\max}(t)}{t}$	$\displaystyle=v(t)^{\top}\derivative{(\Omega^{\top}(t)\Omega(t))}{t}v(t)+2% \derivative{v(t)^{\top}}{t}\Omega^{\top}(t)\Omega(t)v(t)$
		$\displaystyle=v(t)^{\top}\derivative{(\Omega^{\top}(t)\Omega(t))}{t}v(t)$
		$\displaystyle=-\frac{1}{n}v(t)^{\top}\Omega(t)^{\top}(P_{n}(t)+P_{n}(t)^{\top}% )\Omega(t)v(t)$
		$\displaystyle=-\frac{\sigma^{2}_{\max}(t)}{n}u(t)^{\top}(P_{n}(t)+P_{n}(t)^{% \top})u(t)$
		$\displaystyle\leq-\frac{2\sigma^{2}_{\max}(t)\lambda_{\min}(t)}{n}.$

In the second equality, since $\Omega(t)v(t)=\sigma_{\max}(t)u(t)$ , we have

\displaystyle\derivative{v(t)^{\top}}{t}\Omega^{\top}(t)\Omega(t)v(t)=\sigma^{% 2}_{\max}(t)\derivative{v(t)^{\top}}{t}v(t)=\frac{1}{2}\sigma^{2}_{\max}(t)% \derivative{(v(t)^{\top}v(t))}{t}=0.

The third equality follows from the evolution of the propagator $\differential\Omega(t)/\differential t=-P_{n}(t)\Omega(t)/n$ . Note that $\Omega(0)=I$ , which implies that $\sigma_{\max}(0)=1$ , hence we have

\displaystyle\sigma^{2}_{\max}(t)\leq\exp\left(-2\int_{0}^{t}\frac{\lambda_{% \min}(s)}{n}\differential s\right).

Let $A(t)=\frac{\Omega_{n}(t)^{\top}(M_{n}(t)-H_{n}(t)/n)\Omega_{n}(t)}{n-1}$ , $\tilde{c}(s,t)=\exp\left(-\int_{s}^{t}\bar{c}(u)\differential u\right)$ . Then $\norm{A(t)}_{2}\leq\sigma^{2}_{\max}(t)m(t)\leq\omega(t)m(t)$ . Hence we have,

	$\displaystyle\norm{K_{n}(0,t_{2})-K_{n}(0,t_{1})}_{2}$	$\displaystyle=\norm{\int_{0}^{t_{1}}A(s)\left(\tilde{c}(s,t_{2})-\tilde{c}(s,t% _{1})\right)\differential s+\int_{t_{1}}^{t_{2}}A(s)\tilde{c}(s,t_{2})% \differential s}_{2}$
		$\displaystyle\leq\int_{0}^{t_{1}}\norm{A(s)\left(\tilde{c}(s,t_{2})-\tilde{c}(% s,t_{1})\right)}_{2}\differential s+\int_{t_{1}}^{t_{2}}\norm{A(s)\tilde{c}(s,% t_{2})}_{2}\differential s$
		$\displaystyle=\int_{0}^{t_{1}}\omega(s)m(s)\left(\tilde{c}(s,t_{1})-\tilde{c}(% s,t_{2})\right)\differential s+\int_{t_{1}}^{t_{2}}\omega(s)m(s)\tilde{c}(s,t_% {2})\differential s$

For the first term, $|\omega(s)m(s)(\tilde{c}(s,t_{2})-\tilde{c}(s,t_{1}))|\leq 2\omega(s)m(s)$ . By the integrability of $\omega(s)m(s)$ , and the dominated convergence theorem (DCT),

\displaystyle\lim_{t_{1},t_{2}\to\infty}\int_{0}^{t_{1}}\omega(s)m(s)(\tilde{c% }(s,t_{1})-\tilde{c}(s,t_{2}))\differential s=\int_{0}^{\infty}\lim_{t_{1},t_{% 2}\to\infty}\omega(s)m(s)(\tilde{c}(s,t_{1})-\tilde{c}(s,t_{2}))1_{[0,t_{1}]}(% s)\differential s=0.

Note that the existence of $\lim_{t\to\infty}\tilde{c}(s,t)$ , which is guaranteed by $\bar{c}_{n}(t)\geq 0$ , and the uniform boundedness of $\omega(t)m(t)$ , indicates that the limit of the product function being 0 in the second equality. For the second term,

\displaystyle\int_{t_{1}}^{t_{2}}\omega(s)m(s)\tilde{c}(s,t_{2})\differential s% \leq\int_{t_{1}}^{t_{2}}\omega(s)m(s)\differential s\to 0

as $t_{1},t_{2}\to\infty$ by condition (1). Hence, $\norm{K_{n}(0,t_{2})-K_{n}(0,t_{1})}_{2}\to 0$ as $t_{1},t_{2}\to 0$ , which shows the existence of $\lim_{t\to\infty}K_{n}(0,t)$ in 2-norm of matrix.

∎

Remark 24.

Sometimes the effective Gram matrix calculated from the propagator derived from $P_{n}(t)$ is not convergent, but in this case, we can create a perturbed version of $P_{n}(t)$ with controlled smallest eigenvalue of $(P_{n}(t)+P_{n}(t)^{\top})/2$ , which guarantees the convergence of $\lim_{t\to\infty}K_{n}(0,t)$ while preserving the trajectory of $\vec{r}_{n}(t)$ given $\vec{r}_{0}(t)$ . For example, in Section 3.5 we construct $P_{n}^{\varepsilon}(t)$ as a perturbed version of $P_{n}(t)$ .

A.7 Calculations for the regression example

The gradient for the averaged loss $\bar{\ell}(w,S_{n})$ and $\bar{\ell}(w,S_{n}^{-i})$ are

	$\displaystyle\nabla\bar{\ell}(w,S_{n})$	$\displaystyle=\frac{1}{2}\left(w^{\top}x_{1}-y_{1}\right)x_{1}+\frac{1}{2}% \left(w^{\top}x_{2}-y_{2}\right)x_{2}$
	$\displaystyle\nabla\bar{\ell}(w,S_{n}^{-1})$	$\displaystyle=\frac{n-2}{2(n-1)}\left(w^{\top}x_{1}-y_{1}\right)x_{1}+\frac{n}% {2(n-1)}\left(w^{\top}x_{2}-y_{2}\right)x_{2}$
	$\displaystyle\nabla\bar{\ell}(w,S_{n}^{-2})$	$\displaystyle=\frac{n}{2(n-1)}\left(w^{\top}x_{1}-y_{1}\right)x_{1}+\frac{n-2}% {2(n-1)}\left(w^{\top}x_{2}-y_{2}\right)x_{2}.$

The averaged contraction factor is

	$\displaystyle\bar{c}_{n}(t)$	$\displaystyle=\frac{\frac{1}{n}\sum_{i=1}^{n}\nabla\ell(w,z_{i})\cdot\nabla% \bar{\ell}(w,S_{n}^{-i})\big{\|}^{w_{n}^{-i}(t)}_{w_{n}(t)}}{\bar{\Delta}_{n}(t)}$
		$\displaystyle=\frac{\frac{1}{2}\left(\nabla\ell(w,z_{1})\cdot\nabla\bar{\ell}(% w,S_{n}^{-1})\big{\|}^{w_{n}^{-1}(t)}_{w_{n}(t)}+\nabla\ell(w,z_{2})\cdot\nabla% \bar{\ell}(w,S_{n}^{-2})\big{\|}^{w_{n}^{-2}(t)}_{w_{n}(t)}\right)}{\frac{1}{2}% \left(\frac{1}{2}(w^{\top}x_{1}-y_{1})^{2}\big{\|}^{w_{n}^{-1}(t)}_{w_{n}(t)}+% \frac{1}{2}(w^{\top}x_{2}-y_{2})^{2}\big{\|}^{w_{n}^{-2}(t)}_{w_{n}(t)}\right)}$
		$\displaystyle=\frac{\frac{1}{2}\left(\frac{n-2}{2(n-1)}(w^{\top}x_{1}-y_{1})^{% 2}\big{\|}^{w_{n}^{-1}(t)}_{w_{n}(t)}+\frac{n-2}{2(n-1)}(w^{\top}x_{2}-y_{2})^{% 2}\big{\|}^{w_{n}^{-2}(t)}_{w_{n}(t)}\right)}{\frac{1}{2}\left(\frac{1}{2}(w^{% \top}x_{1}-y_{1})^{2}\big{\|}^{w_{n}^{-1}(t)}_{w_{n}(t)}+\frac{1}{2}(w^{\top}x_% {2}-y_{2})^{2}\big{\|}^{w_{n}^{-2}(t)}_{w_{n}(t)}\right)}=\frac{n-2}{2(n-1)}.$

The propagator $\Omega_{n}^{\varepsilon}(t)$ of the evolution $\differential\vec{r}_{n}(t)/\differential t=-P_{n}^{\varepsilon}(t)\vec{r}_{n}% (t)/n$ is

\displaystyle\Omega_{n}^{\varepsilon}(t)=\exp\left(-\frac{\int_{0}^{t}P_{n}^{% \varepsilon}(s)\differential s}{n}\right)=U\exp\left(-\frac{\int_{0}^{t}% \Lambda^{\varepsilon}(s)\differential s}{n}\right)U^{\top}.

Hence, the effective metric $K_{n}(0,t)$ can be calculated easily as

	$\displaystyle K_{n}(0,t)$	$\displaystyle=\frac{\int_{0}^{t}U\exp\left(-\frac{2\int_{0}^{s}\Lambda^{% \varepsilon}(u)\differential u}{n}\right)\left(I-\frac{\Lambda^{\varepsilon}(s% )}{n}\right)U^{\top}\exp(-(t-s)\bar{c})\differential s}{n-1}$
		$\displaystyle=U\Lambda^{K}(t)U^{\top}$

where $\Lambda^{K}(t)=[\lambda^{K}_{1}(t),\dots,\lambda^{K}_{n}(t)]$ , and we have

	$\displaystyle\lambda^{K}_{1}(t)=\lambda^{K}_{2}(t)=\lambda(t)$	$\displaystyle:=\frac{(1-\bar{c})^{-1}/2}{n-1}\left(\exp(-\bar{c}t)-\exp(t)% \right),$
	$\displaystyle\lambda^{K}_{3}(t)=\dots=\lambda^{K}_{n}(t)=\lambda^{\prime}(t)$	$\displaystyle:=\frac{\int_{0}^{t}\exp\left(-\int_{0}^{s}\varepsilon(u)% \differential u\right)\cdot\exp\left(-(t-s)\bar{c}\right)\cdot\left(1-\frac{% \varepsilon(s)}{2}\right)\differential s}{n-1}$

For $\varepsilon(t)=\bar{\varepsilon}\left(1_{[0,1]}(t)+1_{[1,\infty]}(t)/t^{2}\right)$ , $\exp\left(-\int_{0}^{s}\varepsilon(u)\differential u\right)\in[\exp(-2\bar{% \varepsilon}),1]$ , $1-\varepsilon(s)/2\in[1-\bar{\varepsilon}/2,1]$ , hence, for $\bar{\varepsilon}$ small enough, $\frac{1-\exp\left(-\bar{c}t\right)}{2\bar{c}(n-1)}\leq\lambda^{\prime}(t)\leq% \frac{1-\exp\left(-\bar{c}t\right)}{\bar{c}(n-1)}$ . Hence, we have $\lambda(t)=\Theta(\exp(-\bar{c}t))$ , $\lambda^{\prime}(t)=\Theta(1)$ .

Here we also calculate the solution of $w_{n}(t)$ and $w_{n}^{-i}(t)$ although not required in the derivation of the effective gram matrix.

	$\displaystyle w_{n}(t)$	$\displaystyle=\int_{0}^{t}\exp\left(-(t-s)\left(\frac{1}{2}x_{1}x_{1}^{\top}+% \frac{1}{2}x_{2}x_{2}^{\top}\right)\right)\left(\frac{1}{2}y_{1}x_{1}+\frac{1}% {2}y_{2}x_{2}\right)ds$
		$\displaystyle=\left(1+\exp(-t/2)\right)\cdot(y_{1}x_{1})+\left(1+\exp(-t/2)% \right)\cdot(y_{2}x_{2})$

Similarly, we have

\displaystyle w^{-i}_{n}(t)

\displaystyle=\left(1-\exp(-\frac{n-2}{2(n-1)}t)\right)\cdot(y_{k}x_{k})+\left% (1-\exp(-\frac{n}{2(n-1)}t)\right)\cdot(y_{l}x_{l})

where $k=\lceil 2i/n\rceil$ , $l\in\{1,2\}/\{k\}$ . We can see that when trained on the dataset $S_{n}^{-i}$ , the progress on the direction $x_{\lceil 2i/n\rceil}$ is slightly less than the other direction, which introduces the non-zero averaged loss difference $\bar{\Delta}_{n}(t)$ during training.

A.8 Approximation of the contraction factor $\bar{c}_{n}$

In this section, we analyze the averaged version of the batch-wise contraction factor as introduced in Section B.1. By the fundamental theorem of calculus, we have the following expression of the numerator and denominator of the averaged contraction factor $\bar{c}_{n}$ .

		$\displaystyle\nabla\bar{\ell}(w,S_{(m)})\cdot\nabla\bar{\ell}(w,S_{n}^{-(m)})% \big{\|}^{w_{n}^{-(m)}(t)}_{w_{n}(t)}$
	$\displaystyle=$	$\displaystyle\int_{0}^{1}\left(\nabla\bar{\ell}(w,S_{n}^{-(m)})\nabla^{2}\bar{% \ell}(w,S_{(m)})+\nabla\bar{\ell}(w,S_{(m)})\nabla^{2}\bar{\ell}(w,S_{n}^{-(m)% })\big{\|}_{w=h(u)}\right)\cdot\left(w_{n}^{-(m)}(t)-w_{n}(t)\right)% \differential u,$

\displaystyle\bar{\Delta}_{n}^{-(m)}(t)=\bar{\ell}(w,S_{(m)})\big{|}^{w_{n}^{-% (m)}(t)}_{w_{n}(t)}=\int_{0}^{1}\nabla\bar{\ell}(w,S_{(m)})\big{|}_{w=h(u)}% \cdot\left(w_{n}^{-(m)}(t)-w_{n}(t)\right)\differential u.

where $h(u)=w_{n}(t)+u(w_{n}^{-(m)}(t)-w_{n}(t))$ , $u\in[0,1]$ is the line segment intersecting $w_{n}(t)$ and $w_{n}^{-(m)}(t)$ .

We approximate $\nabla\bar{\ell}(w,S_{(m)})$ , $\nabla\bar{\ell}(w,S_{(m)})$ by $\nabla\bar{\ell}(w,S_{n})$ , approximate $\nabla^{2}\bar{\ell}(w,S_{(m)})$ , $\nabla^{2}\bar{\ell}(w,S_{(m)})$ by $\nabla^{2}\bar{\ell}(w,S_{n})$ . We approximate the integral by the value at the point $u=0$ , where $h(u)=w_{n}(t)$ , then we have the following approximation of $\bar{c}_{n}$ .

\displaystyle\bar{c}_{n}(t)\approx\frac{\nabla\bar{\ell}(w_{n}(t),S_{n})^{\top% }\nabla^{2}\bar{\ell}(w_{n}(t),S_{n})\mathbb{E}_{(m)}\left[w_{n}^{-(m)}(t)-w_{% n}(t)\right]}{\nabla\bar{\ell}(w_{n}(t),S_{n})^{\top}\mathbb{E}_{(m)}\left[w_{% n}^{-(m)}(t)-w_{n}(t)\right]}.

Appendix B Experimental Details

Dataset

We use the MNIST and CIFAR10 datasets for experiments in Section 4. We do experiments on 10-classes and 2-classes (we select classes 0,3) problems on both MNIST and CIFAR10 (denoted as MNIST-10, MNIST-2, CIFAR-10, CIFAR-2 respectively), and 5-classes (we select classes 0,1,2,3,4) problem on MNIST (denoted as MNIST0-5). For all experiments, we choose $n/m=10$ .

Architectures

We use LeNet-5 (a network with two convolutional layers of 20 and 50 channels respectively, both of 5×5 kernel size, and a fully-connected layer with 500 hidden neurons), LeNeT-5-GS (the original LeNeT-5 with an additional gray-scale layer), WRN-4-4 (wide residual network with 4 layers and a widening factor of 4, the batch normalization layers are all replaced with layer normalization layers Ba et al. (2016)) and FC (two layer fully-connected net) for training, We use two layer fully-connected net for synthetic data generation.

Synthetic data generation

We created two types of synthetic datasets: 1) Datasets Syn- $(a,b)$ is created by modifying the labeling regime of MNIST dataset.

•

Approximate the second moment matrix of input $\mathbb{E}[xx^{\top}]$ by its empirical version $X^{\top}X/n$ calculated by 10000 samples from the original MNIST training set.
•

Eigenvalue decomposition of the empirical second moment matrix $X^{\top}X/n=Q\operatorname{\text{diag}}(L)Q^{\top}$ , where $L$ denotes the eigen spectrum sorted from the largest to the smallest.
•

Project the input of training set (except for the samples used for calculating empirical second moment matrix) and validation set of MNIST onto $Q_{a:b}$ . Whiten each pixel of the projection.
•

Relabel the original input by a teacher network with random weights applied to the projected input.

2) Datasets Gaussian- $\alpha$ is created with Gaussian data with different covariance matrices, labeled by a teacher network with random weights.

•

Create covariance matrix $A$ with $i$ -th eigenvalue being $\exp(-\alpha i)$ . The eigenvalue decomposition of $A$ is $A=Q\operatorname{\text{diag}}(L)Q^{\top}$ .
•

Sample the input from the multivariate Gaussian distribution $N(0,A)$ .
•

Project the input onto $Q_{1:10}$ . Whiten each elements of the projection.
•

Label the original input by a teacher network with random weights applied to the projected input.

3) Dataset MNIST(random label) is created by randomly assigning labels to the original MNIST inputs, according to a uniform distribution on the ten classes {0,1,2,3,4,5,6,7,8,9}.

B.1 Contraction and perturbation factors for the omitting- $m$ -samples setting in Section 4

In the omitting $m$ -samples setting, by similar calculations as in Appendix A, the batch-wise contraction and perturbation factors are

\displaystyle c_{n}^{-(m)}(t)=\frac{\nabla\bar{\ell}(w,S_{(m)})\cdot\nabla\bar% {\ell}(w,S_{n}^{-(m)})\big{|}^{w_{n}^{-(m)}(t)}_{w_{n}(t)}}{\Delta_{n}^{-(m)}(% t)},

\displaystyle\epsilon_{n}^{-(m)}(t)=\nabla\bar{\ell}(w,S_{(m)})\cdot\left(% \nabla\bar{\ell}(w,S_{n})-\nabla\bar{\ell}(w,S_{n}^{-(m)})\right)\bigg{|}_{w_{% n}(t)}.

The averaged contraction and perturbation factors are

\displaystyle\bar{c}_{n}(t)=\frac{\operatorname*{\mathbb{E}}_{(m)}\left[\nabla% \bar{\ell}(w,S_{(m)})\cdot\nabla\bar{\ell}(w,S_{n}^{-(m)})\big{|}^{w_{n}^{-(m)% }(t)}_{w_{n}(t)}\right]}{\bar{\Delta}_{n}(t)},

\displaystyle\bar{\epsilon}_{n}(t)=\frac{\tr\hat{\Sigma}(t)}{n-1},\quad\hat{% \Sigma}_{n}(t)=\operatorname*{\text{Cov}}_{z\sim\text{Unif}(S_{n})}\nabla\ell(% w_{n}(t),z),

where $\hat{\Sigma}_{n}(t)$ represents the covariance matrix of $\nabla\ell(w_{n}(t),z)$ for $z$ sampled uniformly from the dataset $S_{n}$ . Note that the averaged contraction factor $\bar{\epsilon}_{n}(t)$ for removed- $m$ -samples settings are the same for different $m$ ’s.

B.2 The analysis of the increment of averaged loss difference $\bar{\Delta}_{n}(t)-\bar{\Delta}_{n}(t_{0})$ .

In this section, we consider the training process starting from time $t_{0}$ . Different trajectories $w_{n}^{-(m)}(\cdot)$ and $w_{n}(\cdot)$ are different at time $t_{0}$ , so the batchwise loss difference $\Delta^{-(m)}_{n}(t_{0})$ and averaged loss difference $\bar{\Delta}_{n}(t_{0})$ are nonzero in general. We now consider the increment $\bar{\Delta}_{n}(t)-\bar{\Delta}_{n}(t_{0})$ for $t>t_{0}$ . The evolution of $\bar{\Delta}_{n}(t)-\bar{\Delta}_{n}(t_{0})$ is,

\displaystyle\derivative{\left(\bar{\Delta}_{n}(t)-\bar{\Delta}_{n}(t_{0})% \right)}{t}=-\bar{c}_{n}(t)\left(\bar{\Delta}_{n}(t)-\bar{\Delta}_{n}(t_{0})% \right)+\bar{\epsilon}_{n}(t),

where by revising the denominator in Eq. 8 and Eq. 9, we have

\displaystyle\bar{c}_{n}(t)=\frac{\operatorname*{\mathbb{E}}_{(m)}\left[\nabla% \bar{\ell}(w,S_{(m)})\cdot\nabla\bar{\ell}(w,S_{n}^{-(m)})\big{|}^{w_{n}^{-(m)% }(t)}_{w_{n}(t)}\right]}{\bar{\Delta}_{n}(t)-\bar{\Delta}_{n}(t_{0})},

\displaystyle\bar{\epsilon}_{n}(t)=\frac{\tr\hat{\Sigma}(t)}{n-1},\quad\hat{% \Sigma}_{n}(t)=\operatorname*{\text{Cov}}_{z\sim\text{Unif}(S_{n})}\nabla\ell(% w_{n}(t),z).

The evolution of the residual starting from $t_{0}$ is

\displaystyle\vec{r}_{n}(t)=\Omega_{n}(t_{0},t)\vec{r}_{n}(t_{0}).

Combining with Lemma 10, we have the following decomposition for the covariance trace,

\displaystyle\tr\hat{\Sigma}_{n}(t)=\frac{1}{n}\vec{r}_{n}(t_{0})^{\top}\Omega% _{n}(t_{0},t)^{\top}\left(M_{n}(t)-\frac{H_{n}(t)}{n}\right)\Omega_{n}(t_{0},t% )\vec{r}_{n}(t_{0}).

By similar arguments as in Theorem 13, we have the quadratic form expression for the increment of averaged loss difference $\bar{\Delta}_{n}(t)-\bar{\Delta}_{n}(t_{0})$ ,

\displaystyle\bar{\Delta}_{n}(t)-\bar{\Delta}_{n}(t_{0})=\vec{r}_{n}(t_{0})^{% \top}K_{n}(t_{0},t)\vec{r}_{n}(t_{0}).

where

\displaystyle K_{n}(t_{0},t)=\frac{\int_{t_{0}}^{t}\Omega_{n}(t_{0},s)^{\top}% \left(M_{n}(s)-\frac{H_{n}(s)}{n}\right)\Omega_{n}(t_{0},s)\exp\left(-\int_{s}% ^{t}\bar{c}_{n}(u)\differential u\right)\differential s}{n-1}.

Let

K_{n}(t_{0})\triangleq\lim_{t\to\infty}K_{n}(t_{0},t)

when the limit exists, then we have

\displaystyle\bar{\Delta}_{n}(\infty)-\bar{\Delta}_{n}(t_{0})=\vec{r}_{n}(t_{0% })^{\top}K_{n}(t_{0})\vec{r}_{n}(t_{0}),

where $\bar{\Delta}_{n}(\infty):=\lim_{t\to\infty}\bar{\Delta}_{n}(t)$ , and the limit exists. We call $K_{n}(t_{0})$ the effective Gram matrix of a neural network starting from $t_{0}$ .

B.3 The approximation of generalization gap

Table S.1 compares the generalization gaps, averaged loss difference and its approximations for a variety of different architectures and datasets. We can see that in almost all cases, these quantities are very close, indicating that the approximation of averaged loss difference represents well of the generalization gap. The small generalization errors provide guarantees for the quality of the used models. The small training loss shows that the models are trained till near interpolation. The second last column $\bar{\sigma}(K_{n})$ shows the estimates of the kernel magnitude. We can see from the last column of the table that when the same datasets are used, even the number of samples are different, the norm of the initial residual are almost the same (eg. last column of row 2-6 in the table for the results of MNIST-5 with different number of samples), which justifies the idea of normalizing the initial residual by $1/\sqrt{n}$ , and shows that the normalization makes the effective Gram matrix decomposition of datasets with different samples comparable.

Architecture	Dataset	# samples	$\bar{\Delta}_{n}(c,\hat{\epsilon},t)$	$\bar{\Delta}_{n}(c,\epsilon,t)$	$\bar{\Delta}_{n}(t)$	$\delta R(S_{n},t)$	$\delta\bar{R}(S^{-(m)}_{n},t)$	Generalization error	$R_{\text{train}}(S_{n},t)$	$\bar{\sigma}(K_{n})$	$\|\|\vec{r}_{n}(0)\|\|^{2}_{2}$
FC	MNIST-2	1100	0.018	0.018	0.021	0.051	0.053	0.028	0.034	0.595644	0.513084
FC	MNIST-5	55	0.255	0.266	0.271	0.336	0.360	0.134	0.036	7.374783	0.808355
FC	MNIST-5	110	0.221	0.224	0.232	0.247	0.258	0.092	0.038	6.687616	0.820471
FC	MNIST-5	550	0.127	0.128	0.142	0.095	0.101	0.040	0.038	4.664132	0.821945
FC	MNIST-5	1100	0.111	0.112	0.128	0.084	0.090	0.033	0.038	4.567158	0.823081
FC	MNIST-5	2200	0.093	0.089	0.092	0.070	0.076	0.028	0.036	3.780354	0.824114
FC	MNIST-10	1100	0.469	0.472	0.476	0.448	0.462	0.134	0.036	22.119434	0.911234
LENET-5	MNIST-10	1100	0.241	0.243	0.228	0.203	0.221	0.075	0.048	6.582657	0.899971
FC	CIFAR-2	1100	0.346	0.446	0.342	0.420	0.433	0.154	0.037	10.059950	0.513471
LENET-5-GS	CIFAR-2	2200	0.461	0.438	0.461	0.495	0.553	0.140	0.042	5.696342	0.509838
LENET-5	CIFAR-2	1100	0.627	0.642	0.375	0.450	0.504	0.137	0.043	8.446142	0.496051
WRN-4-4	CIFAR-2	1100	0.111	0.147	0.110	0.187	0.204	0.107	0.090	0.535692	0.498821
FC	syn-(1,10)	1100	0.221	0.237	0.212	0.160	0.180	0.084	0.038	4.419841	0.571913
FC	syn-(11,20)	1100	0.234	0.293	0.371	0.368	0.399	0.137	0.039	5.008938	0.512861
FC	syn-(21,30)	1100	0.235	0.391	0.484	0.421	0.440	0.159	0.040	6.834943	0.501674
FC	syn-(31,40)	1100	0.347	0.583	0.546	0.529	0.554	0.192	0.041	8.572351	0.483528
FC	syn-(41,50)	1100	0.373	0.682	0.599	0.549	0.584	0.192	0.037	10.590858	0.520236
FC	MNIST(random label)	55	3.265	3.789	3.276	4.831	4.770	0.902	0.041	55.534283	0.906837
FC	Gaussian-1	1100	0.062	0.057	0.053	0.063	0.068	0.039	0.038	2.711061	0.601203
FC	Gaussian-0.5	1100	0.087	0.115	0.126	0.122	0.135	0.064	0.038	4.457899	0.569315
FC	Gaussian-0.1	1100	0.188	0.198	0.213	0.227	0.239	0.110	0.038	3.605311	0.517485
FC	Gaussian-0.05	1100	0.251	0.259	0.257	0.280	0.295	0.124	0.035	3.497570	0.513735
FC	Gaussian-0.01	1100	0.488	0.502	0.488	0.518	0.533	0.220	0.039	3.283361	0.518636

Table S.1: Statistics of effective Gram matrix approximation for a variety of different architectures and datasets. See Appendix B for the details of the datasets and architectures. See Section 4.1 for the definitions of generalization gaps

\delta R(S_{n},t)

\delta\bar{R}(S_{n}^{-(m)},t)

, averaged loss difference

\bar{\Delta}_{n}(t)

and its approximations

\bar{\Delta}_{n}(c,\hat{\epsilon},t)

\bar{\Delta}_{n}(c,\epsilon,t)

. “Generalization error” in this table refers to the averaged zero-one loss on test dataset.

R_{\text{train}}(S_{n},t)

refers to the training loss on dataset

S_{n}

. For the last two columns,

\bar{\sigma}(K_{n})

refers to the mean of the eigenvalues of effective kernel,

\norm{\vec{r}_{n}(0)}^{2}_{2}

refers to the squared norm of initial residual. In this table, we evaluate all the quantities (except for

\norm{\vec{r}_{n}(0)}^{2}_{2}

) at the end of training.

B.4 Estimation of the propagator $\Omega_{n}(t_{0},t)$

The propagator $\Omega_{n}(t_{0},t)$ plays a big role in evolution of the residual $\vec{r}_{n}(t)$ in Eq. 13 and the effective kernel in $K_{n}$ Eq. 18. We next introduce two different ways of approximating $\Omega_{n}(t_{0},t)$ .

Product approximation

By a discrete approximation of the evolution of $\vec{r}_{n}(t)$ for a small $\eta=o(1)$ , we have $\frac{\vec{r}_{n}(t+\eta)-\vec{r}_{n}(t)}{\eta}=-\frac{1}{n}P_{n}(t)\vec{r}_{n% }(t).$ We can derive a discrete approximation of $\Omega_{n}(t_{0},t)$ for $t=t_{0}+T\eta$ to be

\Omega_{n}(t)\approx\prod_{k=0}^{T-1}\left(I-\frac{\eta}{n}P_{n}(t_{0}+k\eta)% \right),

(21)

with the products taken from the right.

Magnus expansion (Magnus, 1954)

We may write the propagator $\Omega_{n}(t_{0},t)$ using its Lie algebra as $\Omega_{n}(t_{0},t)=\exp\left(\omega_{n}(t_{0},t)\right)$ . When $P_{n}(t)$ does not commute with itself at different times, the Magnus expansion provides a way to write this time-ordered exponential in terms of an infinite series $\omega_{n}(t_{0},t)=\sum_{k=1}^{\infty}\omega_{n}^{k}(t_{0},t)$ where the first two terms are

	$\displaystyle\omega_{n}^{1}(t_{0},t)$	$\displaystyle=-\frac{1}{n}\int_{t_{0}}^{t}P_{n}(t_{1})\differential t_{1},$		(22)
	$\displaystyle\omega_{n}^{2}(t_{0},t)$	$\displaystyle=\frac{1}{2n^{2}}\int_{t_{0}}^{t}\int_{t_{0}}^{t_{1}}[P_{n}(t_{1}% ),P_{n}(t_{2})]\differential t_{2}\differential t_{1},$		(22)

where $[A,B]\equiv AB-BA$ is the commutator of matrices $A$ and $B$ . Note that $\omega_{n}^{2}(t_{0},t)=0$ if $P_{n}(t_{1})$ and $P_{n}(t_{2})$ commute $\forall t_{1},t_{2}>t_{0}$ . Magnus expansion can be approximated by numerical integration.

Fig. S.1 shows the approximation results for $\bar{\epsilon}_{n}$ and $\bar{\Delta}_{n}$ when different approximations of $\Omega_{n}(t_{0},t)$ are used. We can see from the plot that the Magnus expansion gives good approximation when $t$ is small, but the approximations diverge from the true values of $\bar{\epsilon}_{n}$ and $\bar{\Delta}_{n}$ for large $t$ . The second order Magnus expansion is even worse than the first order one, this could be result form the overshooting of $w_{n}^{2}(t_{0},t)$ . The term $[P_{n}(t_{1}),P_{n}(t_{2})]$ being highly oscillatory, and the step size being too large can be possible reasons. In comparison, the product approximation performs well till the end of training. Hence, for all experiments in Section 4, we calculate the effective Gram matrix through product approximation.

	$\displaystyle\frac{\differential\Delta_{n}^{-i}(t)}{\differential t}$	$\displaystyle=\frac{\differential\left(\ell(w_{n}^{-i}(t),z_{i})-\ell(w_{n}(t)% ,z_{i})\right)}{\differential t}$
		$\displaystyle=-\nabla\ell(w,z_{i})\cdot\nabla\bar{\ell}(w,S_{n}^{-i})\big{\|}_{% w_{n}^{-i}(t)}-\left(-\nabla\ell(w,z_{i})\cdot\nabla\bar{\ell}(w,S_{n})\big{\|}% _{w_{n}(t)}\right)$
		$\displaystyle=-\left(\nabla\ell(w,z_{i})\cdot\nabla\bar{\ell}(w,S_{n}^{-i})% \big{\|}_{w_{n}^{-i}(t)}-\nabla\ell(w,z_{i})\cdot\nabla\bar{\ell}(w,S_{n}^{-i})% \big{\|}_{w_{n}(t)}\right)$
		$\displaystyle+\left(-\nabla\ell(w,z_{i})\cdot\nabla\bar{\ell}(w,S_{n}^{-i})% \big{\|}_{w_{n}(t)}+\nabla\ell(w,z_{i})\cdot\nabla\bar{\ell}(w,S_{n})\big{\|}_{w% _{n}(t)}\right)$
		$\displaystyle=-\left(\nabla\ell(w,z_{i})\cdot\nabla\bar{\ell}(w,S_{n}^{-i})% \big{\|}_{w_{n}^{-i}(t)}-\nabla\ell(w,z_{i})\cdot\nabla\bar{\ell}(w,S_{n}^{-i})% \big{\|}_{w_{n}(t)}\right)$
		$\displaystyle+\nabla\ell(w,z_{i})\left(\nabla\bar{\ell}(w,S_{n})-\nabla\bar{% \ell}(w,S_{n}^{-i})\right)\big{\|}_{w_{n}(t)}$

	$\displaystyle\bar{c}_{n}(t)$	$\displaystyle=\frac{\frac{1}{n}\sum_{i=1}^{n}\nabla\ell(w,z_{i})\cdot\nabla% \bar{\ell}(w,S_{n}^{-i})\big{\|}^{w_{n}^{-i}(t)}_{w_{n}(t)}}{\bar{\Delta}_{n}(t)}$
		$\displaystyle=\frac{\frac{1}{2}\left(\nabla\ell(w,z_{1})\cdot\nabla\bar{\ell}(% w,S_{n}^{-1})\big{\|}^{w_{n}^{-1}(t)}_{w_{n}(t)}+\nabla\ell(w,z_{2})\cdot\nabla% \bar{\ell}(w,S_{n}^{-2})\big{\|}^{w_{n}^{-2}(t)}_{w_{n}(t)}\right)}{\frac{1}{2}% \left(\frac{1}{2}(w^{\top}x_{1}-y_{1})^{2}\big{\|}^{w_{n}^{-1}(t)}_{w_{n}(t)}+% \frac{1}{2}(w^{\top}x_{2}-y_{2})^{2}\big{\|}^{w_{n}^{-2}(t)}_{w_{n}(t)}\right)}$
		$\displaystyle=\frac{\frac{1}{2}\left(\frac{n-2}{2(n-1)}(w^{\top}x_{1}-y_{1})^{% 2}\big{\|}^{w_{n}^{-1}(t)}_{w_{n}(t)}+\frac{n-2}{2(n-1)}(w^{\top}x_{2}-y_{2})^{% 2}\big{\|}^{w_{n}^{-2}(t)}_{w_{n}(t)}\right)}{\frac{1}{2}\left(\frac{1}{2}(w^{% \top}x_{1}-y_{1})^{2}\big{\|}^{w_{n}^{-1}(t)}_{w_{n}(t)}+\frac{1}{2}(w^{\top}x_% {2}-y_{2})^{2}\big{\|}^{w_{n}^{-2}(t)}_{w_{n}(t)}\right)}=\frac{n-2}{2(n-1)}.$

An Effective Gram Matrix Characterizes Generalization in Deep Networks

Abstract

1 Introduction

Probably-approximately correct (PAC) frameworks

Simplified models of deep networks

Non-worst-case generalization bounds

1.1 Contributions

2 Preliminaries

Dataset

Predictor and the loss function

Gradient flow

Generalization gap

2.1 Contraction theory

Theorem 1 (Theorem 2.1 from Tsukamoto et al. (2021)).

Theorem 2 (Theorem 2.3 from Tsukamoto et al. (2021)).

3 Methods

3.1 Approximation of the generalization gap

Lemma 3.

3.2 Evolution of the averaged loss difference

Lemma 4.

Remark 5 (Deviations from classical contraction theory).

Remark 6 (Classical contraction theory with uniform bounds on contraction and perturbation).

Remark 7 (Comparing trajectories in terms of their loss vs. weight space difference).

Remark 8 (Relationship to information theoretic generalization bounds).

Remark 9 (Some intuition on the contraction factor).

3.3 Evolution of the residual and perturbation

Lemma 10.

Remark 11.

Remark 12 (Relationship to weight norm based bound).

3.4 Effective Gram matrix for neural networks

Theorem 13.

Lemma 14.

Remark 15.

Remark 16 (Data and architecture dependent generalization bound).

Remark 17 (Generalization gap of kernel machines).

3.5 An example calculation of the (effective) Gram matrix for linear regression

Remark 18 (Comparison with Arora et al. (2019)).

4 Experimental Validation

Datasets

Architectures

Constructing perturbed datasets

Calculating quantities that pertain to the generalization gap

4.1 Theorem 13 leads to a good approximation of the generalization gap

Remark 19.

4.2 Initial residual lies primarily in the subspace of effective Gram matrix with small eigenvalues

Remark 20 (Comparing the statistics for different numbers of samples).

4.3 As training proceeds, the residual projects more into the principal subspace of the effective Gram matrix

4.4 Effective Gram matrix for different datasets

4.5 Effective Gram matrix for different architectures

4.6 Effective Gram matrix for different number of samples

5 Conclusion

Acknowledgment

References

Appendix A Proofs and Calculations in Section 3

A.1 Proof of Lemma 3

The scenarios when the assumptions in Lemma 3 hold

Definition 21.

Lemma 22.

Remark 23.

A.2 Proof of Lemma 4

A.3 Evolution of Δ¯n⁢(t)subscript¯Δ𝑛𝑡\bar{\Delta}_{n}(t)over¯ start_ARG roman_Δ end_ARG start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT ( italic_t )

Evolution of 𝔼[Δ¯n⁢(t)]𝔼subscript¯Δ𝑛𝑡\operatorname*{\mathbb{E}}\left[\bar{\Delta}_{n}(t)\right]blackboard_E [ over¯ start_ARG roman_Δ end_ARG start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT ( italic_t ) ]:

A.4 Evolution of r→n⁢(t)subscript→𝑟𝑛𝑡\vec{r}_{n}(t)over→ start_ARG italic_r end_ARG start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT ( italic_t )

A.5 Proof of Lemma 10

A.6 Proof of Theorem 13

Remark 24.

A.7 Calculations for the regression example

A.8 Approximation of the contraction factor c¯nsubscript¯𝑐𝑛\bar{c}_{n}over¯ start_ARG italic_c end_ARG start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT

Appendix B Experimental Details

Dataset

Architectures

Synthetic data generation

B.1 Contraction and perturbation factors for the omitting-m𝑚mitalic_m-samples setting in Section 4

B.3 The approximation of generalization gap

B.4 Estimation of the propagator Ωn⁢(t0,t)subscriptΩ𝑛subscript𝑡0𝑡\Omega_{n}(t_{0},t)roman_Ω start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT ( italic_t start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT , italic_t )

Product approximation

Magnus expansion (Magnus, 1954)

An Effective Gram Matrix Characterizes Generalization
in Deep Networks

A.3 Evolution of $\bar{\Delta}_{n}(t)$

Evolution of $\operatorname*{\mathbb{E}}\left[\bar{\Delta}_{n}(t)\right]$ :

A.4 Evolution of $\vec{r}_{n}(t)$

A.8 Approximation of the contraction factor $\bar{c}_{n}$

B.1 Contraction and perturbation factors for the omitting- $m$ -samples setting in Section 4

B.4 Estimation of the propagator $\Omega_{n}(t_{0},t)$