Understanding LLMs Requires More Than Statistical Generalization
Patrik Reizinger, Szilvia Ujváry, Anna Mészáros, Anna Kerekes, Wieland Brendel, Ferenc Huszár

Abstract: The last decade has seen blossoming research in deep learning theory attempting to answer, "Why does deep learning generalize?" A powerful shift in perspective precipitated this progress: the study of overparametrized models in the interpolation regime. In this paper, we argue that another perspective shift is due, since some of the desirable qualities of LLMs are not a consequence of good statistical generalization and require a separate theoretical explanation. Our core argument relies on the observation that AR probabilistic models are inherently non-identifiable: models zero or near-zero KL divergence apart -- thus, equivalent test loss -- can exhibit markedly different behaviors. We support our position with mathematical examples and empirical observations, illustrating why non-identifiability has practical relevance through three case studies: (1) the non-identifiability of zero-shot rule extrapolation; (2) the approximate non-identifiability of in-context learning; and (3) the non-identifiability of fine-tunability. We review promising research directions focusing on LLM-relevant generalization measures, transferability, and inductive biases.

👉 https://lnkd.in/dMm27N7a #machinelearning
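The non-identifiability argument can be made concrete with a toy sketch (my own illustration, not an example from the paper): two autoregressive next-token models that agree exactly on every context seen in training, hence zero KL divergence and identical test loss, yet extrapolate differently on an unseen context.

```python
import math

# Toy illustration (my own, not from the paper): two autoregressive
# next-token models over a tiny vocabulary that agree on every context
# seen in training but extrapolate differently on the unseen context "bb".

def model_p(context):
    if context == "bb":  # never appears in the training data
        return {"a": 1.0, "b": 0.0, "<eos>": 0.0}
    return {"a": 0.5, "b": 0.5, "<eos>": 0.0}

def model_q(context):
    if context == "bb":  # same training behavior, different rule here
        return {"a": 0.0, "b": 0.0, "<eos>": 1.0}
    return {"a": 0.5, "b": 0.5, "<eos>": 0.0}

def kl(p, q):
    # KL(p || q), skipping zero-probability tokens in p
    return sum(p[t] * math.log(p[t] / q[t]) for t in p if p[t] > 0)

train_contexts = ["a", "b", "ab", "ba"]  # "bb" is out of distribution
train_kl = sum(kl(model_p(c), model_q(c)) for c in train_contexts)
print(train_kl)                       # 0.0 -> identical test loss
print(model_p("bb"), model_q("bb"))   # yet markedly different behavior
```

Both models are indistinguishable by any test-loss measurement on the training distribution, which is exactly why the zero-shot rule-extrapolation behavior needs an explanation beyond statistical generalization.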
Antonio Montano 🪄’s Post
More Relevant Posts
-
💥💥💥 Understanding Visual Feature Reliance through the Lens of Complexity
Thomas Fel, Louis Bethune, Andrew Kyle Lampinen, Thomas Serre, Katherine Hermann

Abstract: Recent studies suggest that deep learning models' inductive bias toward favoring simpler features may be one of the sources of shortcut learning. Yet, there has been limited focus on understanding the complexity of the myriad features that models learn. In this work, we introduce a new metric for quantifying feature complexity, based on V-information, that captures whether a feature requires complex computational transformations to be extracted. Using this V-information metric, we analyze the complexities of 10,000 features, represented as directions in the penultimate layer, extracted from a standard ImageNet-trained vision model. Our study addresses four key questions. First, we ask what features look like as a function of complexity and find a spectrum of simple to complex features present within the model. Second, we ask when features are learned during training: simpler features dominate early in training, and more complex features emerge gradually. Third, we investigate where within the network simple and complex features flow, and find that simpler features tend to bypass the visual hierarchy via residual connections. Fourth, we explore the connection between feature complexity and importance in driving the network's decision. We find that complex features tend to be less important. Surprisingly, important features become accessible at earlier layers during training, like a sedimentation process, allowing the model to build upon these foundational elements.

👉 https://lnkd.in/dYZmRehm #machinelearning
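A common depth-based proxy for this kind of complexity score (my assumption for illustration, not the authors' exact V-information estimator) is the earliest layer at which a probe can decode the feature above some accuracy threshold:

```python
# Hedged sketch: a depth-based proxy for feature complexity (assumed here
# for illustration; the paper's actual metric is a V-information estimator).
# A feature is "simple" if a probe decodes it from early layers already.

def feature_complexity(probe_accuracies, threshold=0.9):
    """probe_accuracies[l] = decoding accuracy of the feature from layer l.
    Returns the earliest layer reaching the threshold: lower = simpler."""
    for layer, acc in enumerate(probe_accuracies):
        if acc >= threshold:
            return layer
    return len(probe_accuracies)  # never decodable: maximally complex

simple_feature  = [0.95, 0.97, 0.99, 0.99]  # decodable from layer 0
complex_feature = [0.55, 0.60, 0.80, 0.93]  # only emerges at layer 3

print(feature_complexity(simple_feature))   # 0
print(feature_complexity(complex_feature))  # 3
```

The hypothetical accuracy vectors stand in for per-layer probe results on real activations; the point is only to show how "requires more computation to extract" turns into a single scalar per feature.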
-
Unlocking Smarter Training: Insights from the Telescoping Model The telescoping model opens up new possibilities for refining neural network training beyond traditional metrics like loss. This approach provides dynamic feedback throughout training, giving insights into complexity, stability, and the evolving capacity of a model to generalize. Inspired by these ideas, I envision a fully automated training framework that adapts not only learning rates but also optimization strategies, regularization, and batch sizes—guided by real-time, data-driven signals. This would be a leap from simple schedulers toward a smarter, feedback-driven system, finely tuning parameters based on true learning dynamics. This paper sparks a vision for precision training automation in deep learning. https://lnkd.in/ekQpAjcg
-
💥💥💥 Rethinking Deep Thinking: Stable Learning of Algorithms using Lipschitz Constraints
Jay Bear, Adam Prügel-Bennett, Jonathon Hare

Abstract: Iterative algorithms solve problems by taking steps until a solution is reached. Models in the form of Deep Thinking (DT) networks have been demonstrated to learn iterative algorithms in a way that can scale to different-sized problems at inference time using recurrent computation and convolutions. However, they are often unstable during training and have no guarantees of convergence/termination at the solution. This paper addresses the problem of instability by analyzing the growth in intermediate representations, allowing us to build models (referred to as Deep Thinking with Lipschitz Constraints (DT-L)) with many fewer parameters that provide more reliable solutions. Additionally, our DT-L formulation guarantees that the learned iterative procedure converges to a unique solution at inference time. We demonstrate that DT-L robustly learns algorithms which extrapolate to harder problems than those in the training set. We benchmark on the traveling salesperson problem, an NP-hard problem where DT fails to learn, to evaluate the capabilities of the modified system.

👉 https://lnkd.in/dqHW2ASs #machinelearning
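The convergence guarantee rests on a classical fact: iterating a map whose Lipschitz constant is below one contracts to a unique fixed point (Banach's fixed-point theorem). A toy sketch of that mechanism, not the DT-L architecture itself:

```python
# Toy sketch of why a Lipschitz constraint buys convergence (Banach's
# fixed-point theorem), not the DT-L architecture itself: each recurrent
# "thinking" step below is a contraction in h with Lipschitz constant 0.5.

def thinking_step(h, x, contraction=0.5):
    # Pull the hidden state toward the input with gain < 1.
    return [contraction * hi + (1 - contraction) * xi
            for hi, xi in zip(h, x)]

def deep_thinking(x, h0, iters=60):
    h = list(h0)
    for _ in range(iters):  # iterate until (numerical) convergence
        h = thinking_step(h, x)
    return h

x = [1.0, -2.0, 3.0]
from_zeros = deep_thinking(x, [0.0, 0.0, 0.0])
from_noise = deep_thinking(x, [5.0, -7.0, 2.0])
# Both initializations land on the same unique fixed point (here, x
# itself), mirroring DT-L's unique-solution guarantee at inference time.
print(from_zeros)
print(from_noise)
```

Because the per-step error shrinks by the contraction factor, the iteration count can be scaled up at inference time to harder problems without the representations blowing up, which is the instability the paper targets.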
-
📃 Scientific paper: On the Interplay Between Stepsize Tuning and Progressive Sharpening

Abstract: Recent empirical work has revealed an intriguing property of deep learning models: the sharpness (largest eigenvalue of the Hessian) increases throughout optimization until it stabilizes around a critical value at which the optimizer operates at the edge of stability, given a fixed stepsize (Cohen et al., 2022). We investigate empirically how the sharpness evolves when using stepsize tuners -- the Armijo linesearch and Polyak stepsizes -- that adapt the stepsize along the iterations to local quantities such as, implicitly, the sharpness itself. We find that the surprisingly poor performance of a classical Armijo linesearch in the deterministic setting may be well explained by its tendency to ever-increase the sharpness of the objective. On the other hand, we observe that Polyak stepsizes generally operate at the edge of stability or even slightly beyond, outperforming their Armijo and constant-stepsize counterparts in the deterministic setting. We conclude with an analysis suggesting that unlocking stepsize tuners requires an understanding of the joint dynamics of the stepsize and the sharpness.

Comment: Presented at the NeurIPS 2023 OPT Workshop.

Continued on ES/IODE ➡️ https://etcse.fr/I5da
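The edge-of-stability threshold is easiest to see on a one-dimensional quadratic (a textbook illustration, not the paper's experimental setup): gradient descent with stepsize eta contracts exactly when the sharpness stays below 2/eta, and diverges beyond it.

```python
# Textbook 1-D illustration of the edge of stability (not the paper's
# experiments): on f(x) = s * x**2 / 2 the sharpness (Hessian eigenvalue)
# is exactly s, and gradient descent with stepsize eta contracts iff
# s < 2 / eta. The "edge of stability" is the regime s ~ 2 / eta.

def gd_on_quadratic(sharpness, eta, steps=50, x0=1.0):
    x = x0
    for _ in range(steps):
        x -= eta * sharpness * x  # gradient of s * x**2 / 2 is s * x
    return abs(x)

eta = 0.1                         # stability threshold: 2 / eta = 20
stable   = gd_on_quadratic(sharpness=19.0, eta=eta)  # just below the edge
unstable = gd_on_quadratic(sharpness=21.0, eta=eta)  # just beyond it
print(stable)    # decays toward 0
print(unstable)  # blows up
```

A stepsize tuner that implicitly tracks the sharpness is therefore steering this threshold on the fly, which is why the paper argues the joint dynamics of stepsize and sharpness must be understood together.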
-
Knowledge revision from the old machine learning days becomes counterfactuals in the age of deep learning! Which minimal change to a feature value would also change the output derived from a learned model? Dimitrios Gunopulos shows how to minimize a combined validity, constraint, and sparsity loss to find counterfactuals. He then moves beyond local counterfactuals and computes feasible counterfactuals in real time -- another exciting talk at Intelligent Data Analysis 2024.
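The generic recipe can be sketched in a few lines (my formulation of the standard objective, not necessarily Gunopulos's algorithm): gradient descent on a validity term that pushes the model's score across the decision boundary, plus a proximity term that keeps the counterfactual close to the original input.

```python
import math

# Generic counterfactual-search sketch (my formulation of the standard
# objective, not necessarily Gunopulos's algorithm): perturb one feature
# by gradient descent on a validity term (push the score past the 0.5
# boundary, here toward a 0.6 target) plus a proximity term (stay close
# to the original input).

def model_score(x, w=2.0, b=-1.0):
    # A fixed logistic "learned model"; the weights are illustrative.
    return 1.0 / (1.0 + math.exp(-(w * x + b)))

def counterfactual(x0, target=0.6, lam=0.01, lr=0.1, steps=1000, w=2.0):
    x = x0
    for _ in range(steps):
        s = model_score(x)
        # d/dx (s - target)^2   = 2*(s - target)*s*(1 - s)*w   (validity)
        # d/dx lam*(x - x0)^2   = 2*lam*(x - x0)               (proximity)
        grad = 2 * (s - target) * s * (1 - s) * w + 2 * lam * (x - x0)
        x -= lr * grad
    return x

x0 = 0.0                  # originally classified negative (score < 0.5)
x_cf = counterfactual(x0)
print(model_score(x0))    # below the 0.5 boundary
print(model_score(x_cf))  # flipped across the boundary
print(x_cf - x0)          # the small feature change that did it
```

Feasibility and real-time constraints would add further terms or a different search procedure on top of this basic validity-plus-proximity trade-off.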
-
Example Difficulty from the Lens of Prediction Depth

The "Prediction Depth" concept offers a granular measure of the computational challenge associated with making predictions for specific inputs. It also uncovers correlations between an input's prediction depth and crucial aspects of model behavior, including uncertainty, confidence, accuracy, and learning speed. Link to the paper - https://lnkd.in/e-s7tDfP.

Key Insights:
1. Insights into Model Dynamics: Shows how early layers generalize while later layers memorize, reshaping our understanding of network behavior.
2. Robustness of Prediction Depth: Prediction depth serves as a robust measure of example difficulty, consistent across datasets and architectures.
3. Implications for Training: Offers insights into how networks prioritize learning easy data and simple functions first, influencing training dynamics.
4. Margin Insights: On average, deep neural networks exhibit wider input and output margins -- common measures of "local simplicity" -- in the vicinity of data points with smaller prediction depths.

#machinelearning #neuralnetworks #ExampleDifficulty #PredictionDepth
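A minimal sketch of the depth score (simplified; the paper fits k-NN probes on intermediate activations): take the earliest layer from which the probe's prediction agrees with the network's final output and never changes again.

```python
# Simplified sketch of prediction depth (the paper uses k-NN probes on
# each layer's activations; here the per-layer probe predictions are
# given directly): the earliest layer from which every later probe
# agrees with the network's final output.

def prediction_depth(probe_predictions, final_prediction):
    """probe_predictions[l] = class predicted by the probe at layer l.
    Returns a per-example difficulty score (lower = easier example)."""
    depth = len(probe_predictions)  # default: never settles
    for layer in range(len(probe_predictions) - 1, -1, -1):
        if probe_predictions[layer] == final_prediction:
            depth = layer
        else:
            break
    return depth

easy_example = ["cat", "cat", "cat", "cat"]   # settled at layer 0
hard_example = ["dog", "dog", "cat", "cat"]   # settles only at layer 2
print(prediction_depth(easy_example, "cat"))  # 0
print(prediction_depth(hard_example, "cat"))  # 2
```

The hypothetical label lists stand in for real probe outputs; the correlation findings above are about how this scalar tracks confidence, accuracy, and margins across examples.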
Deep Learning Through the Lens of Example Difficulty
-
I'm so excited to share that my journey into the world of deep learning models has taken a major step forward with the publication of our article titled "Research and Development of a Modern Deep Learning Model for Emotional Analysis Management of Text Data" in Applied Sciences. So, so, so in time!!! 🚀 Huge thanks to my co-authors for their invaluable contributions! Check out the full article: https://lnkd.in/d3nuHdEs #CSCES #DeepLearning #TextAnalysis #Research #AppliedSciences
-
TimeGPT: The forecasting community is divided over the superiority of deep learning approaches, with some practitioners questioning their usefulness, accuracy, and complexity despite their success in other fields. Traditional statistical methods and newer machine learning models like XGBoost and LightGBM have shown promising results in both competitions and practical applications. Deep learning offers scalability, flexibility, and the potential for higher accuracy without complex feature engineering, aiming to simplify the forecasting process and handle large data volumes effectively. However, skepticism persists about the actual performance benefits of deep learning in time series analysis compared to simpler models. The lack of large-scale, standardized datasets for deep learning in time series is a notable hindrance. TimeGPT, a new foundation model based on Transformer architecture, demonstrates the potential of deep learning to outperform traditional and other machine learning models in forecasting by learning from a vast, diverse dataset and minimizing forecasting error.
TimeGPT-1