Zak Jost’s Post

Recently I've been playing with "Swarm Intelligence," wondering in the back of my mind how I might apply it to my "day job" and merge it with modern methods in the age of transformers, etc. Today I came across a paper (https://lnkd.in/g_bUdGB7) in the TL;DR newsletter that uses many inference calls to small models to generate much better solutions than any single attempt, at 3x lower cost than using a large model. And it does this merely by generating samples independently, with no interaction among the models. What if we instead applied a swarm intelligence algorithm to this approach, so that inference N was informed by all prior inferences? Would we navigate the explore/exploit trade-off better? This is the sort of place my video series will be heading. Relevant quote from the abstract: "...when it comes to inference, we often limit the amount of compute to only one attempt per problem. Here, we explore inference compute as another axis for scaling by increasing the number of generated samples. Across multiple tasks and models, we observe that coverage - the fraction of problems solved by any attempt - scales with the number of samples over four orders of magnitude. In domains like coding and formal proofs, where all answers can be automatically verified, these increases in coverage directly translate into improved performance. When we apply repeated sampling to SWE-bench Lite, the fraction of issues solved with DeepSeek-V2-Coder-Instruct increases from 15.9% with one sample to 56% with 250 samples, outperforming the single-attempt state-of-the-art of 43% which uses more capable frontier models."
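To make the swarm-informed variant concrete, here is a minimal, hypothetical sketch (not from the paper): `call_model` and `verify` are placeholder stubs for a small-model API call and an automatic verifier, and the only change from the paper's independent repeated sampling is that later attempts may be conditioned on the best-scoring earlier ones.

```python
import random

# Hypothetical sketch: `call_model` and `verify` are placeholder stubs for a
# small-model API call and an automatic verifier (unit tests, proof checker, etc.).
def call_model(prompt: str, temperature: float) -> str:
    """Stand-in for an LLM call; swap in a real client (OpenAI, vLLM, etc.)."""
    return f"candidate(seed={random.random():.3f}, T={temperature})"

def verify(candidate: str) -> float:
    """Stand-in verifier returning a score in [0, 1]."""
    return random.random()

def swarm_sample(problem: str, n_samples: int = 250, k_best: int = 3,
                 explore_prob: float = 0.3) -> str:
    """Repeated sampling where attempt N can be conditioned on the best prior attempts."""
    archive: list[tuple[float, str]] = []  # (score, candidate), the "pheromone trail"
    for _ in range(n_samples):
        if archive and random.random() > explore_prob:
            # Exploit: show the model the strongest attempts so far.
            best = [c for _, c in sorted(archive, reverse=True)[:k_best]]
            prompt = problem + "\n\nPromising prior attempts:\n" + "\n".join(best)
            temperature = 0.4
        else:
            # Explore: an independent, higher-temperature sample, as in the paper.
            prompt, temperature = problem, 1.0
        candidate = call_model(prompt, temperature)
        archive.append((verify(candidate), candidate))
    return max(archive)[1]  # best-scoring candidate

print(swarm_sample("toy problem statement"))
```

Whether the conditioned attempts actually improve coverage, or just collapse diversity, is exactly the explore/exploit question the post raises.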
More Relevant Posts
-
As large language models get larger and larger, and access to compute becomes even more competitive, optimization techniques are more important than ever. Speculative decoding is an optimization technique for inference that makes educated guesses about future tokens while generating the current token, all within a single forward pass. Learn more in “A Hitchhiker’s Guide to Speculative Decoding” from the PyTorch blog:
A Hitchhiker's Guide to Speculative Decoding
pytorch.org
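The blog describes IBM's own implementation; as a quick way to experiment with the same idea, here is a sketch using Hugging Face transformers' assisted generation, where a small draft model proposes tokens and the larger model verifies them. Model names are illustrative, the draft must share the target's vocabulary, and a GPU with accelerate installed is assumed.

```python
# Sketch of speculative (assisted) decoding via Hugging Face transformers.
# Illustrative model pair; the draft model must share the target's tokenizer.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

target_name, draft_name = "facebook/opt-6.7b", "facebook/opt-125m"

tokenizer = AutoTokenizer.from_pretrained(target_name)
target = AutoModelForCausalLM.from_pretrained(
    target_name, torch_dtype=torch.float16, device_map="auto")
draft = AutoModelForCausalLM.from_pretrained(
    draft_name, torch_dtype=torch.float16, device_map="auto")

inputs = tokenizer("Speculative decoding works by", return_tensors="pt").to(target.device)

# The small draft model proposes several future tokens; the large model verifies
# them in a single forward pass and keeps the accepted prefix.
outputs = target.generate(**inputs, assistant_model=draft, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```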
-
Fine-tune LLMs with speculative decoding. The IBM PyTorch team used Hugging Face’s TGI library, forking it and adapting the code. The write-up with the key results is available on the official PyTorch blog. Blog: https://lnkd.in/g42iyfJY #LLMs #finetune #pytorch #ibm #SpeculativeDecoding
A Hitchhiker's Guide to Speculative Decoding
pytorch.org
-
New Substack post: https://lnkd.in/gJjwcNB7

Meta’s new paper “Byte Latent Transformer: Patches Scale Better Than Tokens” eliminates LLMs’ dirtiest secret: tokenization.

Previous work on eliminating tokenization has largely framed the central challenge as mitigating the fact that naively operating on raw bytes results in longer sequences, and therefore extra compute for the same task. This is a losing battle, though, as the wasted compute overshadows any minor benefits.

The Byte Latent Transformer flips the problem on its head and shows how the flexibility of byte modeling allows for shorter sequences and more efficient compute allocation compared to using a tokenizer. By showing itself to be more dynamic and compute-efficient, the Byte Latent Transformer opens the door to becoming a full-blown replacement for tokenization rather than a niche alternative for specific use cases.

However, even if the Byte Latent Transformer in its current form does end tokenization, my personal prediction is that the legacy of tokenization will live on as long as auto-regressive language modeling and the transformer live on in their current form.
The End of Tokenization
ronaldyu.substack.com
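For rough intuition about the "dynamic patching" idea, here is a toy sketch that is not the paper's method (BLT uses a small byte-level LM's next-byte entropy to place patch boundaries): a simple unigram surprisal stands in for that model, so predictable spans become long, cheap patches while surprising bytes start new, short patches.

```python
# Toy illustration of dynamic byte patching (not the paper's mechanism):
# start a new patch whenever the next byte looks "surprising" under a simple
# unigram model of the sequence; cap patch length so nothing grows unbounded.
import math
from collections import Counter

def byte_patches(data: bytes, threshold: float = 4.0, max_patch: int = 16):
    counts = Counter(data)
    total = len(data)
    # Per-byte surprisal in bits under the unigram model.
    surprisal = {b: -math.log2(counts[b] / total) for b in counts}
    patches, current = [], bytearray()
    for b in data:
        if current and (surprisal[b] > threshold or len(current) >= max_patch):
            patches.append(bytes(current))
            current = bytearray()
        current.append(b)
    if current:
        patches.append(bytes(current))
    return patches

# Repetitive spans form long patches; rare bytes trigger short ones.
data = b"aaaaaaaaaaaaaaaa aaaaaaaaaaaaaaaa surprise: XYZ!"
for p in byte_patches(data):
    print(p)
```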
-
The reasoning-in-LLMs trifecta: interested in whether LLMs can reason? Read these three papers in order. Ask me if you have any questions about the content.

1) https://lnkd.in/g_D_XUcM from OpenAI. "Grokking: Generalization Beyond Overfitting on Small Algorithmic Datasets." The first paper on the phenomenon known as "grokking," where models learn to generalize if trained 1000x beyond the point where they perform well on in-distribution tests.

2) https://lnkd.in/gTKhMdin. "Grokked Transformers are Implicit Reasoners: A Mechanistic Journey to the Edge of Generalization." A systematic exploration of compositional and comparative reasoning.

3) https://lnkd.in/g89KHp9p from Apple. "GSM-Symbolic: Understanding the Limitations of Mathematical Reasoning in Large Language Models." This is the paper behind the clickbait headlines claiming Apple has shown LLMs cannot reason. Having read #1 and #2, you will be in a better position to evaluate the claims.

In traditional computation, "reasoning" is embodied in systems that perform pattern matching, potentially recursively (think Prolog), to find the boolean logic tree that matches the data. My sense is that, particularly for models without recurrence, symbolic reasoning circuits may emerge (though perhaps only in the grokking phase); if they do, they must essentially be unrolled loops of the kind a recursive pattern-match search would have performed. Therefore, the degree to which such models can reason is finite and tied to the unrolled loop depth (a toy sketch after the link below makes this concrete). Thoughts?

PS: Reading these papers, you will also learn about the "logit lens" and "causal tracing," important tools in the mechanistic interpretability of models.

FOLLOW-UP: Grokking should perhaps be understood as a phenomenon seen when training models in a certain Goldilocks zone of model size vs. training-set characteristics (size, content, synthetic vs. wild, etc.), relevant to achieving generalization/systematization in that setting, rather than as a necessary methodology for achieving the same in SoTA-sized models trained on typical commercial datasets.
GSM-Symbolic: Understanding the Limitations of Mathematical Reasoning in Large Language Models
arxiv.org
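Here is the toy sketch referenced above (mine, not from any of the three papers): a recursive matcher can chase a relation to arbitrary depth, while a fixed-depth "unrolled" version, like a fixed stack of transformer layers, can only compose a bounded number of hops.

```python
# Toy illustration of the "unrolled loop" point about bounded reasoning depth.
PARENT = {"a": "b", "b": "c", "c": "d", "d": "e", "e": "f"}

def ancestor_recursive(x: str, y: str) -> bool:
    """Recursive pattern matching: depth is unbounded."""
    nxt = PARENT.get(x)
    return nxt == y or (nxt is not None and ancestor_recursive(nxt, y))

def ancestor_unrolled(x: str, y: str, depth: int = 3) -> bool:
    """The same search unrolled to a fixed number of hops."""
    cur = x
    for _ in range(depth):
        cur = PARENT.get(cur)
        if cur is None:
            return False
        if cur == y:
            return True
    return False

print(ancestor_recursive("a", "f"))  # True: the chain a->b->c->d->e->f needs 5 hops
print(ancestor_unrolled("a", "f"))   # False: more hops than the unrolling allows
```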
-
RE: LLMs and the current hype train:
1) https://lnkd.in/dvS5UwhH
2) Extremely underrated: https://lnkd.in/dM9wx3GG

Hardware advances alone are likely capable of taking us to an AGI or ASI equivalent (not actualized - an important nuance) within under a decade. If I had to place bets, it'll happen in the next 18 months. Side note: quite a few big names are holding off on releasing new models until the current US presidential election cycle is over.

Software optimizations like the one linked, or anything else that uses params/tokens more efficiently, would do it. If any kind of breakthrough is made in matrix multiplication (math, software, or hardware), it's instant game over.

A great napkin-math formula for evaluating a model's "power" is: square root of (parameters x tokens) / 300. Right now, Claude 3 Opus scores about 29.8 on that formula. Gemini 1.5 Pro (and, interestingly enough, Gemini 1.0 Ultra) both come out to 22.4. (A worked example of the arithmetic follows after the link below.)

Another side note: streaming LLMs and some RNNs technically have an infinite context window - never mind the token recall, though. Given the current primitive state of LLMs, token-recall curves are going to improve rapidly - a metric arguably much more important than the context window.

Keep in mind, the datasets used for training are still relatively minuscule; GPT-4's was around 40TB. As these keep being refined and grow in both quantity and quality, another near-exponential mark of progress can be expected. Etc., etc., you get the idea. </rant>
Consistency Large Language Models: A Family of Efficient Parallel Decoders
hao-ai-lab.github.io
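The napkin formula made explicit. The post does not state its units or the parameter/token counts it assumed; the values below are back-solved guesses (parameters and training tokens both measured in billions, roughly 2T parameters and 40T tokens) that happen to reproduce the quoted 29.8, shown only to make the arithmetic clear.

```python
# The post's napkin-math formula: sqrt(parameters x tokens) / 300.
# Units and counts below are assumptions, not published figures.
from math import sqrt

def napkin_power(params_b: float, tokens_b: float) -> float:
    """Both arguments in billions."""
    return sqrt(params_b * tokens_b) / 300

print(round(napkin_power(2_000, 40_000), 1))  # ~29.8, the figure the post quotes for Claude 3 Opus
```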
-
Have you ever wondered why your Retrieval-Augmented Generation (RAG) pipeline is not working the way you wanted it to? We have all been there. In this article, our Machine Learning Engineer Hamza explains how to enhance RAG performance using contextualized embeddings to improve document retrieval. Article link: https://lnkd.in/daAu8zhA
Enhancing RAG Performance Through Contextualized Embeddings
medium.com
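The article has the full recipe; as a generic illustration of the idea (not necessarily the article's exact method), each chunk can be embedded together with a short document-level context string so that otherwise ambiguous chunks stay retrievable. The model choice and example data below are illustrative.

```python
# Generic sketch of contextualized chunk embeddings for RAG: prepend a
# document-level context string to each chunk before embedding.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # small open model, illustrative choice

doc_context = "ACME Corp Q3 2024 earnings report"  # made-up example document
chunks = [
    "Revenue grew 12% quarter over quarter.",
    "The board approved a $50M buyback.",
]

# Plain vs. contextualized embeddings for the same chunks.
plain = model.encode(chunks)
contextual = model.encode([f"{doc_context}. {c}" for c in chunks])

query = model.encode("How did ACME's Q3 revenue change?")
print("plain:      ", util.cos_sim(query, plain))
print("contextual: ", util.cos_sim(query, contextual))
```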
-
(Large) language models started with millions of parameters and quickly grew to trillions (OpenAI's GPT-4 and Anthropic's Opus reportedly have ~2T). Most people interact with these models through APIs, where the computational heavy lifting happens on servers that can scale resources; run locally, they can put a real strain on the hardware. One effective way to reduce this computational demand is quantization. In simple terms, quantization reduces the number of bits needed to represent information, lowering the precision used for mathematical operations - e.g., INT8 instead of the industry-standard FP32. This helps reduce memory bandwidth, storage 💾 usage, and battery 🔋 usage if deployed on mobile devices. Maarten Grootendorst wrote a no-nonsense visual guide to quantization - a must-read for anyone who wants to understand quantization in detail. Link to the article - https://lnkd.in/gpsrU8cU
A Visual Guide to Quantization
newsletter.maartengrootendorst.com
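For intuition, here is a minimal absmax INT8 quantization sketch showing where the memory savings come from; real libraries add per-channel scales, calibration, and fused INT8 kernels.

```python
# Symmetric (absmax) INT8 quantization of a weight tensor.
import numpy as np

def quantize_int8(w: np.ndarray):
    scale = np.abs(w).max() / 127.0                      # map largest magnitude to 127
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

w = np.random.randn(4, 4).astype(np.float32)
q, scale = quantize_int8(w)
print("bytes fp32 vs int8:", w.nbytes, q.nbytes)          # 4x smaller
print("max abs error:", np.abs(w - dequantize(q, scale)).max())
```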
-
🚀 Exciting Breakthrough in Quantitative Finance: Introducing xLSTM! 🚀

It's crucial to stay on top of technological developments in the rapidly evolving field of quantitative finance. I'm excited to discuss xLSTM (Extended Long Short-Term Memory), a promising advancement in neural network architectures that I recently learned about and that has a lot of potential for our field.

🔍 What is xLSTM?
xLSTM is an enhanced version of the traditional LSTM model, which has been a widely used approach for time-series regression tasks. The original authors of LSTM have now introduced exponential gating and novel memory structures, making xLSTM a game-changer.

🌟 Key Innovations:
Exponential Gating: improves control over information flow, dynamically revising stored values.
Memory Structures:
  sLSTM: enhances sequential dependency handling with scalar memory and new mixing techniques.
  mLSTM: introduces matrix memory, enabling parallel operations and superior storage capacity.
Residual Block Integration: stacks these advanced memory units into robust xLSTM architectures, outperforming current state-of-the-art models like transformers.

💡 Why is this Important?
For quantitative finance, the ability to model complex time-series data accurately is of paramount importance. xLSTM's advancements make it particularly well suited for:
High-frequency trading: better prediction accuracy with long-term dependencies.
Risk management: enhanced modeling of rare events and anomalies.
Algorithmic trading: superior performance in processing large volumes of sequential data.

Original article: https://lnkd.in/eqMpzAcb 📄

Share your thoughts in the comments. Should I write about Python packages for applying this model and give some examples of using it? 🤔

#QuantitativeFinance #DeepLearning #NeuralNetworks #LSTM #xLSTM #TimeSeriesAnalysis #Innovation #FinanceTech
xLSTM: Extended Long Short-Term Memory (arXiv:2405.04517)
arxiv.org
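For intuition only, here is a heavily simplified scalar sketch of the exponential gating plus normalizer/stabilizer states described for sLSTM in the paper; it is not the official implementation and omits the block structure, multiple heads, and the mLSTM matrix memory. Toy weights are random.

```python
# Toy scalar sketch of sLSTM-style exponential gating (simplified from the paper):
# exponential input/forget gates, a normalizer state n, and a stabilizer state m
# that keeps exp() in a safe numeric range.
import numpy as np

rng = np.random.default_rng(0)
W = {g: rng.normal(scale=0.5, size=2) for g in "zifo"}    # [w_x, w_h] per gate, toy weights

def slstm_step(x, h, c, n, m):
    pre = {g: W[g][0] * x + W[g][1] * h for g in "zifo"}  # gate pre-activations
    z = np.tanh(pre["z"])                                 # cell input
    o = 1 / (1 + np.exp(-pre["o"]))                       # output gate stays sigmoid
    m_new = max(pre["f"] + m, pre["i"])                   # stabilizer state
    i = np.exp(pre["i"] - m_new)                          # exponential input gate
    f = np.exp(pre["f"] + m - m_new)                      # exponential forget gate
    c = f * c + i * z                                     # cell state
    n = f * n + i                                         # normalizer state
    h = o * (c / n)                                       # normalized hidden state
    return h, c, n, m_new

h = c = n = 0.0
m = -np.inf
for x in [0.3, -1.2, 0.7, 2.0]:
    h, c, n, m = slstm_step(x, h, c, n, m)
    print(f"h={h:+.4f}")
```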
Co-founder & CEO at AI Sparks | Vice President and Co-founder of Exo | MBA @CDI | Alta Scuola Politecnica XVIII | PhD student Sapienza - PSTP Technoscience
I think sampling each query many times is actually correct (so not having individual models from the swarm interact). On the other hand, when such models are called dynamically in a sequence of steps, it could be interesting to model them with a topology instead of individual distinct samples. Of course, this begs the question: at what point do you insert the interaction? Is it just at the end of each answer, or is it at each token prediction (after all, one could think of an autoregressive rollout as a sequence of steps conditioned on context)?