Zak Jost’s Post

Recently I've been playing with "Swarm Intelligence," wondering in the back of my mind how I might apply it to my "day job" and merge it with modern methods in the age of transformers, etc. Today I came across a paper (https://lnkd.in/g_bUdGB7) in the TL;DR newsletter that uses many inference calls to small models to generate much better solutions than any single attempt, at 3x lower cost than using a large model. And it does this merely by generating samples independently, with no interaction among the models. What if we instead applied a swarm intelligence algorithm to this approach, so that inference N was informed by all prior inferences? Would we navigate the explore/exploit trade-off better? This is the sort of place my video series will be heading. Relevant quote from the abstract: "...when it comes to inference, we often limit the amount of compute to only one attempt per problem. Here, we explore inference compute as another axis for scaling by increasing the number of generated samples. Across multiple tasks and models, we observe that coverage - the fraction of problems solved by any attempt - scales with the number of samples over four orders of magnitude. In domains like coding and formal proofs, where all answers can be automatically verified, these increases in coverage directly translate into improved performance. When we apply repeated sampling to SWE-bench Lite, the fraction of issues solved with DeepSeek-V2-Coder-Instruct increases from 15.9% with one sample to 56% with 250 samples, outperforming the single-attempt state-of-the-art of 43% which uses more capable frontier models."
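To make the swarm-informed variant concrete, here is a minimal, hypothetical sketch (not from the paper): `call_model` and `verify` are placeholder stubs for a small-model API call and an automatic verifier, and the only change from the paper's independent repeated sampling is that later attempts may be conditioned on the best-scoring earlier ones.

```python
import random

# Hypothetical sketch: `call_model` and `verify` are placeholder stubs for a
# small-model API call and an automatic verifier (unit tests, proof checker, etc.).
def call_model(prompt: str, temperature: float) -> str:
    """Stand-in for an LLM call; swap in a real client (OpenAI, vLLM, etc.)."""
    return f"candidate(seed={random.random():.3f}, T={temperature})"

def verify(candidate: str) -> float:
    """Stand-in verifier returning a score in [0, 1]."""
    return random.random()

def swarm_sample(problem: str, n_samples: int = 250, k_best: int = 3,
                 explore_prob: float = 0.3) -> str:
    """Repeated sampling where attempt N can be conditioned on the best prior attempts."""
    archive: list[tuple[float, str]] = []  # (score, candidate), the "pheromone trail"
    for _ in range(n_samples):
        if archive and random.random() > explore_prob:
            # Exploit: show the model the strongest attempts so far.
            best = [c for _, c in sorted(archive, reverse=True)[:k_best]]
            prompt = problem + "\n\nPromising prior attempts:\n" + "\n".join(best)
            temperature = 0.4
        else:
            # Explore: an independent, higher-temperature sample, as in the paper.
            prompt, temperature = problem, 1.0
        candidate = call_model(prompt, temperature)
        archive.append((verify(candidate), candidate))
    return max(archive)[1]  # best-scoring candidate

print(swarm_sample("toy problem statement"))
```

Whether the conditioned attempts actually improve coverage, or just collapse diversity, is exactly the explore/exploit question the post raises.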
More Relevant Posts
-
As large language models get larger and larger, and access to compute becomes even more competitive, optimization techniques are more important than ever. Speculative decoding is an optimization technique for inference that makes educated guesses about future tokens while generating the current token, all within a single forward pass. Learn more in “A Hitchhiker’s Guide to Speculative Decoding” from the PyTorch blog:
A Hitchhiker's Guide to Speculative Decoding
pytorch.org
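The blog describes IBM's own implementation; as a quick way to experiment with the same idea, here is a sketch using Hugging Face transformers' assisted generation, where a small draft model proposes tokens and the larger model verifies them. Model names are illustrative, the draft must share the target's vocabulary, and a GPU with accelerate installed is assumed.

```python
# Sketch of speculative (assisted) decoding via Hugging Face transformers.
# Illustrative model pair; the draft model must share the target's tokenizer.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

target_name, draft_name = "facebook/opt-6.7b", "facebook/opt-125m"

tokenizer = AutoTokenizer.from_pretrained(target_name)
target = AutoModelForCausalLM.from_pretrained(
    target_name, torch_dtype=torch.float16, device_map="auto")
draft = AutoModelForCausalLM.from_pretrained(
    draft_name, torch_dtype=torch.float16, device_map="auto")

inputs = tokenizer("Speculative decoding works by", return_tensors="pt").to(target.device)

# The small draft model proposes several future tokens; the large model verifies
# them in a single forward pass and keeps the accepted prefix.
outputs = target.generate(**inputs, assistant_model=draft, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```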
-
Fine-tune LLMs with speculative decoding. The IBM PyTorch team used Hugging Face’s TGI library, forking it and adapting the code. The write-up with the key results is available on the official PyTorch blog. Blog: https://lnkd.in/g42iyfJY #LLMs #finetune #pytorch #ibm #SpeculativeDecoding
A Hitchhiker's Guide to Speculative Decoding
pytorch.org
-
New Substack post: https://lnkd.in/gJjwcNB7

Meta’s new paper “Byte Latent Transformer: Patches Scale Better Than Tokens” eliminates LLMs’ dirtiest secret: tokenization.

Previous work on eliminating tokenization has largely framed the central challenge as mitigating the fact that naively operating on raw bytes results in longer sequences, and therefore extra compute for the same task. This is a losing battle, though, as the wasted compute overshadows any minor benefits.

The Byte Latent Transformer flips the problem on its head and shows how the flexibility of byte modeling allows for shorter sequences and more efficient compute allocation compared to using a tokenizer. By showing itself to be more dynamic and compute-efficient, the Byte Latent Transformer opens the door to becoming a full-blown replacement for tokenization rather than a niche alternative for specific use cases.

However, even if the Byte Latent Transformer in its current form does end tokenization, my personal prediction is that the legacy of tokenization will live on as long as auto-regressive language modeling and the transformer live on in their current form.
The End of Tokenization
ronaldyu.substack.com
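For rough intuition about the "dynamic patching" idea, here is a toy sketch that is not the paper's method (BLT uses a small byte-level LM's next-byte entropy to place patch boundaries): a simple unigram surprisal stands in for that model, so predictable spans become long, cheap patches while surprising bytes start new, short patches.

```python
# Toy illustration of dynamic byte patching (not the paper's mechanism):
# start a new patch whenever the next byte looks "surprising" under a simple
# unigram model of the sequence; cap patch length so nothing grows unbounded.
import math
from collections import Counter

def byte_patches(data: bytes, threshold: float = 4.0, max_patch: int = 16):
    counts = Counter(data)
    total = len(data)
    # Per-byte surprisal in bits under the unigram model.
    surprisal = {b: -math.log2(counts[b] / total) for b in counts}
    patches, current = [], bytearray()
    for b in data:
        if current and (surprisal[b] > threshold or len(current) >= max_patch):
            patches.append(bytes(current))
            current = bytearray()
        current.append(b)
    if current:
        patches.append(bytes(current))
    return patches

# Repetitive spans form long patches; rare bytes trigger short ones.
data = b"aaaaaaaaaaaaaaaa aaaaaaaaaaaaaaaa surprise: XYZ!"
for p in byte_patches(data):
    print(p)
```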
-
The reasoning-in-LLMs trifecta: interested in whether LLMs can reason? Read these three papers in order. Ask me if you have any questions about the content.

1) https://lnkd.in/g_D_XUcM from OpenAI. "Grokking: Generalization Beyond Overfitting on Small Algorithmic Datasets." The first paper on the phenomenon known as "grokking," where models learn to generalize if trained 1000x beyond the point where they perform well on in-distribution tests.

2) https://lnkd.in/gTKhMdin. "Grokked Transformers are Implicit Reasoners: A Mechanistic Journey to the Edge of Generalization." A systematic exploration of compositional and comparative reasoning.

3) https://lnkd.in/g89KHp9p from Apple. "GSM-Symbolic: Understanding the Limitations of Mathematical Reasoning in Large Language Models." This is the paper behind the clickbait headlines claiming Apple has shown LLMs cannot reason. Having read #1 and #2, you will be in a better position to evaluate the claims.

In traditional computation, "reasoning" is embodied in systems that perform pattern matching, potentially recursively (think Prolog), to find the boolean logic tree that matches the data. My sense is that, particularly for models without recurrence, symbolic reasoning circuits may emerge (though perhaps only in the grokking phase); if they do, they must essentially be unrolled loops of the kind a recursive pattern-match search would have performed. Therefore, the degree to which such models can reason is finite and tied to the unrolled loop depth (a toy sketch after the link below makes this concrete). Thoughts?

PS: Reading these papers, you will also learn about the "logit lens" and "causal tracing," important tools in the mechanistic interpretability of models.

FOLLOW-UP: Grokking should perhaps be understood as a phenomenon seen when training models in a certain Goldilocks zone of model size vs. training-set characteristics (size, content, synthetic vs. wild, etc.), relevant to achieving generalization/systematization in that setting, rather than as a necessary methodology for achieving the same in SoTA-sized models trained on typical commercial datasets.
GSM-Symbolic: Understanding the Limitations of Mathematical Reasoning in Large Language Models
arxiv.org
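Here is the toy sketch referenced above (mine, not from any of the three papers): a recursive matcher can chase a relation to arbitrary depth, while a fixed-depth "unrolled" version, like a fixed stack of transformer layers, can only compose a bounded number of hops.

```python
# Toy illustration of the "unrolled loop" point about bounded reasoning depth.
PARENT = {"a": "b", "b": "c", "c": "d", "d": "e", "e": "f"}

def ancestor_recursive(x: str, y: str) -> bool:
    """Recursive pattern matching: depth is unbounded."""
    nxt = PARENT.get(x)
    return nxt == y or (nxt is not None and ancestor_recursive(nxt, y))

def ancestor_unrolled(x: str, y: str, depth: int = 3) -> bool:
    """The same search unrolled to a fixed number of hops."""
    cur = x
    for _ in range(depth):
        cur = PARENT.get(cur)
        if cur is None:
            return False
        if cur == y:
            return True
    return False

print(ancestor_recursive("a", "f"))  # True: the chain a->b->c->d->e->f needs 5 hops
print(ancestor_unrolled("a", "f"))   # False: more hops than the unrolling allows
```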
-
RE: LLMs and the current hype train:
1) https://lnkd.in/dvS5UwhH
2) Extremely underrated: https://lnkd.in/dM9wx3GG

Hardware advances alone are likely capable of taking us to an AGI or ASI equivalent (not actualized - an important nuance) within under a decade. If I had to place bets, it'll happen in the next 18 months. Side note: quite a few big names are holding off on releasing new models until the current US presidential election cycle is over.

Software optimizations like the one linked, or anything else that uses params/tokens more efficiently, would do it. If any kind of breakthrough is made in matrix multiplication (math, software, or hardware), it's instant game over.

A great napkin-math formula for evaluating a model's "power" is: square root of (parameters x tokens) / 300. Right now, Claude 3 Opus scores about 29.8 on that formula. Gemini 1.5 Pro (and, interestingly enough, Gemini 1.0 Ultra) both come out to 22.4. (A worked example of the arithmetic follows after the link below.)

Another side note: streaming LLMs and some RNNs technically have an infinite context window - never mind the token recall, though. Given the current primitive state of LLMs, token-recall curves are going to improve rapidly - a metric arguably much more important than the context window.

Keep in mind, the datasets used for training are still relatively minuscule; GPT-4's was around 40TB. As these keep being refined and grow in both quantity and quality, another near-exponential mark of progress can be expected. Etc., etc., you get the idea. </rant>
Consistency Large Language Models: A Family of Efficient Parallel Decoders
hao-ai-lab.github.io
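The napkin formula made explicit. The post does not state its units or the parameter/token counts it assumed; the values below are back-solved guesses (parameters and training tokens both measured in billions, roughly 2T parameters and 40T tokens) that happen to reproduce the quoted 29.8, shown only to make the arithmetic clear.

```python
# The post's napkin-math formula: sqrt(parameters x tokens) / 300.
# Units and counts below are assumptions, not published figures.
from math import sqrt

def napkin_power(params_b: float, tokens_b: float) -> float:
    """Both arguments in billions."""
    return sqrt(params_b * tokens_b) / 300

print(round(napkin_power(2_000, 40_000), 1))  # ~29.8, the figure the post quotes for Claude 3 Opus
```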
-
Have you ever wondered why your Retrieval-Augmented Generation (RAG) pipeline is not working the way you wanted it to? We have all been there. In this article, our Machine Learning Engineer Hamza explains how to enhance RAG performance using contextualized embeddings to improve document retrieval. Article link: https://lnkd.in/daAu8zhA
Enhancing RAG Performance Through Contextualized Embeddings
medium.com
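The article has the full recipe; as a generic illustration of the idea (not necessarily the article's exact method), each chunk can be embedded together with a short document-level context string so that otherwise ambiguous chunks stay retrievable. The model choice and example data below are illustrative.

```python
# Generic sketch of contextualized chunk embeddings for RAG: prepend a
# document-level context string to each chunk before embedding.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # small open model, illustrative choice

doc_context = "ACME Corp Q3 2024 earnings report"  # made-up example document
chunks = [
    "Revenue grew 12% quarter over quarter.",
    "The board approved a $50M buyback.",
]

# Plain vs. contextualized embeddings for the same chunks.
plain = model.encode(chunks)
contextual = model.encode([f"{doc_context}. {c}" for c in chunks])

query = model.encode("How did ACME's Q3 revenue change?")
print("plain:      ", util.cos_sim(query, plain))
print("contextual: ", util.cos_sim(query, contextual))
```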
-
(Large) language models started with millions of parameters and quickly grew to trillions (OpenAI's GPT-4 and Anthropic's Opus reportedly have ~2T). Most people interact with these models through APIs, where the computational heavy lifting happens on servers that can scale resources; run locally, they can put a real strain on the hardware. One effective way to reduce this computational demand is quantization. In simple terms, quantization reduces the number of bits needed to represent information, lowering the precision used for mathematical operations - e.g., INT8 instead of the industry-standard FP32. This helps reduce memory bandwidth, storage 💾 usage, and battery 🔋 usage if deployed on mobile devices. Maarten Grootendorst wrote a no-nonsense visual guide to quantization - a must-read for anyone who wants to understand quantization in detail. Link to the article - https://lnkd.in/gpsrU8cU
A Visual Guide to Quantization
newsletter.maartengrootendorst.com
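For intuition, here is a minimal absmax INT8 quantization sketch showing where the memory savings come from; real libraries add per-channel scales, calibration, and fused INT8 kernels.

```python
# Symmetric (absmax) INT8 quantization of a weight tensor.
import numpy as np

def quantize_int8(w: np.ndarray):
    scale = np.abs(w).max() / 127.0                      # map largest magnitude to 127
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

w = np.random.randn(4, 4).astype(np.float32)
q, scale = quantize_int8(w)
print("bytes fp32 vs int8:", w.nbytes, q.nbytes)          # 4x smaller
print("max abs error:", np.abs(w - dequantize(q, scale)).max())
```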
-
🚀 Exciting Breakthrough in Quantitative Finance: Introducing xLSTM! 🚀

It's crucial to stay on top of technological developments in the rapidly evolving field of quantitative finance. I'm excited to discuss xLSTM (Extended Long Short-Term Memory), a promising advancement in neural network architectures that I recently learned about and that has a lot of potential for our field.

🔍 What is xLSTM?
xLSTM is an enhanced version of the traditional LSTM model, which has been a widely used approach for time-series regression tasks. The original authors of LSTM have now introduced exponential gating and novel memory structures, making xLSTM a game-changer.

🌟 Key Innovations:
Exponential Gating: improves control over information flow, dynamically revising stored values.
Memory Structures:
  sLSTM: enhances sequential dependency handling with scalar memory and new mixing techniques.
  mLSTM: introduces matrix memory, enabling parallel operations and superior storage capacity.
Residual Block Integration: stacks these advanced memory units into robust xLSTM architectures, outperforming current state-of-the-art models like transformers.

💡 Why is this Important?
For quantitative finance, the ability to model complex time-series data accurately is of paramount importance. xLSTM's advancements make it particularly well suited for:
High-frequency trading: better prediction accuracy with long-term dependencies.
Risk management: enhanced modeling of rare events and anomalies.
Algorithmic trading: superior performance in processing large volumes of sequential data.

Original article: https://lnkd.in/eqMpzAcb 📄

Share your thoughts in the comments. Should I write about Python packages for applying this model and give some examples of using it? 🤔

#QuantitativeFinance #DeepLearning #NeuralNetworks #LSTM #xLSTM #TimeSeriesAnalysis #Innovation #FinanceTech
xLSTM: Extended Long Short-Term Memory (arXiv:2405.04517)
arxiv.org
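For intuition only, here is a heavily simplified scalar sketch of the exponential gating plus normalizer/stabilizer states described for sLSTM in the paper; it is not the official implementation and omits the block structure, multiple heads, and the mLSTM matrix memory. Toy weights are random.

```python
# Toy scalar sketch of sLSTM-style exponential gating (simplified from the paper):
# exponential input/forget gates, a normalizer state n, and a stabilizer state m
# that keeps exp() in a safe numeric range.
import numpy as np

rng = np.random.default_rng(0)
W = {g: rng.normal(scale=0.5, size=2) for g in "zifo"}    # [w_x, w_h] per gate, toy weights

def slstm_step(x, h, c, n, m):
    pre = {g: W[g][0] * x + W[g][1] * h for g in "zifo"}  # gate pre-activations
    z = np.tanh(pre["z"])                                 # cell input
    o = 1 / (1 + np.exp(-pre["o"]))                       # output gate stays sigmoid
    m_new = max(pre["f"] + m, pre["i"])                   # stabilizer state
    i = np.exp(pre["i"] - m_new)                          # exponential input gate
    f = np.exp(pre["f"] + m - m_new)                      # exponential forget gate
    c = f * c + i * z                                     # cell state
    n = f * n + i                                         # normalizer state
    h = o * (c / n)                                       # normalized hidden state
    return h, c, n, m_new

h = c = n = 0.0
m = -np.inf
for x in [0.3, -1.2, 0.7, 2.0]:
    h, c, n, m = slstm_step(x, h, c, n, m)
    print(f"h={h:+.4f}")
```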
Co-founder & CEO at AI Sparks | Vice President and Co-founder of Exo | MBA @CDI | Alta Scuola Politecnica XVIII | PhD student Sapienza - PSTP Technoscience
I think sampling each query many times is actually correct (so not having individual models from the swarm interact). On the other hand, when such models are called dynamically in a sequence of steps, it could be interesting to model them with a topology instead of individual distinct samples. Of course, this begs the question: at what point do you insert the interaction? Is it just at the end of each answer, or is it at each token prediction (after all, one could think of an autoregressive rollout as a sequence of steps conditioned on context)?