Zak Jost’s Post

Recently I've been playing with "Swarm Intelligence", wondering in the back of my mind how I might apply it to my "day job" and merge it with modern methods in the age of transformers, etc. Today I came across a paper (https://lnkd.in/g_bUdGB7) in the TL;DR newsletter that uses many inferences from small models to generate much better solutions than any individual attempt, at 3x lower cost than using a large model. And it does this merely by generating several samples independently, without any interaction among the models.

What if we instead applied a swarm intelligence algorithm to this approach, so that inference N was informed by all the prior inferences? Would we better navigate the exploit/explore trade-off? This is the sort of place my video series will be heading.

Relevant quote from the abstract: "...when it comes to inference, we often limit the amount of compute to only one attempt per problem. Here, we explore inference compute as another axis for scaling by increasing the number of generated samples. Across multiple tasks and models, we observe that coverage - the fraction of problems solved by any attempt - scales with the number of samples over four orders of magnitude. In domains like coding and formal proofs, where all answers can be automatically verified, these increases in coverage directly translate into improved performance. When we apply repeated sampling to SWE-bench Lite, the fraction of issues solved with DeepSeek-V2-Coder-Instruct increases from 15.9% with one sample to 56% with 250 samples, outperforming the single-attempt state-of-the-art of 43% which uses more capable frontier models."
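The repeated-sampling idea can be sketched as a toy simulation (this is not the paper's code; the success probability `p_correct` and the pass/fail verifier are hypothetical stand-ins for a model attempt checked by unit tests or a proof checker):

```python
import random

def attempt_passes(rng, p_correct=0.05):
    # Hypothetical stand-in for one model inference: True if the
    # generated answer passes an automatic verifier (e.g. unit tests).
    return rng.random() < p_correct

def coverage(n_samples, n_problems=1000, p_correct=0.05, seed=0):
    # Coverage = fraction of problems solved by ANY of n_samples
    # independent attempts.
    rng = random.Random(seed)
    solved = sum(
        any(attempt_passes(rng, p_correct) for _ in range(n_samples))
        for _ in range(n_problems)
    )
    return solved / n_problems

# With independent samples, expected coverage is 1 - (1 - p)^n,
# which keeps climbing across orders of magnitude of n.
for n in (1, 10, 100):
    print(n, coverage(n))
```

A swarm-style variant would break the independence assumption above, conditioning attempt N on the prior attempts instead of drawing them i.i.d.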

Large Language Monkeys: Scaling Inference Compute with Repeated Sampling

arxiv.org

Francesco Pappone

I think sampling each query many times is actually correct (so not having individual models from the swarm interact). On the other hand, when such models are called dynamically in a sequence of steps, it could be interesting to model them with a topology instead of as individual, distinct samples. Of course this begs the question: at what point do you insert the interaction? Is it just at the end of each answer, or at each token prediction? (After all, one could think of an autoregressive rollout as a sequence of steps conditioned on context.)
