Beating o1-preview on AIME 2024 with Chain-of-Code reasoning in optillm

In the past week, there has been a flurry of releases of o1-style reasoning models from DeepSeek, Fireworks AI, and Nous Research. In our open-source optimizing inference proxy, optillm (https://lnkd.in/gN6_kNky), we have implemented several techniques that use additional inference-time compute to improve accuracy, and they work with a variety of base models. Today, we are happy to announce that by using the chain-of-code (CoC) plugin in optillm, we are able to beat OpenAI's o1-preview on AIME 2024 (pass@1) using SOTA base models from both Anthropic and Google DeepMind.

For reference, see the original paper that introduced the idea: Chain of Code: Reasoning with a Language Model-Augmented Code Emulator - https://lnkd.in/gtq7hbjx. We wrote an independent implementation of CoC in optillm, as the original source code was not released.
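If you want to try this yourself, the snippet below is a minimal sketch of how a request through the proxy might look. It assumes optillm's usual convention of selecting an approach or plugin by prefixing the model slug (here with "coc-"), the default local port, and a placeholder key for the upstream provider - the model name is just an example, so check the optillm README for the exact slug and configuration.

```python
# Minimal sketch: call the optillm proxy with the chain-of-code plugin.
# Assumptions: the proxy runs locally on its default port, the plugin is
# selected via the "coc-" model prefix, and the API key is forwarded to the
# underlying provider. Verify all three against the optillm README.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",      # the optillm proxy, not api.openai.com
    api_key="sk-your-upstream-provider-key",  # placeholder
)

response = client.chat.completions.create(
    model="coc-claude-3-5-sonnet-20241022",   # "coc-" routes the request through chain-of-code
    messages=[
        {
            "role": "user",
            "content": "Find the number of ordered pairs (a, b) with "
                       "1 <= a, b <= 100 such that a * b is divisible by 6.",
        }
    ],
)
print(response.choices[0].message.content)
```

Because the proxy exposes an OpenAI-compatible endpoint, the same client code can point at different base models simply by changing the slug.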
More Relevant Posts
OpenAI’s latest foundation model, o3, has redefined AI’s ability to tackle complex reasoning tasks, achieving unprecedented performance on the ARC-AGI test—a milestone hailed as a leap forward in machine understanding of novel problems.
Imagine a world where some people cheat, but there’s no way to tell. And then suddenly, there is.

The problem with LLM benchmarks is that they are contaminated. Model makers know the “answers” ahead of time and train on them. This may not even be intentional, but it is hard to avoid, because lots of people write about LLM benchmarks, and the questions get copied into different places that are then swept up in the maw of a model’s web scraper. Whether it happens by accident or on purpose, it is the same as feeding your “test” data into your training set - the same as cheating on a test. You end up with a high test score that doesn’t generalize to new questions.

MIT has a new benchmark called LiveCodeBench that tries to get around this problem. It does so by continuously sourcing fresh questions from code competition sites like LeetCode. The questions are tagged with their release date, and when you test an LLM, you only include problems released after its knowledge cutoff. Voilà.

- GPT-4 and Claude Opus top the new, decontaminated charts.
- DeepSeek shows significant overfitting. This model powers DeepSeek Coder.
- Google’s Gemini Pro is in the middle of the pack.

Paper: https://buff.ly/4aCb9Pw
Code: https://buff.ly/4aUWlLD

#OpenSource #AI #MachineLearning #GPT4 #Benchmarks #LLMBenchmarks #Overfitting #CodingBenchmark #LeetCode #KnowledgeCutoff #ModelEvaluation #CheatingDetection #ContaminatedData #ArtificialIntelligence
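The decontamination step itself is simple enough to sketch in a few lines. The snippet below is only a toy illustration of the idea - the field names and the filter are made up for the example, not LiveCodeBench's actual code:

```python
# Toy sketch of the decontamination idea: only score a model on problems
# released after its training cutoff, so it cannot have seen them.
from datetime import date

# Illustrative problem set; "released" is the contest release date.
problems = [
    {"id": "lc-3101", "released": date(2024, 3, 2)},
    {"id": "lc-2874", "released": date(2023, 11, 18)},
]

def decontaminated(problems, cutoff: date):
    """Keep only problems released after the model's knowledge cutoff."""
    return [p for p in problems if p["released"] > cutoff]

# A model with an April 2023 cutoff is scored only on the 2024 problem;
# the 2023 problem is excluded because it may have leaked into training data.
print(decontaminated(problems, cutoff=date(2023, 4, 30)))
```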
Exciting to see enterprises integrating DeepSeek models into their applications via our portfolio company Recursal.ai's platform Featherless, which runs any Large Language Model (LLM) available on Hugging Face with a click.

DeepSeek-R1 by DeepSeek AI achieves performance comparable to OpenAI o1 across math, code, and reasoning tasks - a SOTA open-source model with open reasoning tokens. Now available for premium users on http://featherless.ai.

DeepSeek-R1 is available on Featherless through:
- Phoenix: https://lnkd.in/dt9aYxfT
- Our API: https://lnkd.in/dNKSkupD
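For developers, the call looks like any other OpenAI-compatible endpoint. The sketch below assumes Featherless's usual convention of using the Hugging Face repo id as the model name and its standard API base URL - confirm both against the Featherless docs before use:

```python
# Minimal sketch of calling DeepSeek-R1 on Featherless via its
# OpenAI-compatible API. Base URL and model id are assumptions based on the
# platform's usual conventions; the API key is a placeholder.
from openai import OpenAI

client = OpenAI(
    base_url="https://api.featherless.ai/v1",
    api_key="your-featherless-api-key",
)

resp = client.chat.completions.create(
    model="deepseek-ai/DeepSeek-R1",  # Hugging Face repo id as the model name
    messages=[{"role": "user", "content": "Prove that the square root of 2 is irrational."}],
)
print(resp.choices[0].message.content)
```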
DeepSeek R1 vs OpenAI o1 - who's leading the AI game?

DeepSeek R1 is turning heads with its efficiency and performance:
- Excels in math (97.3%) and coding (96.3rd percentile).
- Open-source and incredibly cost-effective, with API costs as low as $0.14/1M tokens compared to OpenAI's $7.50/1M.

With its cutting-edge approach and affordability, DeepSeek R1 is proving to be a serious contender in the AI space. What do you think? Could this change the future of AI?
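To put those per-token prices in perspective, here is a quick back-of-the-envelope calculation using only the rates quoted above; real bills depend on cache hits, output tokens, and current pricing pages:

```python
# Back-of-the-envelope cost comparison at the quoted input-token rates.
tokens = 50_000_000  # e.g. 50M input tokens of traffic (illustrative)
deepseek_r1_cost = tokens / 1_000_000 * 0.14   # $0.14 per 1M tokens
openai_o1_cost = tokens / 1_000_000 * 7.50     # $7.50 per 1M tokens
print(f"DeepSeek R1: ${deepseek_r1_cost:,.2f} vs OpenAI o1: ${openai_o1_cost:,.2f}")
```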
I recently tested DeepSeek R1 and OpenAI o1 with two reasoning challenges.

In the first challenge, a deceptively simple pen-counting problem, DeepSeek outperformed OpenAI by correctly understanding family boundaries and contextual relationships. While OpenAI focused purely on the numbers (giving an incorrect answer), DeepSeek demonstrated impressive reasoning to arrive at the correct answer.

The second challenge involved complex spatial reasoning with colored matchboxes. Both models found the correct solution, but OpenAI showed impressive efficiency, completing the task in 40 seconds compared to DeepSeek's 120 seconds. OpenAI's solution was elegantly minimal, while DeepSeek's approach was more thorough but included unnecessary steps.

It's impressive to see DeepSeek's R1 performing on par with o1. [For a detailed analysis of their reasoning patterns and technical insights, read the full post here: https://lnkd.in/g8QUawPG ]
So, that was a crazy weekend! The release of DeepSeek-R1 has jolted the entire AI space, with a novel approach - Group Relative Policy Optimization (GRPO) - that improves reasoning and language alignment while reducing training effort.

How does DeepSeek-R1 compare to OpenAI o1? We used Weave to run evals comparing the two reasoning models, and what did we find? TL;DR: DeepSeek-R1 matched OpenAI o1's accuracy, but R1-14B's lower performance vs. posted benchmarks raised some questions for us.

Want to learn more? We wrote a guide that dives deep into reasoning-model setup and evaluation; check out the link in the comments.
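As a rough illustration of what such an eval looks like, here is a minimal Weave-style sketch. The dataset, model function, and scorer are placeholders rather than the guide's actual code, and the scorer signature may differ across Weave versions:

```python
# Minimal Weave-style evaluation sketch (placeholders throughout).
import asyncio
import weave

weave.init("r1-vs-o1-evals")  # hypothetical project name

# Toy eval set; a real comparison would use a reasoning benchmark.
dataset = [
    {"question": "What is 17 * 24?", "expected": "408"},
    {"question": "What is the capital of Australia?", "expected": "Canberra"},
]

@weave.op()
def exact_match(expected: str, output: str) -> dict:
    # Scores one example: does the model's answer contain the expected string?
    return {"correct": expected.lower() in output.lower()}

@weave.op()
def deepseek_r1(question: str) -> str:
    # Placeholder for a real call to the DeepSeek-R1 API.
    return "408"

evaluation = weave.Evaluation(dataset=dataset, scorers=[exact_match])
asyncio.run(evaluation.evaluate(deepseek_r1))
```

Running the same evaluation against an o1-backed function gives side-by-side accuracy in the Weave dashboard.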
DeepSeek AI just released a fully open-source reasoning model, DeepSeek-R1, which is on par with OpenAI o1.

🚀 Outperforms Claude 3.5 Sonnet and o1-mini on almost all benchmarks.
📖 Open weights
🏆 MIT license
🎵 API outputs can be used for fine-tuning/distillation - very useful, as it bridges the gap between general-purpose models and task-specific needs while mitigating resource constraints.
DeepSeek AI dropped DeepSeek-R1, an open-source AI that's both smart and budget-friendly!

🧠 Brainy match: it's going toe-to-toe with OpenAI's o1 in math and coding, but guess what? It's 90-95% cheaper!
🔓 Open source: you can grab it on Hugging Face with an MIT license, making it super accessible.
💬 Give it a try: test it out on the DeepSeek chat platform and see the magic for yourself.
🔗 More deets: dive in here: https://lnkd.in/ea-UcKZN
🔗 Stay plugged in: don't miss out, get updated by following Botcom
🎯 𝗠𝘂𝗹𝘁𝗶-𝗛𝗲𝗮𝗱 𝗥𝗔𝗚 𝗳𝗿𝗼𝗺 𝗦𝗰𝗿𝗮𝘁𝗰𝗵

Multi-Head RAG built from scratch in Colab, without frameworks like LangChain or LlamaIndex. This example uses OpenAI's text-embedding-ada-002, Meta's Llama 3 8B, and Mistral 7B to create the embedding spaces; you can swap in embedding models of your choice.

📄 Paper - https://lnkd.in/gAPQQYKw
🔨 GitHub - https://lnkd.in/gcRPR9PV
🌟 Check out other Build from Scratch examples - https://lnkd.in/gXXRmc-T

#scratch #MRAG #rag #embeddings #vectordb
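For a flavour of what the notebook does, here is a compact, framework-free sketch of the multi-head retrieve-and-vote loop. The toy embedding functions stand in for the real embedding models listed above, and the voting scheme is deliberately simplified:

```python
# Sketch of multi-head retrieval: embed every chunk in several independent
# embedding spaces, retrieve per space, then merge results with a simple vote.
import numpy as np

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def multi_head_retrieve(query, chunks, embed_fns, k=3):
    votes = {}
    for name, embed in embed_fns.items():  # one "head" per embedding space
        q = embed(query)
        scored = sorted(chunks, key=lambda c: cosine(embed(c), q), reverse=True)[:k]
        for rank, chunk in enumerate(scored):  # higher rank in any space -> more votes
            votes[chunk] = votes.get(chunk, 0.0) + (k - rank)
    return sorted(votes, key=votes.get, reverse=True)[:k]

# Toy embedding "heads" standing in for real embedding models.
embed_fns = {
    "head_a": lambda text: np.array([len(text), text.count("a"), 1.0]),
    "head_b": lambda text: np.array([text.count(" "), len(set(text)), 1.0]),
}
docs = ["multi-head rag", "unrelated text", "rag"]
print(multi_head_retrieve("rag with many heads", docs, embed_fns, k=2))
```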
One of the first things you need to do when #automating #accounting is to #automate the process of entering #data. This innovation converts #image files to #JSON files.
Last week, Hugging Face released Idefics2, its new open-source vision-language model. 🔥

Idefics initially started as a replication of Google DeepMind's Flamingo model, which was never open-sourced. The first version was released about a year ago, along with a dataset called Obelics, and on most benchmarks it was already on par with Flamingo. 🙌

Idefics2 incorporates a lot of recent advancements in the field, making it one of the best (if not the best) vision-language models for its size. The team also released a paper that consolidates all their learnings, titled "What matters when building vision-language models?": https://lnkd.in/eQdrCYUE. Improvements include:

💥 NaViT patching (https://lnkd.in/epCWtNTs): NaViT is a new method developed by Google to cleverly patchify images while maintaining their aspect ratio. This results in higher performance, especially for OCR.

💥 Perceiver resampling (https://lnkd.in/eK5Xtgri): the Perceiver is a really cool model (I contributed that one to the Transformers library) which encodes information into a fixed set of latent variables. This makes it possible to encode many images, or even video, with just a few latents; Idefics2 uses only 64 or 320 tokens to represent the visual information, so it can handle multi-page PDFs, for instance.

💥 The latest and greatest vision and language backbones: Idefics2 is built on top of SigLIP (also from Google) for the vision encoder and Mistral-7B (from Mistral AI) for the language backbone. This gives it a huge boost in performance on various benchmarks.

To celebrate its release, I've created a demo notebook that showcases how to fine-tune Idefics2 for document AI use cases, where you want to go from a receipt image to a JSON, for instance. 🚀

Demo notebook: https://lnkd.in/eRqGinsD
Idefics2 base model: https://lnkd.in/eQFBsYvq
Idefics2 chatty version: https://lnkd.in/e2dPm8wJ
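As a taste of the receipt-to-JSON use case, here is a minimal inference sketch with Idefics2. It follows the usage pattern documented for the model in the Transformers library (processor chat template, then generate); the image path and prompt are placeholders, and the actual fine-tuning recipe lives in the linked notebook:

```python
# Minimal Idefics2 inference sketch for document AI: ask the model to read a
# receipt image and return structured JSON. Requires transformers >= 4.40,
# torch, accelerate, and Pillow; "receipt.jpg" is a placeholder path.
import torch
from PIL import Image
from transformers import AutoProcessor, Idefics2ForConditionalGeneration

model_id = "HuggingFaceM4/idefics2-8b"
processor = AutoProcessor.from_pretrained(model_id)
model = Idefics2ForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

image = Image.open("receipt.jpg")
messages = [{
    "role": "user",
    "content": [
        {"type": "image"},
        {"type": "text", "text": "Extract the merchant, date, and total from this receipt as JSON."},
    ],
}]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(text=prompt, images=[image], return_tensors="pt").to(model.device)
generated = model.generate(**inputs, max_new_tokens=256)
print(processor.batch_decode(generated, skip_special_tokens=True)[0])
```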