Exploring LiveBench: A New Frontier in LLM Benchmarking 🌟
Great news for the AI and machine learning community! Let's dive into LiveBench, a pioneering benchmark developed by researchers aiming to robustly evaluate Large Language Models (LLMs) without the common pitfalls of dataset contamination or biases from human or LLM judging.
🔍 Key Features of LiveBench:
Regularly Updated: LiveBench is dynamic, with questions sourced from the latest math competitions and academic papers, updated monthly.
Objective Evaluation: Uses automatic scoring against objective ground truth, avoiding the biases that can creep in with human or LLM judges.
Comprehensive Testing: Includes a variety of task categories, such as math, coding, reasoning, language, instruction following, and data analysis, designed to thoroughly test LLM capabilities.
📈 The rigorous nature of LiveBench is evident: even the top-performing models achieve below 60% accuracy, illustrating the challenge it poses and its role in driving model improvements.
🌍 Community Involvement: The benchmark invites contributions and collaboration from the global AI community. To get involved or learn how you can use LiveBench for your projects, check out the resources on GitHub or visit the LiveBench.ai leaderboard.
🛠️ Why is LiveBench important? Traditional benchmarks quickly become outdated as they are absorbed into the training datasets of new models. LiveBench addresses this by offering a continually updated, contamination-free testing environment.
👀 Stay tuned for updates and developments from this exciting initiative, which is setting new standards in the evaluation of AI models!
#AI #GenAI #LLM #MachineLearning #DataScience #Benchmarking #TechnologyUpdates #ArtificialIntelligence
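To illustrate what "automatic scoring against objective ground truth" can look like in its simplest form, here is a minimal, hypothetical sketch in Python; LiveBench's actual graders are task-specific and live in its repository:

```python
# Hypothetical sketch of ground-truth scoring for a math-style task.
# LiveBench's real graders are task-specific; this only illustrates the idea
# of judging model output against a known answer instead of using an LLM judge.

def normalize(answer: str) -> str:
    """Lowercase, strip whitespace and trailing punctuation for a fair comparison."""
    return answer.strip().lower().rstrip(".")

def score_response(model_output: str, ground_truth: str) -> float:
    """Return 1.0 for an exact (normalized) match with the reference answer, else 0.0."""
    return 1.0 if normalize(model_output) == normalize(ground_truth) else 0.0

# Example usage with made-up data:
responses = [("42", "42"), ("7", "9")]
accuracy = sum(score_response(m, g) for m, g in responses) / len(responses)
print(f"accuracy = {accuracy:.2f}")  # 0.50
```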
-
Day 2: Embeddings, Vector Databases, and Text Similarity 🧠🔍
Today's Highlights:
1.) Understanding Embeddings: Dove deeper into the conceptual foundations of embeddings and how they represent text as numerical vectors.
2.) Vector Databases and Retrieval: Explored the role of vector stores and databases in bringing live or specialist data into LLM applications.
3.) Text Similarity Workflows: Learned techniques to leverage embeddings for classifying and comparing textual data.
Hands-On Practice:
>> Built a RAG (Retrieval-Augmented Generation) question-answering system over custom documents
>> Experimented with text similarity using embeddings on Kaggle
>> Constructed a neural classification network with Keras, utilizing embeddings as input features
Key Learnings:
1.) Embeddings Power: Understood how embeddings capture semantic relationships between words and documents, enabling powerful text understanding.
2.) Vector Databases: Learned how these specialized databases allow for efficient storage and retrieval of embedding-based data.
3.) Text Similarity Workflows: Explored techniques to measure and apply text similarity, from classification to recommendation systems.
Essential Concepts:
>> Embedding Algorithms (e.g., Word2Vec, GloVe, BERT)
>> Approximate Nearest Neighbor (ANN) Search
>> Cosine Similarity, Euclidean Distance
>> Transfer Learning with Pretrained Embeddings
Prompting Techniques:
>> Retrieval-Augmented Generation (RAG)
>> Embedding-based Information Retrieval
Looking Ahead to Day 3: Tomorrow, we'll dive into the intriguing world of multimodal AI, exploring how we can leverage both text and visual data to unlock new capabilities. Can't wait to see what insights and hands-on skills await!
Let me know if you have any other questions about today's learnings or the 5-Day Gen AI Intensive. I'm happy to discuss further!
#GenerativeAI #AITraining #MachineLearning #PromptEngineering #AIInnovation #LLM #DataScience #Kaggle #AICommunity #TechLearning #Embeddings #VectorDatabases #TextSimilarity
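For anyone following along with the hands-on exercises, here is a small sketch of the cosine-similarity idea from Day 2. It assumes the sentence-transformers package and the all-MiniLM-L6-v2 model purely as an example; any embedding model would work:

```python
# Minimal text-similarity sketch: embed sentences and compare them.
# Assumes `pip install sentence-transformers numpy`; the model choice is just an example.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # small general-purpose embedding model

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine of the angle between two embedding vectors (1.0 = identical direction)."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

sentences = ["How do I reset my password?",
             "I forgot my login credentials.",
             "What is the capital of France?"]
embeddings = model.encode(sentences)  # shape: (3, 384)

print(cosine_similarity(embeddings[0], embeddings[1]))  # related questions -> higher score
print(cosine_similarity(embeddings[0], embeddings[2]))  # unrelated -> lower score
```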
-
🚀 Excited to Share My Latest AI Project! 🚀
I'm thrilled to present my recent work on training a small language model (SLM) inspired by Andrej Karpathy's nanoGPT. This experiment aimed to push the boundaries of what a relatively small model can achieve in terms of generating coherent text.
🔍 Project Highlights:
- Model Size: 123.59 million parameters.
- Datasets: High-quality datasets sourced from Hugging Face.
- Training: Initial training on ~1.4 billion tokens, then fine-tuning to enhance performance on specific tasks.
- Results: The model generates coherent text but faces challenges with instruction-following and information retrieval, demonstrating both the capabilities and the limitations of smaller models.
📂 Explore the project on GitHub: https://lnkd.in/eMSBAQwx
💡 Acknowledgements: A huge thank you to Andrej Karpathy for nanoGPT and to the dataset providers on Hugging Face.
Feel free to fork the repo and send in your pull requests. Let's push the boundaries of AI together! 🌟
#AI #MachineLearning #LanguageModel #DeepLearning #HuggingFace #OpenSource #ArtificialIntelligence #DataScience
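The post doesn't list the architecture, but a budget of roughly 124 million parameters is about what a GPT-2-small-style decoder works out to. Purely as an illustration (the hyperparameters below are my assumption, not necessarily what the repo uses), here is the arithmetic in code:

```python
# Back-of-the-envelope parameter count for a GPT-2-small-style decoder
# (12 layers, 768-dim embeddings, 1024-token context, ~50k vocab).
# These hyperparameters are an assumption; the post doesn't state the exact config.

def gpt_param_count(n_layer=12, n_embd=768, vocab_size=50257, block_size=1024):
    tok_emb = vocab_size * n_embd             # token embedding (weight-tied with the LM head)
    pos_emb = block_size * n_embd             # learned positional embedding
    per_block = (
        2 * (2 * n_embd)                      # two LayerNorms (scale + bias)
        + (n_embd * 3 * n_embd + 3 * n_embd)  # attention QKV projection
        + (n_embd * n_embd + n_embd)          # attention output projection
        + (n_embd * 4 * n_embd + 4 * n_embd)  # MLP up-projection
        + (4 * n_embd * n_embd + n_embd)      # MLP down-projection
    )
    final_ln = 2 * n_embd
    return tok_emb + pos_emb + n_layer * per_block + final_ln

total = gpt_param_count()
print(f"{total / 1e6:.2f}M parameters")  # ~124.4M (about 123.7M excluding position embeddings)
```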
-
🚀 Pi AI Weekly Trends #22 is here!
It's Friday! Get ready to stay ahead with the latest AI breakthroughs, handpicked by our Senior Deep Learning Scientist, Àlex R. Atrio.
This week's highlights:
📚 DiffLM: a framework for synthetic data generation with LLMs. It achieves strong results on several benchmarks for structured generation by combining a Variational Autoencoder trained on a dataset with a diffusion-based language model that generates structured output grounded in the original data. 🌐 https://pischool.link/h5t
🐦 MAGPIE: a promising method for creating high-quality synthetic instruction data by extracting it from instruction-tuned open-weight models like Llama-3-8B-Instruct, whose alignment data is not open-sourced. Synthetic user queries are generated by feeding the instructed LLM only the pre-query template, up to the position reserved for the user message. 🌐 https://pischool.link/7ia
🤖 Adapting While Learning: this method targets the complex issue of solving scientific problems with LLMs by connecting the model with external tools, such as APIs, numerical solvers, or a Python code interpreter. The authors train an LLM to subdivide a given complex scientific problem, label the resulting sub-problems by difficulty, and then choose the most suitable tool for each sub-problem. 🌐 https://pischool.link/0ga
Was this helpful? Let us know by liking and sharing!
#AI #MachineLearning #DeepLearning #FoundationModels #PiAIWeeklyTrends
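To make the MAGPIE idea concrete, here is a minimal, hypothetical sketch using the transformers library. The pre-query string follows the Llama-3 chat format as commonly documented, and the sampling settings are arbitrary; the real MAGPIE pipeline also generates responses and applies extensive filtering:

```python
# Hedged sketch of the MAGPIE idea: give an instruction-tuned, open-weight model
# only its chat template *up to* the user slot, and let it complete a plausible
# user query. Template string and sampling settings are assumptions, not taken
# from the MAGPIE codebase.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Meta-Llama-3-8B-Instruct"  # gated model; requires HF access
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16, device_map="auto")

# Pre-query template: everything that normally precedes the user's message.
pre_query = "<|begin_of_text|><|start_header_id|>user<|end_header_id|>\n\n"

inputs = tokenizer(pre_query, return_tensors="pt", add_special_tokens=False).to(model.device)
out = model.generate(**inputs, max_new_tokens=64, do_sample=True, temperature=1.0)
synthetic_query = tokenizer.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
print(synthetic_query)  # e.g. a plausible user instruction the model "expects" to see
```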
-
🌟 🔬 Groundbreaking Research Alert: Small Language Models Take on the Giants!
The narrative that bigger is always better in AI is being reshaped! 🚀 Microsoft, in collaboration with Peking and Tsinghua Universities, has introduced rStar-Math, a groundbreaking technique for boosting the performance of Small Language Models (SLMs). Using Monte Carlo Tree Search (MCTS) combined with "chain-of-thought" reasoning, rStar-Math enables smaller models to handle complex mathematical reasoning tasks, often matching or outperforming larger models like OpenAI's o1-preview.
🎯 Key achievements include:
✅ 90% accuracy on the MATH benchmark (12,500 questions).
✅ Solved 53.3% of AIME problems, ranking in the top 20% of high school competitors.
✅ Enhanced models like Qwen-1.5B and Qwen-7B to rival larger counterparts.
💡 Why this matters:
1️⃣ Cost Efficiency: Smaller models require fewer computational resources, reducing financial and environmental costs.
2️⃣ Accessibility: Mid-sized organizations and academic researchers gain access to state-of-the-art capabilities without the prohibitive costs of massive models.
3️⃣ Innovation in Reasoning: Techniques like MCTS and step-by-step reasoning not only simplify complex problems but also pave the way for advancements in geometric proofs and symbolic reasoning.
This marks a paradigm shift in AI development, focusing on efficiency and specialization rather than sheer size. The potential applications for education, research, and industry are immense. 🌍
📌 As we await the open-source release of rStar-Math on GitHub (currently under internal review), it's clear this innovation will spark a new wave of exploration in compact, powerful AI systems.
#ArtificialIntelligence #AIInnovation #SmallLanguageModels #rStarMath #MicrosoftAI #MachineLearning #FutureOfAI
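For anyone new to MCTS, the heart of the selection step is usually a UCT-style score that balances exploiting good reasoning steps against exploring untried ones. Below is a generic, minimal sketch of that criterion; it is not the rStar-Math implementation, which additionally relies on a trained process reward model:

```python
# Generic UCT (Upper Confidence bound for Trees) selection, the standard
# exploration/exploitation rule used in MCTS-style reasoning search.
# Illustrative sketch only, not the rStar-Math codebase.
import math

def uct_score(total_value: float, visits: int, parent_visits: int, c: float = 1.41) -> float:
    """Higher = more promising. Unvisited nodes get infinity so they are tried first."""
    if visits == 0:
        return float("inf")
    exploitation = total_value / visits                            # average reward of this step
    exploration = c * math.sqrt(math.log(parent_visits) / visits)  # bonus for rarely-visited steps
    return exploitation + exploration

# Example: pick the next reasoning step among three candidate children.
children = [
    {"step": "Try factoring the polynomial", "value": 2.0, "visits": 4},
    {"step": "Substitute x = 1",             "value": 0.5, "visits": 1},
    {"step": "Apply the quadratic formula",  "value": 0.0, "visits": 0},
]
parent_visits = sum(ch["visits"] for ch in children)
best = max(children, key=lambda ch: uct_score(ch["value"], ch["visits"], parent_visits))
print(best["step"])  # the unvisited step wins first (score = inf)
```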
-
🚀 DeepSeek-R1: A Paradigm Shift in LLM Reasoning!
The AI landscape just witnessed a major breakthrough! DeepSeek-R1, a revolutionary Large Language Model (LLM), has shown that pure Reinforcement Learning (RL) can significantly enhance reasoning capabilities without a single byte of supervised fine-tuning data.
GitHub Repo: https://lnkd.in/g9SyDsdy
I decided to run it locally on my system and test its reasoning with a fun query: "How many 'r' are there in the word strawberry?" 🍓 DeepSeek-R1 responded with a fascinating chain of thought (refer to the image below). This demonstrates the model's ability to reason step by step, showcasing the power of RL-driven training in handling language tasks with ease! 🔥
Why Is DeepSeek-R1 Special?
💡 Zero Supervised Data, 100% RL: Forget traditional supervised fine-tuning; DeepSeek-R1-Zero evolved entirely through RL, improving itself over multiple iterations.
📊 Crushing Benchmarks Across Domains:
•🧮 Mathematics: Achieved a stunning 97.3% on MATH-500, surpassing many top-tier models.
•💻 Coding: Scored an impressive 96.3 percentile on Codeforces, demonstrating expert-level coding skills.
•🧠 General Reasoning: Excelled across diverse logic and reasoning benchmarks.
❤️ Open-Source Power: DeepSeek-AI has generously open-sourced versions of the model, ranging from 1.5B to 70B parameters, giving the AI community access to cutting-edge reasoning capabilities.
This is a game-changer in AI research. Smaller models achieving remarkable feats through knowledge distillation show that size isn't everything! Exciting times ahead 🚀
#AI #MachineLearning #DeepLearning #LLM #GenAI #AIAgents #ReinforcementLearning #DeepSeekR1 #OpenSource
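As a trivial sanity check on the prompt itself (nothing to do with the model), the expected answer can be confirmed in one line of Python:

```python
# Ground truth for the reasoning prompt: count the letter 'r' in "strawberry".
word = "strawberry"
print(word.count("r"))  # 3
```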
-
Simple guide about embeddings in LLM development
In the realm of AI, embeddings play a pivotal role by transforming non-numeric data into numerical form, enabling computers to process text and images effectively. In a Simple Guide to Retrieval Augmented Generation, you'll read about:
- The significance of embeddings in data science, machine learning, and AI
- Various embedding algorithms such as Word2Vec, GloVe, and BERT
- Practical examples of using embeddings to enhance text search, clustering, and more
Advancements in Large Language Models (LLMs) have revolutionized industries worldwide, but with great power come unique challenges. As we marvel at their text generation capabilities, concerns about reliability and trust have surfaced. To tackle these issues head-on, I've delved into the innovative world of Retrieval Augmented Generation (RAG). Having spent years developing the production-grade LLM-based application Yarnit, I've found RAG instrumental in my journey.
I'm thrilled to announce my upcoming book on RAG with Manning Publications Co., a comprehensive guide tailored for technology professionals eager to harness the potential of LLMs.
📚 What You'll Learn:
- Understand the fundamentals of Retrieval Augmented Generation
- Build LLM-based applications from scratch
- Navigate RAG pipelines from design to production
🔍 Chapters Include:
- Large Language Models and the Need for RAG
- Designing RAG-Enabled Systems
- Creating Knowledge Bases with Indexing Pipelines
- Real-Time Interaction and Contextual Responses
- Evaluating RAG Performance
- Technological Foundations for RAGOps
- Evolution of RAG Systems: From Basic to Advanced
- Nuances: Comparing RAG with Fine-Tuning and More
- Cutting-Edge Best Practices and Further Exploration
🌟 Manning Early Access Program (MEAP) Benefits:
- Immediate access to the book's draft and future updates
- Opportunity to shape the final content with your feedback
- Exclusive 50% discount until July 4th, 2024, with code "mlkimothi"
Join me in exploring the forefront of AI innovation and mastering RAG. Let's navigate the complexities together and unlock the full potential of Large Language Models.
👉 Reserve your copy today and embark on your journey with RAG: https://lnkd.in/gHkBVW8W
Thank you for your support and feedback as we dive into this transformative field. Feel free to drop your thoughts or questions about the Manning Early Access Program (MEAP) below. Let's build the future of AI together!
#AI #LLMs #RAG #MachineLearning #ManningPublications #MEAP #TechnologyInnovation #Techbook
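As a taste of what an indexing pipeline (one of the chapter topics above) involves at its simplest, here is a minimal, hypothetical sketch; the chunk sizes are arbitrary and the toy embedding function stands in for a real model:

```python
# Minimal indexing-pipeline sketch for a RAG knowledge base:
# split documents into overlapping chunks, embed each chunk, and store
# (chunk, vector) pairs so they can be retrieved by similarity later.
# Chunk sizes and the embed() stub are illustrative assumptions.
from typing import Callable, List, Tuple

def chunk_text(text: str, chunk_size: int = 500, overlap: int = 50) -> List[str]:
    """Fixed-size character chunks with a small overlap to avoid cutting ideas in half."""
    chunks, start = [], 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        start += chunk_size - overlap
    return chunks

def build_index(docs: List[str], embed: Callable[[str], List[float]]) -> List[Tuple[str, List[float]]]:
    """Return a list of (chunk, embedding) pairs -- the simplest possible 'vector store'."""
    index = []
    for doc in docs:
        for chunk in chunk_text(doc):
            index.append((chunk, embed(chunk)))
    return index

# Usage with a toy embedding function (replace with a real embedding model in practice):
toy_embed = lambda s: [len(s) / 1000.0, s.count("RAG") * 1.0]
index = build_index(["RAG combines retrieval with generation..." * 20], toy_embed)
print(len(index), "chunks indexed")
```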
-
🚀 Excited to Announce: Implementing a Retrieval-Augmented Generation (RAG) Application with Hugging Face! 🚀
Today, I'm thrilled to share that I've started building a RAG (Retrieval-Augmented Generation) application, leveraging the power of Hugging Face's models and APIs. 🎉
💡 What makes RAG special? It combines the ability to:
- Retrieve relevant documents from large datasets.
- Generate meaningful responses using language models, enhanced by real-time information from the retrieved data.
To achieve this, I'm using Hugging Face's API with a dedicated access token for seamless model access and integration. This application will unlock more intelligent, data-driven outputs across multiple use cases: chatbots, research tools, and beyond!
🔑 Why Hugging Face? The Hugging Face ecosystem provides a rich set of pre-trained models and robust infrastructure, making the development process faster and more scalable.
📅 I'm excited to share my journey of building this tool. Stay tuned for updates, and I'd love to hear your thoughts on how RAG could revolutionize AI-driven applications!
GitHub: https://lnkd.in/dhDitbvY
#AI #RAG #NLP #HuggingFace #MachineLearning #ArtificialIntelligence #LLM #DataScience #Innovation #Tech
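For readers curious what the skeleton of such an application can look like, here is a hedged sketch using the huggingface_hub InferenceClient. The model names, prompt format, and in-memory document list are illustrative assumptions (availability of specific models on the serverless Inference API varies), not necessarily what the linked repo does:

```python
# Hypothetical end-to-end RAG sketch using the Hugging Face Inference API.
# Model names, the prompt format, and the tiny in-memory "corpus" are all
# illustrative choices, not necessarily what the linked repo uses.
import numpy as np
from huggingface_hub import InferenceClient

client = InferenceClient(token="hf_xxx")  # your Hugging Face access token

EMBED_MODEL = "sentence-transformers/all-MiniLM-L6-v2"
GEN_MODEL = "mistralai/Mistral-7B-Instruct-v0.2"

docs = [
    "Our support line is open 9am-5pm on weekdays.",
    "Refunds are processed within 14 days of a return request.",
    "The premium plan includes priority support and offline access.",
]

def embed(text: str) -> np.ndarray:
    # feature_extraction returns an embedding vector for sentence-transformers models
    return np.array(client.feature_extraction(text, model=EMBED_MODEL)).flatten()

doc_vecs = [embed(d) for d in docs]

def answer(question: str, top_k: int = 2) -> str:
    q = embed(question)
    scores = [float(np.dot(q, v) / (np.linalg.norm(q) * np.linalg.norm(v))) for v in doc_vecs]
    context = "\n".join(docs[i] for i in np.argsort(scores)[::-1][:top_k])  # retrieve
    prompt = f"Answer using only this context:\n{context}\n\nQuestion: {question}\nAnswer:"
    return client.text_generation(prompt, model=GEN_MODEL, max_new_tokens=128)  # generate

print(answer("How long do refunds take?"))
```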
-
🌟 Can We Finally Overcome Context Limitations in LLMs? 🌟
Large Language Models (LLMs) have transformed how we process and generate text, yet they face a persistent challenge: context limitations. Whether you're generating text, analyzing trends, or making sense of vast data, the inability to handle long-term context effectively has been a roadblock, until now.
A recent breakthrough, detailed in this paper (https://lnkd.in/df6KuS_C), introduces Titans, a new class of architectures that could redefine how LLMs handle context. The key innovation? A neural memory module that learns to store and retrieve long-term information, enabling models to access historical context far beyond their traditional limits. Unlike standard Transformers, which face quadratic complexity and fixed window sizes, Titans scale to context windows exceeding 2 million tokens! This makes them ideal for complex, long-range tasks like:
+ Language modeling with richer context retention
+ Genomics for analyzing massive sequences
+ Time-series analysis over extended periods
The results are promising, with Titans outperforming both traditional Transformers and newer linear recurrent models, especially in tasks requiring nuanced long-term memory. 🌐✨
This breakthrough opens doors to LLMs that can reason across vast swathes of data without losing track of earlier details. Imagine the potential for fields like legal research, scientific discovery, and beyond!
#AI #MachineLearning #LLMs #DeepLearning #NeuralNetworks #Innovation #TechBreakthrough
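To put "quadratic complexity" into perspective, here is a rough back-of-the-envelope estimate (my own illustration, not a figure from the paper) of what naively materializing a full attention score matrix would cost at different context lengths:

```python
# Rough illustration of why naive full attention doesn't scale to 2M tokens:
# the attention score matrix alone grows with the square of the sequence length.
# Back-of-the-envelope numbers, ignoring FlashAttention-style tricks that avoid
# materializing the full matrix.

bytes_per_score = 2  # fp16

for seq_len in (4_096, 128_000, 2_000_000):
    scores = seq_len ** 2                   # one score per (query, key) pair, per head
    gib = scores * bytes_per_score / 2**30  # memory for a single head's score matrix
    print(f"{seq_len:>9,} tokens -> {gib:,.2f} GiB per attention head")

# ~0.03 GiB at 4k tokens, ~30 GiB at 128k, and ~7,450 GiB at 2M tokens per head,
# which is why sub-quadratic designs (linear recurrence, Titans-style neural memory)
# matter for multi-million-token contexts.
```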
-
I knew that humans have lats to make them strong, but recently I learned that LLMs have LATS too, for the same purpose of making them stronger and better. 💪
I came across an article (https://lnkd.in/dm_TRrw8) which introduced me to LATS (Language Agent Tree Search), an innovative approach that significantly enhances the reasoning and decision-making capabilities of Large Language Models. ✨
It allows LLMs to plan, reason, and act more effectively by leveraging external feedback through the integration of Monte Carlo Tree Search (MCTS) with an LLM-powered value function and self-reflection. 🚀
The main steps involved:
1️⃣ Selection: Begins by determining which segment of the tree is best suited for growth.
2️⃣ Expansion: Executes n actions to expand the tree, producing n new child nodes that are stored in an external long-term memory structure.
3️⃣ Evaluation: Assigns a score to each child node to guide the next selection. The value function is computed by reasoning about a given state.
4️⃣ Simulation: Expands nodes until a terminal state is reached. If the final state is successful, feedback is computed from it; otherwise, two more steps follow.
5️⃣ Backpropagation: Similar in spirit to neural networks, it refines the tree by propagating the outcome back up through the visited nodes.
6️⃣ Reflection: Stores both successful and failed trajectories to provide context to the agent and the value function, optimizing learning by integrating semantically meaningful experiences.
(A minimal sketch of this loop follows below.)
LATS overcomes the limitations of previous tree-based methods like ToT prompting, which relied solely on internal reasoning. This approach allows LLMs to consider multiple reasoning paths and adapt to environmental conditions without additional training. 🦾
LATS has demonstrated SOTA performance across various domains, including programming, QA, web navigation, and mathematics, achieving 92.7% accuracy on HumanEval with GPT-4. 💡
Looking forward to seeing how this evolves! 🔎
#LLM #LATS #AI #Genai #Research
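Here is the promised sketch: a stripped-down, illustrative skeleton of the selection / expansion / evaluation / simulation / backpropagation / reflection loop. The LLM calls are stubbed out as plain functions, and the real LATS implementation differs in many details:

```python
# Skeleton of a LATS-style search loop, mapping roughly onto the six steps above.
# Illustrative sketch only: propose_actions, evaluate_state, and reflect stand in
# for LLM calls, and this is not the actual LATS codebase.
import random

class Node:
    def __init__(self, state, parent=None):
        self.state, self.parent = state, parent
        self.children, self.visits, self.value = [], 0, 0.0

def lats_search(root_state, propose_actions, evaluate_state, is_terminal, reflect,
                iterations=10, n_expand=3):
    root = Node(root_state)
    reflections = []  # long-term memory of successful/failed trajectories (step 6)
    for _ in range(iterations):
        # 1) Selection: walk down to a promising leaf (greedy on average value, for brevity)
        node = root
        while node.children:
            node = max(node.children, key=lambda c: c.value / (c.visits or 1))
        # 2) Expansion: ask the "LLM" for candidate actions, conditioned on past reflections
        for action in propose_actions(node.state, reflections)[:n_expand]:
            node.children.append(Node(action, parent=node))
        # 3) Evaluation: LLM-powered value function scores each new child
        for child in node.children:
            child.value, child.visits = evaluate_state(child.state), 1
        # 4) Simulation: follow the best child until a terminal state is reached
        current = max(node.children, key=lambda c: c.value)
        while not is_terminal(current.state):
            current.children.append(Node(propose_actions(current.state, reflections)[0], parent=current))
            current = current.children[-1]
        reward = evaluate_state(current.state)
        # 5) Backpropagation: push the outcome back up the visited path
        walker = current
        while walker:
            walker.visits += 1
            walker.value += reward
            walker = walker.parent
        # 6) Reflection: store a verbal summary of what worked or failed
        reflections.append(reflect(current.state, reward))
    return root

# Toy usage with stubbed "LLM" functions, just to show the loop runs:
toy = lats_search(
    root_state="start",
    propose_actions=lambda s, refl: [f"{s}->a{random.randint(0, 9)}" for _ in range(3)],
    evaluate_state=lambda s: random.random(),
    is_terminal=lambda s: s.count("->") >= 3,
    reflect=lambda s, r: f"trajectory {s} scored {r:.2f}",
    iterations=5,
)
print(len(toy.children), "children expanded at the root")
```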