"We introduce phi-3-mini, a 3.8 billion parameter language model trained on 3.3 trillion tokens, whose overall performance, as measured by both academic benchmarks and internal testing, rivals that of models such as Mixtral 8x7B and GPT-3.5 (e.g., phi-3-mini achieves 69% on MMLU and 8.38 on MT-bench), despite being small enough to be deployed on a phone. The innovation lies entirely in our dataset for training, a scaled-up version of the one used for phi-2, composed of heavily filtered web data and synthetic data. The model is also further aligned for robustness, safety, and chat format. We also provide some initial parameter-scaling results with a 7B and 14B models trained for 4.8T tokens, called phi-3-small and phi-3-medium, both significantly more capable than phi-3-mini (e.g., respectively 75% and 78% on MMLU, and 8.7 and 8.9 on MT-bench)." https://lnkd.in/g9mSgaRC).
Sergei RYBALKO’s Post
More Relevant Posts
Updated 📣 Phi-3 Technical Report: A Highly Capable Language Model Locally on Your Phone 💥
Updated phi-3 tech report with final numbers for 7B/14B and a new section on phi-3-V (e.g., MMMU at 40.4, in the ballpark of Claude 3-haiku and Gemini-1.0 pro)!
Abstract: We introduce phi-3-mini, a 3.8 billion parameter language model trained on 3.3 trillion tokens, whose overall performance, as measured by both academic benchmarks and internal testing, rivals that of models such as Mixtral 8x7B and GPT-3.5 (e.g., phi-3-mini achieves 69% on MMLU and 8.38 on MT-bench), despite being small enough to be deployed on a phone. The innovation lies entirely in our dataset for training, a scaled-up version of the one used for phi-2, composed of heavily filtered publicly available web data and synthetic data. The model is also further aligned for robustness, safety, and chat format. We also provide some initial parameter-scaling results with 7B and 14B models trained for 4.8T tokens, called phi-3-small and phi-3-medium, both significantly more capable than phi-3-mini (e.g., respectively 75% and 78% on MMLU, and 8.7 and 8.9 on MT-bench). Moreover, we also introduce phi-3-vision, a 4.2 billion parameter model based on phi-3-mini with strong reasoning capabilities for image and text prompts.
👉 https://lnkd.in/gAMAxXBN
#machinelearning
Phi-3-mini is a 3.8 billion parameter language model trained on 3.3 trillion tokens whose overall performance, as measured by both academic benchmarks and internal testing, rivals that of models such as Mixtral 8x7B and GPT-3.5 (e.g., phi-3-mini achieves 69% on MMLU and 8.38 on MT-bench), despite being small enough to be deployed on a phone. The innovation lies entirely in the dataset used for training, a scaled-up version of the one used for phi-2, composed of heavily filtered web data and synthetic data. https://lnkd.in/dbSu2fzB
Phi-3 Technical Report: A Highly Capable Language Model Locally on Your Phone
#Microsoft launched Phi-3 - a series of lightweight AI models 🤖 designed to deliver the capabilities of large language models in a more accessible and efficient package 📦. It's perfect for developers and businesses looking to integrate AI without the heavy computational costs.
📏 Available Versions:
➡ Phi-3 Mini: 𝟑.𝟖𝐁 parameters
➡ Phi-3 Small: 𝟕𝐁 parameters (coming soon)
➡ Phi-3 Medium: 𝟏𝟒𝐁 parameters (coming soon)
Phi-3 Mini matches the benchmarks of models ten times its size, such as GPT-3.5, and is ideal for mobile applications, personal devices, and any platform where efficiency is key. If your strategy centers on smaller models, Phi-3 might be a good fit for you (a minimal loading sketch follows the link below). Here's the link to the full paper: https://lnkd.in/dEx7bcmN
#AI #Innovation #Technology #MicrosoftAI #Phi3Mini #ArtificialIntelligence
Phi-3 Technical Report: A Highly Capable Language Model Locally on Your Phone
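For developers who want to kick the tires, here is a minimal sketch of loading Phi-3 Mini with the Hugging Face transformers library. The checkpoint name "microsoft/Phi-3-mini-4k-instruct", the prompt, and the generation settings are illustrative assumptions; check the model card for the exact identifier, hardware requirements, and recommended settings.

from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "microsoft/Phi-3-mini-4k-instruct"  # assumed Hugging Face checkpoint name

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype="auto",      # keep the checkpoint's native precision
    trust_remote_code=True,  # older transformers releases need the repo's custom code
)

# Build a chat-formatted prompt and generate a short reply.
messages = [{"role": "user", "content": "Why are small language models useful on-device?"}]
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=128, do_sample=False)

# Decode only the newly generated tokens, skipping the prompt.
reply = tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
print(reply)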
Phi-3 Technical Report: A Highly Capable Language Model Locally on Your Phone
Paper: https://lnkd.in/g9fcAe3u
Abstract: "We introduce phi-3-mini, a 3.8 billion parameter language model trained on 3.3 trillion tokens, whose overall performance, as measured by both academic benchmarks and internal testing, rivals that of models such as Mixtral 8x7B and GPT-3.5 (e.g., phi-3-mini achieves 69% on MMLU and 8.38 on MT-bench), despite being small enough to be deployed on a phone. The innovation lies entirely in our dataset for training, a scaled-up version of the one used for phi-2, composed of heavily filtered web data and synthetic data. The model is also further aligned for robustness, safety, and chat format. We also provide some initial parameter-scaling results with a 7B and 14B models trained for 4.8T tokens, called phi-3-small and phi-3-medium, both significantly more capable than phi-3-mini (e.g., respectively 75% and 78% on MMLU, and 8.7 and 8.9 on MT-bench)."
Large language models (LLMs) like GPT-4o or Gemini have showcased impressive capabilities. However, they are inherently limited by their static training data, making them less effective for users seeking current or highly specific information. This is where Retrieval Augmented Generation (RAG) steps in.
What is RAG? RAG enhances LLMs by retrieving facts from external sources, like Wikipedia or private documents, enabling chatbots to answer questions by referencing authoritative knowledge.
Advantages of RAG:
- Up-to-date Answers: RAG uses the latest information from external sources, not just static training data.
- Reduced Hallucination: Grounding LLMs with external knowledge minimizes inaccurate responses.
- Authoritative Answers: Access to authoritative sources, including private records, ensures accurate responses.
- Cost-Effective: Specific external resources reduce infrastructure costs and simplify development.
- Dynamic Responses: RAG responds based on the most current data.
Limitations of RAG:
- Lack of Iterative Reasoning: RAG may retrieve relevant documents but struggle with complex, nuanced queries requiring human-like reasoning.
- Data Organization Matters: The effectiveness of RAG relies on well-organized and properly tagged data. Poorly structured data hinders accurate retrieval.
Image source: Sohrab Rahimi, Ph.D.
#AI #MachineLearning #Chatbots #RAG #MetaAI #Innovation
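To make the retrieve-then-ground idea from the post above concrete, here is a minimal Python sketch. The knowledge snippets, the word-overlap scoring, and the call_llm placeholder are all illustrative assumptions; a production system would use embeddings, a vector database, and a real LLM API instead.

# Minimal illustration of RAG-style grounding: pick the snippet that best
# overlaps with the question and prepend it to the prompt as context.

KNOWLEDGE_BASE = [
    "The company support line is open Monday to Friday, 9am to 6pm CET.",
    "Refunds are processed within 14 days of receiving the returned item.",
    "Premium subscribers get priority shipping on all orders.",
]

def score(question: str, snippet: str) -> int:
    """Crude relevance score: number of shared lowercase words."""
    return len(set(question.lower().split()) & set(snippet.lower().split()))

def retrieve(question: str, k: int = 1) -> list[str]:
    """Return the k snippets with the highest overlap with the question."""
    return sorted(KNOWLEDGE_BASE, key=lambda s: score(question, s), reverse=True)[:k]

def build_grounded_prompt(question: str) -> str:
    """Integrate retrieved facts into the prompt so the model answers from them."""
    context = "\n".join(retrieve(question))
    return (
        "Answer the question using only the context below.\n"
        f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
    )

question = "How long do refunds take?"
print(build_grounded_prompt(question))
# The grounded prompt would then go to any LLM, e.g. response = call_llm(prompt),
# where call_llm is a hypothetical placeholder for your model or API of choice.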
There is a crucial concept in the world of Language Models (LLMs) called RAG (Retrieval-Augmented Generation). Let's break it down in simple terms.
Imagine a GPT (similar to ChatGPT) that can address any question. However, if this GPT was trained on data up to two years ago, it might incorrectly name Kareem Abdul-Jabbar as the NBA player with the most career points, instead of the correct answer, LeBron James. This discrepancy arises from the cutoff of the data it was trained on.
To ensure accurate responses, we employ RAG. By providing the correct context (e.g., the information about LeBron James) alongside the prompt, the GPT will generate the accurate response: "LeBron James." But why do we need to provide context if we already know the answer? The challenge lies in scenarios where the context is unknown. This is where retrieval strategies play a vital role.
In summary, RAG operates through the following steps (sketched in code below):
1. Query Input: Begins with a user query or prompt.
2. Document Retrieval: Searches a database for relevant documents.
3. Contextual Integration: Integrates retrieved documents to enhance the model's response.
4. Response Generation: Combines factual accuracy with language generation capabilities to produce a coherent response.
#LLM #Watsonx #RAG #GPT
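Here is a small Python sketch that mirrors those four steps. The tiny document list, the difflib-based retriever, and the generate_response placeholder are assumptions made purely for illustration; a real pipeline would use an embedding index and an actual LLM call.

import difflib

# A tiny stand-in for a document database (illustrative only).
DOCUMENTS = [
    "In February 2023, LeBron James became the NBA's all-time leading scorer.",
    "Kareem Abdul-Jabbar held the NBA all-time scoring record from 1984 to 2023.",
    "The NBA regular season consists of 82 games per team.",
]

def retrieve_documents(query: str, k: int = 2) -> list[str]:
    """Step 2 - Document Retrieval: find the documents most similar to the query."""
    return difflib.get_close_matches(query, DOCUMENTS, n=k, cutoff=0.0)

def integrate_context(query: str, documents: list[str]) -> str:
    """Step 3 - Contextual Integration: fold retrieved documents into the prompt."""
    context = "\n".join(f"- {doc}" for doc in documents)
    return f"Use the facts below to answer.\nFacts:\n{context}\n\nQuestion: {query}"

def generate_response(prompt: str) -> str:
    """Step 4 - Response Generation: placeholder for a call to an actual LLM."""
    return f"<LLM answer grounded in:\n{prompt}>"

# Step 1 - Query Input: the user's question.
query = "Which NBA player has scored the most career points?"
print(generate_response(integrate_context(query, retrieve_documents(query))))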
It's crazy to think that the Internet already contains a significant amount of AI-generated content, and if people keep scraping the web to pre-train language models, newer LLMs will end up learning largely from data generated by current LLMs. This looks like an infinite loop, and even if we consider supervised fine-tuning and RLHF (both not so easy to scale), a model's knowledge is still bounded by the data it has seen in pretraining, so dramatic leaps between generations become harder to observe (GPT-3.5 to GPT-4o). In my opinion, the bottleneck is text generation, as there is only so much we can do with text. There are two ways I can think of to mitigate this issue:
1. A remarkable breakthrough in transformer technology.
2. Making AI systems truly multimodal with real-world experiences.
Would love to hear what you guys think of this.
Ever wondered just how much information a Large Language Model (LLM) can hold in a single conversation? This limit, known as "context length," defines how much detail you can pack into one prompt for a richer, more insightful response.
Though context length may seem like it's simply a word count, LLMs actually measure input in tokens. Typically, a token corresponds to about four characters in English, or roughly ¾ of a word, so 100 tokens amount to around 75 words.
As of today, here's the context length for several popular pre-trained LLMs:
- Llama: 2K tokens
- Llama 2: 4K tokens
- GPT-3.5-turbo: 4K tokens
- GPT-4: 8K tokens
- Mistral 7B: 8K tokens
- Palm-2: 8K tokens
- Gemini: 32K tokens
For those working with Generative AI applications, it's essential to monitor token usage in each model interaction, as this can directly impact pricing, especially for hosted models.
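As a practical illustration of counting tokens before sending a prompt, here is a minimal sketch using the tiktoken library (pip install tiktoken). The context-limit numbers simply reuse the values from the post above, and the choice of the cl100k_base encoding assumes a GPT-3.5/GPT-4-style tokenizer; other models ship their own tokenizers.

import tiktoken

CONTEXT_LIMITS = {  # illustrative values taken from the list above
    "gpt-3.5-turbo": 4_096,
    "gpt-4": 8_192,
}

def count_tokens(text: str, encoding_name: str = "cl100k_base") -> int:
    """Return the number of tokens the text occupies under the given encoding."""
    encoding = tiktoken.get_encoding(encoding_name)
    return len(encoding.encode(text))

def fits_in_context(prompt: str, model: str, reserved_for_reply: int = 512) -> bool:
    """Check whether the prompt plus a reply budget stays within the model's window."""
    return count_tokens(prompt) + reserved_for_reply <= CONTEXT_LIMITS[model]

prompt = "Summarize the Phi-3 technical report in three bullet points."
print(count_tokens(prompt))                      # roughly len(prompt) / 4 tokens
print(fits_in_context(prompt, "gpt-3.5-turbo"))  # True for a short prompt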