Evaluating my fine-tuned 3B model with an LLM judge - many of you have asked, so here goes :-). TL;DR: 91% accuracy vs. 47% for the non-fine-tuned model, with only 140 examples and 3 minutes of training (https://lnkd.in/gHw993ZS).

I started by creating a new synthetic test data set that the fine-tuned models haven't seen, using the same code as the training set (data here: https://lnkd.in/gcCSdT5e). I used Comet's Opik per Armand's recommendation (you should follow him if you aren't already :-)).

Then I built my evaluations to test for (1) JSON schema compliance, (2) distance from the reference answer, and (3) the LLM judge, which scores first on JSON format, then on compliance with the schema, then on entities being detected, then on not detecting more than necessary.

I tried this out with a bunch of LLMs that can run on this laptop. As you can see, the 3B fine-tuned model performed best, followed by the 1B, with all of the out-of-the-box ones fairly similar and not so great.

Code for you to enjoy: https://lnkd.in/guhzdUkf You'll need a Comet account, which is shockingly free. Enjoy :-)

Next I'm going to try to beef up the data gen with Distilabel and CrewAI - let me know if you find that interesting or have other ideas.
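For anyone who wants a feel for what the first two checks could look like as Opik custom metrics, here's a minimal sketch - not the code from the repo, just an illustration. It assumes a jsonschema dependency, a hypothetical ENTITY_SCHEMA placeholder standing in for the real extraction schema, and difflib as a crude stand-in for whatever reference-distance measure the actual evaluation uses:

```python
# Minimal sketch of two custom Opik metrics, following the BaseMetric
# pattern from the Opik docs. ENTITY_SCHEMA is a hypothetical placeholder;
# swap in the schema your fine-tuned model was trained to emit.
import difflib
import json

import jsonschema
from opik.evaluation.metrics import base_metric, score_result

ENTITY_SCHEMA = {  # hypothetical stand-in for the real extraction schema
    "type": "object",
    "properties": {"entities": {"type": "array", "items": {"type": "string"}}},
    "required": ["entities"],
}

class SchemaCompliance(base_metric.BaseMetric):
    """1.0 if the output parses as JSON and validates against the schema, else 0.0."""

    def __init__(self, name: str = "schema_compliance"):
        self.name = name

    def score(self, output: str, **ignored_kwargs) -> score_result.ScoreResult:
        try:
            jsonschema.validate(instance=json.loads(output), schema=ENTITY_SCHEMA)
            return score_result.ScoreResult(value=1.0, name=self.name)
        except (json.JSONDecodeError, jsonschema.ValidationError):
            return score_result.ScoreResult(value=0.0, name=self.name)

class ReferenceSimilarity(base_metric.BaseMetric):
    """Crude string similarity to the reference answer (1.0 = identical)."""

    def __init__(self, name: str = "reference_similarity"):
        self.name = name

    def score(self, output: str, reference: str, **ignored_kwargs) -> score_result.ScoreResult:
        ratio = difflib.SequenceMatcher(None, output, reference).ratio()
        return score_result.ScoreResult(value=ratio, name=self.name)
```

You'd then pass instances of these in the scoring_metrics list of opik.evaluation.evaluate, alongside the LLM-judge metric; the linked repo has the real versions.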
Great going, and Merry Xmas! If it's not too much trouble, could you list out or point me to your complete stack for the desktop LLM? Just trying to get all the big pieces.
Amazing results! Achieving 91% accuracy with 140 examples and 3 minutes is impressive. What’s your next step?
Impressive results! It’s amazing to see how fine-tuning with just a small dataset can lead to such a big accuracy boost. The use of synthetic test data and a structured evaluation process definitely seems like a smart approach for pushing the performance of the models.
This is wild! Thanks for sharing. Going to dig into this and model university soon!
Very informative
Thanks for sharing, Shadi Copty
Love this 👍 thx!
Insightful
A top follow for AI, ontology, and cross-domain explorations.
Did you test how it stands up to a larger-parameter model, like the 70B that helped make the synthetic data? I know the intent is to have a model you can run locally, but I'm also curious how a base 3.3 and a tuned 3.3 would compare if you followed the same steps. Claude 3.5 handles my entity extraction work like a champ, but obviously it's hundreds of billions of parameters larger and costs more $ to use.