Evaluating my fine-tuned 3B model with an LLM judge - many of you have asked, so here goes :-). TL;DR: 91% accuracy vs. 47% for the non-fine-tuned model, with only 140 examples and 3 minutes of training (https://lnkd.in/gHw993ZS).

I started by creating a new synthetic test data set that the fine-tuned models haven't seen, using the same code as the training set (data here: https://lnkd.in/gcCSdT5e). I used Comet's Opik per Armand's recommendation (you should follow him if you aren't already :-)).

Then I built my evaluations to test for (1) JSON schema compliance, (2) distance from the reference answer, and (3) the LLM judge, which scores first on JSON format, then on compliance with the schema, then on entities being detected, then on not detecting more than necessary.

I tried this out with a bunch of LLMs that can run on this laptop. As you can see, the 3B fine-tuned model performed best, followed by the 1B, with all of the out-of-the-box ones fairly similar and not so great.

Code for you to enjoy: https://lnkd.in/guhzdUkf You'll need a Comet account, which is shockingly free. Enjoy :-)

Next I'm going to try to beef up the data gen with Distilabel and CrewAI - let me know if you find that interesting or have other ideas.
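For anyone who wants a feel for what the first two checks could look like as Opik custom metrics, here's a minimal sketch - not the code from the repo, just an illustration. It assumes a jsonschema dependency, a hypothetical ENTITY_SCHEMA placeholder standing in for the real extraction schema, and difflib as a crude stand-in for whatever reference-distance measure the actual evaluation uses:

```python
# Minimal sketch of two custom Opik metrics, following the BaseMetric
# pattern from the Opik docs. ENTITY_SCHEMA is a hypothetical placeholder;
# swap in the schema your fine-tuned model was trained to emit.
import difflib
import json

import jsonschema
from opik.evaluation.metrics import base_metric, score_result

ENTITY_SCHEMA = {  # hypothetical stand-in for the real extraction schema
    "type": "object",
    "properties": {"entities": {"type": "array", "items": {"type": "string"}}},
    "required": ["entities"],
}

class SchemaCompliance(base_metric.BaseMetric):
    """1.0 if the output parses as JSON and validates against the schema, else 0.0."""

    def __init__(self, name: str = "schema_compliance"):
        self.name = name

    def score(self, output: str, **ignored_kwargs) -> score_result.ScoreResult:
        try:
            jsonschema.validate(instance=json.loads(output), schema=ENTITY_SCHEMA)
            return score_result.ScoreResult(value=1.0, name=self.name)
        except (json.JSONDecodeError, jsonschema.ValidationError):
            return score_result.ScoreResult(value=0.0, name=self.name)

class ReferenceSimilarity(base_metric.BaseMetric):
    """Crude string similarity to the reference answer (1.0 = identical)."""

    def __init__(self, name: str = "reference_similarity"):
        self.name = name

    def score(self, output: str, reference: str, **ignored_kwargs) -> score_result.ScoreResult:
        ratio = difflib.SequenceMatcher(None, output, reference).ratio()
        return score_result.ScoreResult(value=ratio, name=self.name)
```

You'd then pass instances of these in the scoring_metrics list of opik.evaluation.evaluate, alongside the LLM-judge metric; the linked repo has the real versions.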
Great going, and Merry Xmas! If it's not too much trouble, could you list out or point me to your complete stack for the desktop LLM? Just trying to get all the big pieces.
Amazing results! Achieving 91% accuracy with 140 examples and 3 minutes is impressive. What’s your next step?
Impressive results! It’s amazing to see how fine-tuning with just a small dataset can lead to such a big accuracy boost. The use of synthetic test data and a structured evaluation process definitely seems like a smart approach for pushing the performance of the models.
This is wild! Thanks for sharing. Going to dig into this and model university soon!
Very informative
Thanks for sharing, Shadi Copty
Love this 👍 thx!
Insightful
A top follow for AI, ontology, and cross-domain explorations.
Did you test how it stands up to a larger-parameter model, like the 70B that helped make the synthetic data? I know the intent is to have a model you can run locally, but I'm also curious how a base 3.3 and a tuned 3.3 would compare if you followed the same steps. Claude 3.5 handles my entity extraction work like a champ, but obviously it's hundreds of billions of parameters larger and costs more $ to use.