Creating Task-Specific ML Models & Enhancing their Real World Usage

May 12, 2025

In today's rapidly evolving AI landscape, the ability to create machine learning (ML) models tailored to specific tasks is becoming increasingly critical. While generic, pre-trained models offer a valuable starting point, they often lack the precision and nuance required to address the unique challenges of specific applications. For instance, a general-purpose image recognition model might struggle to accurately identify subtle anomalies in medical images or to differentiate between various types of manufacturing defects. This necessitates the creation of custom solutions that are specifically designed and trained to excel in their intended applications.

Creating task-specific models is an optimisation problem, and the optimisation spans the full lifecycle, from model training through model serving and inference. Large models with billions of parameters require huge computational infrastructure not just for training but for serving as well, leading to high energy consumption, higher latency during inference, and increased cost of the service.

Two prominent techniques for building these task-specific models are Supervised Fine-Tuning (SFT) and Model Distillation. SFT adapts pre-trained models to new tasks by training them on labeled datasets, while Distillation transfers knowledge from large, complex models to smaller, more efficient ones. This article compares the two approaches, highlighting their strengths, weaknesses, and suitability for different scenarios, so you can make an informed decision about which method best suits your requirements.

Supervised Fine-Tuning (SFT): Adapting Pre-trained Models

SFT is the process of adapting a pre-trained machine learning model, such as a large language model (LLM), to a specific task by training it on a labeled dataset. This involves, broadly, three steps. First, select the pre-trained model most relevant to the task at hand; this leverages the broad features already learned during pre-training. Second, prepare a task-specific dataset, carefully cleaned, preprocessed, and labeled to ensure accuracy and consistency. Finally, fine-tune the pre-trained model on that dataset. This approach significantly reduces the amount of data and computational resources required compared to training a high-performing model from scratch.
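
As a rough illustration of these three steps, here is a minimal sketch using the Hugging Face Transformers library; the base model name, data file, and "text" column schema are illustrative assumptions, not details from any particular project.

```python
# Minimal SFT sketch with Hugging Face Transformers. The model name,
# data file, and dataset schema are illustrative assumptions.
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer, Trainer,
                          TrainingArguments, DataCollatorForLanguageModeling)

base = "meta-llama/Llama-2-7b-hf"        # step 1: pick a relevant pre-trained model
tokenizer = AutoTokenizer.from_pretrained(base)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(base)

# Step 2: a cleaned, labeled, task-specific dataset (assumed JSONL file
# with a "text" column containing prompt/response examples).
data = load_dataset("json", data_files="task_data.jsonl")["train"]
tokenized = data.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=512),
    batched=True, remove_columns=data.column_names)

# Step 3: fine-tune the pre-trained model on the task data.
trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="sft-out", num_train_epochs=3,
                           per_device_train_batch_size=4, learning_rate=2e-5),
    train_dataset=tokenized,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```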

The fine-tuning process itself has improved from full fine-tuning to parameter-efficient fine-tuning (PEFT). Rather than updating all the parameters, which is akin to full training, PEFT is a compute- and time-efficient approach that updates a selected subset of parameters while keeping the rest of the model frozen. Multiple PEFT techniques have evolved. QLoRA (Quantized Low-Rank Adaptation) is a PEFT technique that incorporates advanced quantisation to improve the efficiency of fine-tuning: it converts model parameters from 16- or 32-bit representations into a compact 4-bit representation (NF4), reducing compute and storage costs. Information loss during quantisation is minimised by exploiting the statistical properties of normally distributed weights. Thanks to these optimisations, SFT can now be done on commercially available hardware. Applying RLHF (Reinforcement Learning from Human Feedback) after SFT can yield even better results.
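
To make the QLoRA idea concrete, the sketch below loads a base model in 4-bit NF4 precision and attaches low-rank adapters using the bitsandbytes and peft libraries; the model name, rank, and target modules are illustrative assumptions rather than recommended settings.

```python
# QLoRA sketch: 4-bit NF4 quantization plus LoRA adapters. Hyperparameters
# and target modules here are illustrative, not prescriptive.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",              # NF4: 4-bit "normal float" format
    bnb_4bit_compute_dtype=torch.bfloat16,  # compute still runs in 16-bit
    bnb_4bit_use_double_quant=True,
)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    quantization_config=bnb_config, device_map="auto")
model = prepare_model_for_kbit_training(model)

lora = LoraConfig(r=16, lora_alpha=32, lora_dropout=0.05,
                  target_modules=["q_proj", "v_proj"],  # only these are updated
                  task_type="CAUSAL_LM")
model = get_peft_model(model, lora)
model.print_trainable_parameters()  # typically well under 1% trainable
```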

SFT provides a low-compute option for building a model trained for specific tasks, given that the starting point is a pre-trained model. However, the process carries challenges that must be tackled. The biggest risk is catastrophic forgetting, where the model's parameters are adjusted so heavily that it loses its previous learning and can no longer perform other tasks. Hyperparameter tuning, such as finding the optimal learning rate and batch size, can take a lot of time. Finally, there are issues of data quality and overfitting while fine-tuning the model.
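
One common heuristic against catastrophic forgetting, offered here as an assumed mitigation rather than something prescribed above, is to mix a small fraction of general-domain examples back into the fine-tuning set:

```python
# Replay sketch: blend general-domain examples into the task dataset so the
# model keeps seeing data close to its pre-training distribution.
from datasets import concatenate_datasets

def with_replay(task_ds, general_ds, replay_fraction=0.1):
    n_replay = int(len(task_ds) * replay_fraction)
    replay = general_ds.shuffle(seed=42).select(range(n_replay))
    return concatenate_datasets([task_ds, replay]).shuffle(seed=42)
```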

SFT should be used when your goal is high performance on a well-defined task and you have access to a high-quality labeled dataset for fine-tuning. It can be used to infuse your model with domain expertise, such as in the legal or financial domain, or for task-specific customisation, such as generating marketing copy for specialised products. You can also use SFT to ensure your model's output follows a specific format and tone, for example during customer interactions, to handle edge cases, or to control when it should give crisp versus detailed responses. SFT has been successfully applied in various domains, building on top of models such as Gemini, Llama, and GPT-4, for use cases like generating notes from doctor-patient conversations, providing customer support in more than 50 dialects, and financial management.

Model Distillation: Transferring Knowledge Efficiently

Model Distillation is essentially a knowledge-transfer technique in which a smaller, more efficient "student" model learns from a larger, complex "teacher" model. This is particularly useful when deploying models on resource-constrained devices or when reducing the computational cost of inference is a priority. In the distillation process, a state-of-the-art teacher model is selected. This model generates two types of training data: hard targets, which are labeled prompt/output pairs, and soft targets, which provide a probability distribution over possible outputs rather than a single correct answer. Soft targets give the student model a richer view of the teacher's decision-making process.
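
A minimal sketch of how hard and soft targets are typically combined, in the style of classic knowledge distillation; the temperature and mixing weight are conventional defaults, not values from this article:

```python
# Distillation loss sketch: cross-entropy on hard labels plus KL divergence
# against the teacher's temperature-softened distribution.
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels,
                      temperature=2.0, alpha=0.5):
    hard = F.cross_entropy(student_logits, labels)   # hard-target term
    soft = F.kl_div(                                 # soft-target term
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * temperature ** 2        # rescale gradients for the temperature
    return alpha * hard + (1 - alpha) * soft
```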

The distillation process can be enhanced by techniques such as data augmentation, where the teacher model generates a larger, more inclusive dataset that improves generalisation, and by using an ensemble of multiple teacher models, which results in more robust learning.
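
For the multi-teacher variant, one simple approach, sketched below under the assumption that each teacher is a callable returning logits over a shared vocabulary, is to average the teachers' softened distributions before distilling:

```python
# Multi-teacher sketch: average softened teacher distributions to form a
# single, more robust soft target for the student.
import torch
import torch.nn.functional as F

@torch.no_grad()
def ensemble_soft_targets(teachers, inputs, temperature=2.0):
    probs = [F.softmax(t(inputs) / temperature, dim=-1) for t in teachers]
    return torch.stack(probs).mean(dim=0)  # averaged teacher distribution
```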

Distilled models provide many advantages. They are much smaller: GPT-4o reportedly has around 200B parameters, while GPT-4o mini is estimated at roughly 8B. Distilled models therefore need less compute and storage, which yields lower latency, highly desirable in real-time use cases like customer support, and a lower cost of service. Gemini 1.5 Flash is the distilled version of Gemini 1.5 Pro, and the cost difference between them is more than 15 times; a similar gap can be seen between Gemini 2.5 Pro ($10 per million output tokens) and Gemini 2.5 Flash ($0.60 per million output tokens). It makes sense to choose distilled models where speed matters more than accuracy; accuracy drops are significant only on highly complex reasoning tasks and may not be noticeable elsewhere. Their smaller size also makes distilled models more accessible: they can be deployed on edge devices and made broadly available for use cases across education, finance, healthcare, and beyond.

Knowledge loss is one of the primary challenges of the distillation process: a smaller model cannot represent all the nuances of understanding contained in the larger one. And while the final distilled model is small and easily deployable, the distillation process itself requires substantial compute and storage infrastructure. The student model's capabilities are also bounded by the teacher's, so it is crucial to have a high-quality teacher model, training which can itself be a challenging task. Finally, extensive hyperparameter tuning is required to complete the distillation process.

SFT vs Distillation and Beyond

Both SFT and Distillation are optimisation techniques. SFT, specifically QLoRA, allows fine-tuning of even the biggest models on low-compute, low-storage infrastructure. However, the resulting models still require high compute and storage for deployment and inference: they are easier to train and adapt, but remain large and cumbersome in real-world usage. Distillation, on the other hand, produces smaller models that need far less compute and storage during deployment and inference, but the distillation process itself demands substantial infrastructure, and distilled models generally underperform those created through SFT.

The choice between these techniques essentially boils down to your objective. If you want a full-scale model with high performance adapted to a specific task, go for SFT; obtaining a high-quality labeled dataset is essential in that case. If instead you face resource constraints or want to minimise the cost of running the model at scale in real time, go for distillation, which also works when a high-quality labeled dataset is not available.

In the future, this may not be a question of choice. Researchers are working on combining the best of both to produce hybrid techniques such as KD-LoRA (knowledge distillation with LoRA), which aim to deliver models that are both powerful and practical. These hybrids bring their own challenges, and new strategies are required to overcome them.
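
As a rough, hypothetical illustration of the hybrid idea only, a KD-LoRA-style setup might attach LoRA adapters to a small student and train just those adapters with a distillation loss against a frozen teacher. The sketch below reuses the distillation_loss function from the earlier example; model names are placeholders, and this is not a reference implementation of any published method.

```python
# Hypothetical KD-LoRA sketch: freeze the student's base weights, attach
# LoRA adapters, and train only the adapters against a frozen teacher.
# Teacher and student are assumed to share a vocabulary (here, size 32000).
import torch
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

teacher = AutoModelForCausalLM.from_pretrained("large-teacher")  # placeholder
student = AutoModelForCausalLM.from_pretrained("small-student")  # placeholder
student = get_peft_model(student, LoraConfig(
    r=8, lora_alpha=16, target_modules=["q_proj", "v_proj"],
    task_type="CAUSAL_LM"))
optimizer = torch.optim.AdamW(
    (p for p in student.parameters() if p.requires_grad), lr=1e-4)

# One illustrative step on a dummy batch of token ids.
batch = torch.randint(0, 32000, (2, 16))
with torch.no_grad():
    t_logits = teacher(input_ids=batch).logits
s_logits = student(input_ids=batch).logits
labels = batch[:, 1:].reshape(-1)                # next-token hard targets
loss = distillation_loss(
    s_logits[:, :-1].reshape(-1, s_logits.size(-1)),
    t_logits[:, :-1].reshape(-1, t_logits.size(-1)),
    labels)
loss.backward()
optimizer.step()
```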

Why Pythian?

Pythian focuses on data operations and curation techniques that bring your organization immediate visible benefits within the ever-evolving AI ecosystem. We believe that AI architecture has to be nimble and performant. Build once and use it with every LLM, including the latest one on the block. Pythian, as your AI partner, can help you prepare for the coming changes and protect and enhance your AI investment.

At Pythian, we stay ahead of the latest innovations in the world of AI. We have already launched multiple AI offerings; check them out to see what makes sense for you, or just talk to us today!
