Machine Learning Operations (MLOps) is a set of practices that aims to deploy and maintain machine learning models in production reliably and efficiently. It combines the principles of DevOps with machine learning to streamline the process of taking ML models from development to production. This article will provide a comprehensive guide to building an end-to-end MLOps pipeline.
Introduction to MLOps
MLOps bridges the gap between machine learning model development and its operationalization. It ensures that models are scalable, maintainable, and deliver value consistently. The primary goal of MLOps is to automate the machine learning lifecycle, integrating with existing CI/CD frameworks to enable continuous delivery of ML-driven applications.
It's a set of practices and tools that streamline the journey from model development to deployment, addressing key challenges such as:
- Ensuring reproducibility in data preprocessing and model training.
- Managing model versions effectively.
- Deploying models efficiently and safely.
- Monitoring model performance in production environments.
Building an End-to-End MLOps Pipeline: A Practical Guide
In this project, we're going to build an end-to-end MLOps pipeline, demonstrating how these practices work in real-world scenarios.
1. Our Objectives
- Identify a problem statement and gather relevant data
- Preprocess the data and develop a robust machine-learning model through hyperparameter tuning
- Implement version control for both data and model using Git and DVC
- Utilize MLflow for model registration
- Develop CI/CD workflows for model reports
- Create an interface to access the trained model using FastAPI
- Finally, Dockerize the project
This is the flow of the project, to give an overview:
(Figure: End-to-End MLOps Pipeline)
2. Problem Statement
The goal of this project is to predict the academic risk of students in higher education. This problem statement is derived from an active competition on Kaggle, providing a real-world context for our MLOps implementation.
3. Description of the Dataset
Let's start with a description of our data, as it forms the foundation of any machine learning project.
The dataset originated from a higher education institution and was compiled from several disjoint databases. It contains information about students enrolled in various undergraduate programs, including agronomy, design, education, nursing, journalism, management, social service, and technologies. The data encompasses:
- Information known at the time of student enrollment (academic path, demographics, and socio-economic factors)
- Students' academic performance at the end of the first and second semesters
The dataset is structured and labeled, with most columns already label-encoded. The target variable is formulated as a three-category classification task: Dropout, Enrolled, or Graduate. This classification is determined at the end of the normal duration of the course.
For a more detailed description of the dataset attributes, please refer to the dataset page: Predict Students' Dropout and Academic Success.
Initial Data Exploration and Insights: The dataset comprises 76,518 rows and 38 columns. All attributes are of integer or float data types, except for the target variable, which is an object type.
Key observations:
The target variable is imbalanced:
- Graduate: 36,282 rows
- Enrolled: 14,940 rows
- Dropout: the remaining 25,296 rows
Other fields also show imbalances, as revealed by univariate analysis.
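To confirm these counts directly before making any resampling decision, a quick pandas check is enough. The snippet below is a minimal sketch; the file path and the TARGET column name are assumptions for illustration.
Python
import pandas as pd

TARGET = 'Target'  # assumed target column name, as referenced throughout the code snippets

df = pd.read_csv('data/raw/train.csv')  # hypothetical path to the training data
print(df.shape)                   # -> (76518, 38)
print(df[TARGET].value_counts())  # class counts for Graduate / Enrolled / Dropout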
We will initially work with this imbalanced dataset and address the balance issue in later stages of our pipeline. In the next section, we'll dive into our data preprocessing steps and begin building our MLOps pipeline.
4. Starting With Preprocessing the Data
After our initial exploration, we moved on to preparing our data for modeling. Here's a detailed look at our preprocessing steps:
- Handling Missing Values: Fortunately, our dataset didn't contain any missing values, which simplified our preprocessing pipeline.
- Feature Selection: We removed the 'id' column as it doesn't contribute to our predictive model:
Python
X = df.drop(columns=[TARGET, 'id'])
y = df[TARGET]
- Feature Encoding: We applied different encoding techniques based on the nature of our features:
1. One-Hot Encoding: We used one-hot encoding for the 'Course' column to convert the categorical course labels into numerical features the model can consume:
Python
course_column = ['Course']
2. Label Encoding: For our target variable, we applied label encoding:
Python
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
y_encoded = le.fit_transform(y)
- Feature Scaling: We standardized all numerical columns using StandardScaler; this happens inside the preprocessing pipeline below (see the sketch after this item for how the list of numerical columns can be built).
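The numerical_columns list referenced in the pipeline isn't defined in the snippets above. A minimal sketch, assuming every feature column other than 'Course' is treated as numeric, would be:
Python
# Every feature column except the one-hot-encoded 'Course' column gets scaled
numerical_columns = [col for col in X.columns if col not in course_column]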
- Preprocessing Pipeline: We created a preprocessing pipeline using sklearn's ColumnTransformer to ensure consistent application of our preprocessing steps:
Python
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder

preprocessor = ColumnTransformer(
    transformers=[
        ('num', StandardScaler(), numerical_columns),
        ('course', OneHotEncoder(handle_unknown='ignore'), course_column)
    ],
    remainder='passthrough'
)
X_preprocessed = preprocessor.fit_transform(X)
This pipeline standardizes numerical features, one-hot encodes the 'Course' column, and passes through the remaining categorical columns.
By creating this preprocessing pipeline, we ensure that all our transformations are applied consistently across training and test sets, and can be easily reproduced in production environments. This is a crucial aspect of MLOps, as it maintains consistency between model development and deployment stages.
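Because the evaluation and serving steps later load this transformer from models/transformers/preprocessor.joblib, the fitted preprocessor also needs to be persisted. A minimal sketch of that step, assuming root_path is a pathlib Path pointing at the project root as in later snippets:
Python
import joblib

# Persist the fitted ColumnTransformer so later stages reuse the exact same transformations
transformer_dir = root_path / 'models' / 'transformers'
transformer_dir.mkdir(parents=True, exist_ok=True)
joblib.dump(preprocessor, transformer_dir / 'preprocessor.joblib')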
5. Model Selection and Training
After preprocessing our data, we moved on to the crucial steps of model selection and training. Our approach involves training multiple models to compare their performance and select the best one for our task.
Data Splitting: We begin by splitting our preprocessed data into training and testing sets. To ensure reproducibility, we use parameters defined in our params.yaml file:
Python
X_train, y_train = make_X_y(dataframe=train_data, target_column=TARGET)
This function reads the random state and split ratio from our configuration file, allowing us to easily adjust these parameters without changing our code.
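The helper implementations aren't reproduced in the article. One way the configuration-driven split and the make_X_y helper might look, as a sketch with hypothetical params.yaml key names:
Python
import yaml
from sklearn.model_selection import train_test_split

with open('params.yaml') as f:
    params = yaml.safe_load(f)

# Hypothetical key names; the real ones live in params.yaml
split_cfg = params.get('make_dataset', {})
train_data, val_data = train_test_split(
    df,
    test_size=split_cfg.get('test_size', 0.2),
    random_state=split_cfg.get('random_state', 42)
)

def make_X_y(dataframe, target_column):
    # Assumed behaviour: separate the feature matrix from the target column
    X = dataframe.drop(columns=[target_column])
    y = dataframe[target_column]
    return X, y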
1. Model Selection
We created a comprehensive list of models to evaluate for our classification task. These models are defined in our models_list.py file for easy access and modification.
Each model is initialized with parameters specified in our params.yaml file, allowing for easy hyperparameter tuning:
Python
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier, AdaBoostClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB

# 'params' holds the dictionary loaded from params.yaml
models = {
    'RandomForest': RandomForestClassifier(**params.get('train_model', {}).get('random_forest', {})),
    'LogisticRegression': LogisticRegression(**params.get('train_model', {}).get('logistic_regression', {})),
    'SVC': SVC(**params.get('train_model', {}).get('svc', {})),
    'DecisionTree': DecisionTreeClassifier(**params.get('train_model', {}).get('decision_tree', {})),
    'GradientBoosting': GradientBoostingClassifier(**params.get('train_model', {}).get('gradient_boosting', {})),
    'AdaBoost': AdaBoostClassifier(**params.get('train_model', {}).get('adaboost', {})),
    'KNN': KNeighborsClassifier(**params.get('train_model', {}).get('knn', {})),
    'GaussianNB': GaussianNB(**params.get('train_model', {}).get('gaussian_nb', {}))
}
2. Model Training and Evaluation
We then iterate through our list of models, training each one on our preprocessed data:
Python
for model_name, model in models.items():
    logging.info(f'{model_name} is training...')
    trained_model = train_model(model=model, X_train=X_train, y_train=y_train)
After training, we immediately evaluate each model's performance:
Python
# Metrics here are computed on the training split; validation-set evaluation happens later in the pipeline
accuracy, f1, precision, recall = evaluate_model(model=trained_model, X_test=X_train, y_test=y_train)
We calculate key metrics including accuracy, F1 score, precision, and recall. These metrics give us a comprehensive view of each model's performance, allowing us to make an informed decision about which model to select for deployment.
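The evaluate_model helper isn't reproduced in the article. A minimal sketch of what it computes, using sklearn's metric functions with weighted averaging (an assumption, consistent with the f1_weighted scoring used later during tuning):
Python
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

def evaluate_model(model, X_test, y_test):
    # Predict and compute the four metrics reported in the logs (assumed implementation)
    y_pred = model.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred)
    f1 = f1_score(y_test, y_pred, average='weighted')
    precision = precision_score(y_test, y_pred, average='weighted')
    recall = recall_score(y_test, y_pred, average='weighted')
    return accuracy, f1, precision, recall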
3. Model Saving
Finally, we save each trained model for future use:
Python
save_model(model=trained_model, save_path=model_output_path_)
This step is crucial in our MLOps pipeline, as it allows us to version our models and easily deploy or rollback as needed.
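The save_model helper itself isn't shown; a minimal sketch, assuming it serializes the estimator with joblib (the same library used to load the tuned models later), would be:
Python
import joblib

def save_model(model, save_path):
    # Serialize the trained estimator so it can be versioned and reloaded later (assumed implementation)
    joblib.dump(model, save_path)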
By systematically training and evaluating multiple models, we can identify the best performing model for our specific task of predicting academic risk. In the next section, we'll dive deeper into our model evaluation results and discuss how we select the final model for deployment.
6. Hyperparameter Tuning
After our initial model training, we move on to one of the most crucial steps in machine learning: hyperparameter tuning. This process helps us optimize our models' performance by finding the best combination of hyperparameters.
1. Setting Up MLflow for Experiment Tracking
Let's begin by setting up MLflow, a powerful tool for tracking our experiments:
Python
def setup_mlflow():
    # Start the local tracking server and point the MLflow client at it
    start_mlflow_server()
    mlflow.set_tracking_uri("http://localhost:5000")
    experiment_name = "Hyperparameter Tuning"
    mlflow.set_experiment(experiment_name)
MLflow allows us to log our hyperparameters, metrics, and models, making it easy to compare different runs and reproduce our results.
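As a quick illustration of what gets tracked, a run can log parameters and metrics like this. This is a generic sketch, not the project's exact logging code; the values shown are placeholders.
Python
import mlflow

mlflow.set_tracking_uri("http://localhost:5000")

with mlflow.start_run(run_name="RandomForest_tuning"):
    params_to_log = {"n_estimators": 500, "max_depth": 10}  # example hyperparameters
    f1_weighted = 0.85                                      # placeholder metric value
    mlflow.log_params(params_to_log)
    mlflow.log_metric("f1_weighted", f1_weighted)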
2. Models and Hyperparameters
We focus on tuning the two most accurate models from the training stage above:
- Random Forest Classifier
- Gradient Boosting Classifier
For each model, we define a set of hyperparameters to tune:
Python
models_to_tune = {
    'RandomForest': (RandomForestClassifier(), {
        'n_estimators': [500],
        'max_depth': [10, None],
        'min_samples_split': [10],
        'min_samples_leaf': [1],
        'max_features': ['sqrt']
    }),
    'GradientBoosting': (GradientBoostingClassifier(), {
        'n_estimators': [400],
        'learning_rate': [0.1],
        'max_depth': [4],
        'min_samples_split': [2],
        'min_samples_leaf': [1],
        'max_features': ['sqrt']
    })
}
3. Hyperparameter Tuning Process
We use RandomizedSearchCV for our hyperparameter tuning, which randomly samples from the parameter space:
Python
from sklearn.model_selection import RandomizedSearchCV

def hyperparameter_tuning(model, param_dist, X_train, y_train, X_val, y_val, n_iter=100, cv=5):
    random_search = RandomizedSearchCV(model, param_distributions=param_dist, n_iter=n_iter, cv=cv,
                                       scoring='f1_weighted', n_jobs=-1, verbose=2, random_state=9)
    random_search.fit(X_train, y_train)
    # The refitted best estimator is what gets logged to MLflow and saved below
    best_model = random_search.best_estimator_
We save the best model for each type:
Python
mlflow.sklearn.log_model(best_model, f"{model_name}_best", signature=signature)
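The signature passed to log_model isn't constructed in the excerpt above; it is typically produced with mlflow.models.infer_signature, roughly as follows:
Python
from mlflow.models import infer_signature

# Infer the tuned model's input/output schema from the training features and its predictions
signature = infer_signature(X_train, best_model.predict(X_train))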
4. Selecting the Best Models
After tuning, we select the two best-performing models based on their F1 scores:
Python
tuned_model_results.sort(key=lambda x: x[1], reverse=True)
best_tuned_model1 = tuned_model_results[0][0] if len(tuned_model_results) > 0 else None
best_tuned_model2 = tuned_model_results[1][0] if len(tuned_model_results) > 1 else None
These top two models are then saved for further use in our pipeline.
By implementing this rigorous hyperparameter tuning process, we ensure that our models are optimized for our specific task of predicting academic risk. The use of MLflow for experiment tracking allows us to easily compare different runs and select the best-performing models.
7. Model Evaluation
After hyperparameter tuning, we move on to the crucial step of evaluating our best model and generating predictions for the test set. This process ensures that our model performs well on unseen data and prepares us for submission.
1. Loading the Best Model
We start by loading our best-tuned model, which was selected based on its performance during hyperparameter tuning:
Python
model_name = best_tuned_model1 + "_tuned.joblib"
model_path = root_path / 'models' / 'tuned_models' / model_name
model = joblib.load(model_path)
We also load the preprocessor.joblib saved during preprocessing to ensure consistent column transformations:
Python
preprocessor_path = root_path / 'models' / 'transformers' / 'preprocessor.joblib'
preprocessor = joblib.load(preprocessor_path)
2. Evaluation on Validation Set
We evaluate our model on the validation set to get a final assessment of its performance:
Python
def evaluate_and_log(model, X, y, dataset_name):
    y_pred = get_predictions(model, X)
    accuracy, f1, precision, recall = calculate_metrics(y, y_pred)
    logging.info(f'\nMetrics for {dataset_name} dataset:')
    logging.info(f'Accuracy: {accuracy:.4f}')
    logging.info(f'F1 Score: {f1:.4f}')
    logging.info(f'Precision: {precision:.4f}')
    logging.info(f'Recall: {recall:.4f}')

# Evaluate on validation set
val_data = load_dataframe(val_path)
X_val, y_val = make_X_y(val_data, TARGET)
evaluate_and_log(model, X_val, y_val, "Validation")
This step provides us with key performance metrics (accuracy, F1 score, precision, and recall) on our validation set, giving us confidence in our model's generalization ability.
By following this structured approach to model evaluation and prediction, we ensure that our MLOps pipeline not only produces a well-tuned model but also generates reliable predictions for real-world use. Logging the performance metrics and evaluating on the validation set are key steps in maintaining transparency and reproducibility in our machine learning workflow.
8. Visualization and Results Analysis
After model evaluation and prediction, it's crucial to visualize our results to gain deeper insights into our model's performance and the dataset characteristics. We've created several informative visualizations to help us understand our model better.
Setting Up: We start by loading our validation data, test predictions, and the best-tuned model.
Confusion Matrix
We visualize the confusion matrix to understand our model's performance across different classes:
Python
def plot_confusion_matrix(y_true, y_pred, model_name, plot_dir):
    cm = confusion_matrix(y_true, y_pred)
    plt.figure(figsize=(12, 10))
    sns.heatmap(cm, annot=True, fmt='d', cmap='Blues')
    plt.title(f'Confusion Matrix - {model_name}')
    plt.ylabel('Actual')
    plt.xlabel('Predicted')
    plt.tight_layout()
    plt.savefig(plot_dir / f'{model_name}_confusion_matrix.png')
    plt.close()
Output:
(Figure: confusion matrix of the best model)
This visualization helps us identify which of the three classes our model predicts well and where it tends to make mistakes.
Feature Importance
For models that support it, we plot feature importance to understand which features are most influential in our predictions:
Python
def plot_feature_importance(model, X, model_name, plot_dir):
    if hasattr(model, 'feature_importances_'):
        importances = model.feature_importances_
        feature_importance = pd.DataFrame({'feature': X.columns, 'importance': importances})
        feature_importance = feature_importance.sort_values('importance', ascending=False)
        plt.figure(figsize=(12, 10))
        sns.barplot(x='importance', y='feature', data=feature_importance.head(20))
        plt.title(f'Top 20 Feature Importance - {model_name}')
        plt.tight_layout()
        plt.savefig(plot_dir / f'{model_name}_feature_importance.png')
        plt.close()
Output:
(Figure: top features of the model)
This plot shows that features such as 'Curricular units 2nd sem (approved)' are driving our model's decisions, which can be valuable for feature selection and model interpretation.
9. Continuous Integration with CML
In our MLOps pipeline, Continuous Integration (CI) plays a crucial role in automating the process of model training, evaluation, and reporting. We use GitHub Actions along with CML (Continuous Machine Learning) to achieve this. Here's how our CI pipeline works:
YAML
name: CI using CML
on:
  push
jobs:
  build:
    name: build
    runs-on: ubuntu-latest
    steps:
      - name: Checkout Repository
        uses: actions/checkout@v4
      - name: Install Python
        uses: actions/setup-python@v5
        with:
          python-version: '3.11'
      - uses: iterative/setup-cml@v2
      - name: Install Dependencies
        run: |
          python -m pip install --upgrade pip
          pip install -r requirements.txt
      - name: Create CML Report/Graphs
        env:
          REPO_TOKEN: ${{ secrets.CML_TOKEN }}
        run: |
          echo "# Model Evaluation Results" >> report.md
          echo "## Bar Graph for Cross Val Scores" >> report.md
          # a markdown image line embedding the results plot is appended to report.md here
          echo "" >> report.md
          cml comment create report.md
This sets up CML, which we'll use for creating a markdown report with our model evaluation results. It includes:
- A title for the report
- A subtitle for the cross-validation scores graph
- An embedded image of our results plot
- The CML command to create a comment with this report
The REPO_TOKEN environment variable is set using a secret token, which allows CML to post comments to our repository.
This CI pipeline ensures that every time we push changes to our repository:
- Our code is automatically checked out
- The necessary environment is set up
- Our model is re-trained and evaluated
- A report with the latest results is generated and posted as a comment
This automation is crucial in MLOps as it allows us to continuously monitor our model's performance as we make changes to our code or data. It provides immediate feedback on how our changes affect model performance, enabling faster iteration and more robust model development.
10. Model Deployment with FastAPI
After training, tuning, and evaluating our model, the next crucial step in our MLOps pipeline is deploying the model to make it accessible for real-time predictions. For this project, we've chosen to use FastAPI, a modern, fast (high-performance) web framework for building APIs with Python.
- We start by importing the necessary libraries and setting up our FastAPI application. FastAPI is built on Starlette and Pydantic rather than Flask, though its decorator-based routing will feel familiar to Flask users.
- We then initialize our FastAPI app and mount a static folder for serving HTML content:
Python
from fastapi import FastAPI
from fastapi.staticfiles import StaticFiles
app = FastAPI()
app.mount("/static", StaticFiles(directory="static"), name="static")
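The model_pipe object used by the predictions endpoint below isn't shown in this excerpt. A minimal sketch of how it could be assembled from the saved artifacts, with paths matching the directories copied into the Docker image and a hypothetical tuned-model filename:
Python
import joblib
from sklearn.pipeline import Pipeline

# Load the fitted preprocessor and the best tuned model saved earlier in the pipeline
preprocessor = joblib.load("models/transformers/preprocessor.joblib")
model = joblib.load("models/tuned_models/RandomForest_tuned.joblib")  # hypothetical filename

# Chain them so raw request data is transformed exactly as during training
model_pipe = Pipeline(steps=[("preprocessor", preprocessor), ("model", model)])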
Defining API Endpoints
We define two main endpoints:
1. A home route that serves an HTML page:
Python
@app.get('/')
def home():
    return HTMLResponse(content=open('static/index.html').read(), status_code=200)
2. A predictions route that accepts POST requests with input data and returns predictions:
Python
@app.post('/predictions')
def do_predictions(test_data: PredictionDataset):
    try:
        data_dict = test_data.dict(by_alias=True)
        X_test = pd.DataFrame([data_dict])
        predictions = model_pipe.predict(X_test)
        predictions = predictions.tolist()
        return {"predicted_academic_success_score": predictions[0]}
    except Exception as e:
        logging.error(f"Prediction error: {e}")
        raise HTTPException(status_code=500, detail="Internal Server Error")
This endpoint uses our PredictionDataset Pydantic model to validate incoming data, processes it through our pipeline, and returns the prediction.
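The PredictionDataset model itself lives in data_models.py (copied into the Docker image). A small illustrative sketch of its shape, showing only a few of the dataset's columns with hypothetical field names:
Python
from pydantic import BaseModel, Field

class PredictionDataset(BaseModel):
    # Only a handful of illustrative fields are shown; the real model declares one field
    # per dataset column, with aliases matching the original column names.
    marital_status: int = Field(..., alias="Marital status")
    admission_grade: float = Field(..., alias="Admission grade")
    curricular_units_2nd_sem_approved: int = Field(..., alias="Curricular units 2nd sem (approved)")
With by_alias=True in the endpoint above, the resulting dataframe columns carry the original dataset's column names, so the saved preprocessor can be applied unchanged.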
Running the Application
Finally, we set up the application to run using Uvicorn:
Python
if __name__ == "__main__":
    uvicorn.run(app="app:app", host="0.0.0.0", port=8000)
The static/index.html page contains a form that collects the input attributes from the user and a JavaScript function that POSTs them to '/predictions'; on the server side, the input is passed through the saved transformer, the best model generates the prediction, and the result is displayed on the same page.
Benefits of This Approach
- Fast and Efficient: FastAPI is designed for high performance, making it suitable for production deployments.
- Easy to Use: The framework provides intuitive decorators and type hints, making the code clean and easy to understand.
- Automatic Documentation: FastAPI automatically generates OpenAPI (Swagger) documentation for our API.
- Data Validation: By using Pydantic models, we ensure that incoming data is validated before processing.
- Error Handling: We've implemented proper error handling to catch and log any issues during prediction.
This deployment setup allows us to serve our model predictions via a RESTful API, making it easy to integrate with various applications or services.
11. Dockerization
In the final stages of our end-to-end MLOps project, we successfully integrated FastAPI into our machine learning pipeline to create a robust, scalable web application. This section delves into the Docker setup we used to containerize our FastAPI application, ensuring that it is both portable and easy to deploy.
1. Dockerfile Configuration
A key component of our deployment strategy was the creation of a Dockerfile, which defines the environment for our FastAPI application.
Dockerfile
# Use an official Python runtime as a parent image
FROM python:3.9-slim
# Set the working directory in the container
WORKDIR /app
# Copy specific files and directories into the container
COPY app.py /app/
COPY params.yaml /app/
COPY models/tuned_models /app/models/tuned_models
COPY models/transformers /app/models/transformers
COPY static/ /app/static/
COPY data_models.py /app/
COPY src/logger.py /app/src/
COPY src/models/models_list.py /app/src/models/
# Copy the requirements file
COPY requirements.txt /app/
# Install any needed packages specified in requirements.txt
RUN pip install --no-cache-dir -r requirements.txt
# Make port 8000 available to the world outside this container
EXPOSE 8000
# Set the entry point to run the app
CMD ["uvicorn", "app:app", "--host", "0.0.0.0", "--port", "8000"]
2. Building and Running the Docker Container
With the Dockerfile set up, we used the following commands to build and run our Docker image; they are defined as DVC stages so they run as part of the pipeline:
YAML
build_docker_image:
  cmd: |
    docker build -t academic-success-predictor .
  deps:
    - Dockerfile
    - requirements.txt
run_docker_container:
  cmd: |
    docker run --rm academic-success-predictor
We run the Docker container from the built image. The --rm flag ensures that the container is removed after it stops, keeping our environment clean.
(Figure: output after running the Docker image)
Key Benefits of Docker:
- Consistent Development Environments
- Streamlined Deployment Process
- Improved Development Workflow
- Portability Across Different Platforms
- Efficient Continuous Integration and Continuous Deployment (CI/CD)
- Better Collaboration and Sharing
Conclusion
This project illustrated the end-to-end MLOps process, from problem identification to model deployment and monitoring. Each stage of the pipeline, including data preprocessing, model training, version control, and deployment, was executed to create a robust and maintainable machine learning solution.