Machine Downtime Prediction
import optuna
try:
    feature_names = preprocessor.get_feature_names_out()
except AttributeError:
    feature_names = X_train.columns  # Fallback if `get_feature_names_out` is unavailable
         Date            Machine_ID Assembly_Line_No   ...   ...
0  2021-12-08  Makino-L2-Unit1-2015     Shopfloor-L2   4.5  47.9
1  2021-12-17  Makino-L2-Unit1-2015     Shopfloor-L2  21.7  47.5
2  2021-12-17  Makino-L1-Unit1-2013     Shopfloor-L1   5.2  49.4
3  2021-12-17  Makino-L1-Unit1-2013     Shopfloor-L1  24.4  48.1
4  2021-12-21  Makino-L2-Unit1-2015     Shopfloor-L2  14.1  51.8
Preprocessing
We have to divide the numeric columns into those that are skewed and those that are approximately normal, so that the appropriate transformation (normalization or standardization) can be applied to each group and bias avoided. A minimal sketch of this split follows.
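One way to implement this split, assuming X_train is the raw training frame; the |skew| > 1 cutoff, the transformer choices, and the categorical column list here are illustrative assumptions, not the notebook's exact settings (the one-hot encoding of Machine_ID is suggested by the feature names discussed later):

import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, PowerTransformer, StandardScaler

numeric_cols = X_train.select_dtypes(include=np.number).columns
skewness = X_train[numeric_cols].skew()
skewed_cols = skewness[skewness.abs() > 1].index.tolist()  # illustrative cutoff
normal_cols = [c for c in numeric_cols if c not in skewed_cols]

preprocessor = ColumnTransformer([
    ('skewed', PowerTransformer(), skewed_cols),   # tames skew, then standardizes
    ('normal', StandardScaler(), normal_cols),     # zero mean, unit variance
    ('cat', OneHotEncoder(handle_unknown='ignore'),
     ['Machine_ID', 'Assembly_Line_No']),          # categorical columns assumed here
])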
Cross Validation
Since our problem is a classification task, Stratified K-Fold (StratifiedKFold) will be used for cross-validation, for the reasons below; a minimal setup is sketched after the list.
Preserves Class Distribution: Stratified K-Fold ensures that each fold maintains the same
proportion of classes as the overall dataset, which is crucial when dealing with
classification problems, even if there is no visible class imbalance.
More Reliable Performance Estimates: It provides a more stable and representative
estimate of your model’s performance compared to ShuffleSplit, which may produce folds
with different class distributions.
Better Generalization: Ensures that all classes are well represented in training and
validation splits, reducing the risk of biased results.
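A minimal stratified CV setup under these assumptions; the fold count and seed are illustrative, and X_train/y_train follow the notebook's naming:

from sklearn.model_selection import StratifiedKFold

# 5 folds and a fixed seed are illustrative choices, not the notebook's exact settings.
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

for train_idx, val_idx in skf.split(X_train, y_train):
    # Each fold preserves the overall class proportions.
    X_train_fold, X_val_fold = X_train.iloc[train_idx], X_train.iloc[val_idx]
    y_train_fold, y_val_fold = y_train.iloc[train_idx], y_train.iloc[val_idx]
    # ...fit and score the pipeline here (see the loop body under Hyperparameter Tuning)...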
Precision: Measures how many of the predicted failures were actually failures. A high
precision means fewer false positives.
Recall: Measures how many of the actual failures were correctly identified. A high recall
means fewer false negatives.
F1-Score: Harmonic mean of precision and recall, balancing both. Higher is better.
ROC AUC: Measures the model’s ability to distinguish between classes. A value closer to
1 is better.
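As a quick numeric illustration of these definitions (toy labels, not project data):

from sklearn.metrics import f1_score, precision_score, recall_score

# Toy example: 1 = failure, 0 = no failure.
y_true = [1, 1, 1, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0]
print(precision_score(y_true, y_pred))  # 2/3: one false positive
print(recall_score(y_true, y_pred))     # 2/3: one false negative
print(f1_score(y_true, y_pred))         # 2/3: harmonic mean of the two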
Observations
In [8]: model_results_df.head(10)
Hyperparameter Tuning
from sklearn.metrics import f1_score, precision_score, recall_score, roc_auc_score
from sklearn.pipeline import Pipeline

pipeline = Pipeline([
    ('preprocessor', preprocessor),
    ('classifier', model)
])

# Inside the StratifiedKFold loop: fit on the training fold, score on the validation fold.
pipeline.fit(X_train_fold, y_train_fold)
y_pred = pipeline.predict(X_val_fold)
# Use predicted probabilities for ROC AUC when the model supports them;
# otherwise fall back to the raw decision scores.
y_prob = (pipeline.predict_proba(X_val_fold)[:, 1]
          if hasattr(model, 'predict_proba')
          else pipeline.decision_function(X_val_fold))

f1_scores.append(f1_score(y_val_fold, y_pred))
precision_scores.append(precision_score(y_val_fold, y_pred))
recall_scores.append(recall_score(y_val_fold, y_pred))
roc_auc_scores.append(roc_auc_score(y_val_fold, y_prob))
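Given the import optuna at the top of the notebook, the tuning itself was presumably done with an Optuna study. A minimal sketch of one, assuming the preprocessor and skf defined above; the search space, metric, and trial count are illustrative, not the notebook's exact choices:

import optuna
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

def objective(trial):
    # Illustrative search space; the notebook's actual ranges may differ.
    params = {
        'n_estimators': trial.suggest_int('n_estimators', 100, 500),
        'learning_rate': trial.suggest_float('learning_rate', 1e-3, 0.3, log=True),
        'max_depth': trial.suggest_int('max_depth', 2, 8),
    }
    clf = Pipeline([
        ('preprocessor', preprocessor),
        ('classifier', GradientBoostingClassifier(**params, random_state=42)),
    ])
    # Reuse the stratified folds and optimize the same F1 metric reported above.
    return cross_val_score(clf, X_train, y_train, cv=skf, scoring='f1').mean()

study = optuna.create_study(direction='maximize')
study.optimize(objective, n_trials=50)
print(study.best_params)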
Maintains excellent performance on the test set, with a very high ROC AUC and F1-Score.
This indicates strong generalization ability, meaning it's likely to perform well on new,
unseen data.
Performs very well on the test set, with a high ROC AUC and F1-Score.
While slightly behind XGBoost and Gradient Boosting, it's still a strong model.
Observations
All three models generalize well to the test set, confirming their strong performance observed
during training and validation. Gradient Boosting has a slight edge in F1-Score on the test set,
suggesting a better balance of precision and recall compared to XGBoost. The performance
differences between the models are relatively small, indicating that all three are good
candidates for deployment.
Recommendations
Model Selection:
Gradient Boosting emerged as the top performer, achieving the highest F1-score among the models evaluated. This reflects its superior balance between precision (minimizing false alarms) and recall (capturing the majority of actual downtime events).
Top Features:
Both models strongly prioritize Hydraulic Pressure (Pa), Torque (Nm), and Cutting (N) as the
most influential factors. This suggests that variations in these parameters significantly impact
machine failures.
The one-hot encoded Machine_ID features have the lowest importance: This
suggests that machine-specific factors are not as crucial as operational
parameters (e.g., pressure, torque, cutting force).
XGBoost:
Gradient Boosting:
Hydraulic Pressure (Pa), Torque (Nm), and Cutting (N) are the strongest
predictors of machine downtime. If these parameters exceed a threshold,
the likelihood of failure increases.
Coolant and spindle-related factors play a secondary role, suggesting that
temperature regulation and machine stability (vibration) contribute to
faults.
In [24]: # Get feature importance for the optimized Gradient Boosting model
         gb_feature_importance = get_feature_importance(best_gb, "Gradient Boosting")

         # Get feature importance for the optimized XGBoost model
         xgb_feature_importance = get_feature_importance(best_xgb, "XGBoost")
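get_feature_importance is a helper defined earlier in the notebook and not shown in this excerpt. A plausible reconstruction, assuming best_gb and best_xgb are fitted pipelines whose final step exposes feature_importances_; the body below is a hypothetical sketch, not the notebook's actual definition:

import pandas as pd

def get_feature_importance(fitted_pipeline, model_name):
    # Hypothetical reconstruction of the notebook's helper.
    classifier = fitted_pipeline.named_steps['classifier']
    pre = fitted_pipeline.named_steps['preprocessor']
    try:
        names = pre.get_feature_names_out()
    except AttributeError:
        # Same fallback pattern as earlier in the notebook.
        names = [f'feature_{i}' for i in range(len(classifier.feature_importances_))]
    importance = (pd.DataFrame({'feature': names,
                                'importance': classifier.feature_importances_})
                  .sort_values('importance', ascending=False))
    print(f'{model_name} feature importance:')
    print(importance.head(10))
    return importance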