Presentation 7 (2)
Machine Learning Pipeline
Muhammad Omer - i220572
Shariq Usman - i220447
Azeem Chaudary - i220479
Introduction to Problems
• Data Imbalance:
• Class 0 (60.2%), Class 1 (39.8%)
• Missing Values:
• 18.5% of rows have missing values
• Distribution Issues:
• Skewed features (e.g., feature_4)
• Outliers (e.g., feature_7)
• Correlation Issues:
• Weak target correlation
• Dataset too small
• GPU slowed things down
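The checks behind the figures above (class shares, share of rows with missing values, skew) can be sketched as follows. This is a minimal, dependency-free illustration on toy data, not the project's actual analysis code; the function names and the sample values are assumptions made for the example.

```python
import math
from collections import Counter

def class_shares(labels):
    """Return each class's share of the dataset as a percentage."""
    counts = Counter(labels)
    total = len(labels)
    return {cls: 100.0 * n / total for cls, n in counts.items()}

def missing_row_share(rows):
    """Percentage of rows that contain at least one missing value (None or NaN)."""
    def is_missing(v):
        return v is None or (isinstance(v, float) and math.isnan(v))
    bad = sum(1 for row in rows if any(is_missing(v) for v in row))
    return 100.0 * bad / len(rows)

def skewness(values):
    """Population skewness (third standardized moment); > 0 means a long right tail."""
    n = len(values)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / n
    if var == 0:
        return 0.0
    return sum((v - mean) ** 3 for v in values) / n / var ** 1.5

# Toy data (hypothetical, not the project dataset)
labels = [0, 0, 0, 1, 1]
rows = [[1.0, 2.0], [None, 3.0], [4.0, float("nan")], [5.0, 6.0]]
shares = class_shares(labels)
missing = missing_row_share(rows)
skew = skewness([1.0, 1.0, 1.0, 10.0])
```

A clearly positive or negative skewness value flags a feature as a candidate for a transform such as log scaling.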
Proposed Solution
1 Class Imbalance:
Applied undersampling
2 Missing Data:
Filled using median
3 Feature Engineering:
• Log-scaled Feature 4
• Squared Feature 4
• Sine & cosine transform on Feature 6
• Clustered adjusted Feature 4 & Feature 7 into a new feature
[Diagram: ML Preprocessing: undersampling, median filling, clustering to new feature, periodic, normalization]
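The preprocessing steps listed above can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the project's code: the column indices (`f4`, `f6`), the `period` of the sine/cosine transform, and the random seed are hypothetical, Feature 4 is assumed to be non-negative so that `log1p` is defined, and the clustering step is left out because the slides do not name the algorithm.

```python
import math
import random
from statistics import median

def undersample(rows, labels, seed=0):
    """Randomly drop rows so every class keeps only as many rows as the smallest class."""
    rng = random.Random(seed)
    by_class = {}
    for row, label in zip(rows, labels):
        by_class.setdefault(label, []).append(row)
    target = min(len(group) for group in by_class.values())
    out_rows, out_labels = [], []
    for label, group in by_class.items():
        for row in rng.sample(group, target):
            out_rows.append(row)
            out_labels.append(label)
    return out_rows, out_labels

def fill_median(rows):
    """Replace None in each column with the median of that column's observed values."""
    n_cols = len(rows[0])
    medians = []
    for j in range(n_cols):
        observed = [r[j] for r in rows if r[j] is not None]
        medians.append(median(observed))
    return [[medians[j] if r[j] is None else r[j] for j in range(n_cols)] for r in rows]

def engineer(row, f4=0, f6=1, period=24.0):
    """Append log, square, and sine/cosine features (indices and period are assumed)."""
    x4, x6 = row[f4], row[f6]
    angle = 2.0 * math.pi * x6 / period
    return row + [math.log1p(x4), x4 ** 2, math.sin(angle), math.cos(angle)]

# Toy data (hypothetical): column 0 plays Feature 4, column 1 plays Feature 6
rows = [[1.0, 0.0], [None, 6.0], [3.0, 12.0], [5.0, None], [7.0, 18.0]]
labels = [0, 0, 0, 1, 1]
balanced_rows, balanced_labels = undersample(rows, labels)
filled = fill_median(balanced_rows)
features = [engineer(r) for r in filled]
```

The sine/cosine pair encodes a cyclic feature so that values at the two ends of the cycle land close together, which a single raw column cannot express.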
Parallelization Approach
• Multi-threading: Ran tasks on all CPU cores for preprocessing, training, and testing wherever parallel processing was possible.
• Broke data into chunks: Processed pieces at the same time.
[Diagram: Parallelize: split tasks, multi-threading]
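The chunk-and-thread approach described above can be sketched with Python's standard `concurrent.futures` module. This is a minimal illustration, not the project's implementation: `process_chunk` is a hypothetical stand-in for the real per-chunk work, and the worker count defaults to the number of CPU cores as an assumption.

```python
import os
from concurrent.futures import ThreadPoolExecutor

def split_into_chunks(items, n_chunks):
    """Split a list into at most n_chunks contiguous, near-equal pieces."""
    n_chunks = max(1, min(n_chunks, len(items)))
    size, extra = divmod(len(items), n_chunks)
    chunks, start = [], 0
    for i in range(n_chunks):
        end = start + size + (1 if i < extra else 0)
        chunks.append(items[start:end])
        start = end
    return chunks

def process_chunk(chunk):
    """Hypothetical per-chunk task: divide each value by a fixed constant."""
    return [v / 100.0 for v in chunk]

def parallel_map(items, worker, n_workers=None):
    """Run worker on each chunk in a thread pool and stitch results back in order."""
    n_workers = n_workers or os.cpu_count() or 1
    chunks = split_into_chunks(items, n_workers)
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        # map() yields results in the same order as the input chunks
        parts = list(pool.map(worker, chunks))
    return [v for part in parts for v in part]

data = list(range(10))
out = parallel_map(data, process_chunk, n_workers=4)
```

In CPython, threads speed up CPU-bound work mainly when the work releases the global interpreter lock, as many NumPy and scikit-learn routines do; for pure-Python loops, a process pool is the usual alternative.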
• Best:
• XGBoost (59.38% accuracy)
• Random Forest Classifier (RFC) (88.69% speedup)
• Resource Use:
• Memory: 486.58 MB.
• CPU: 0.00%.
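The slides do not say how the speedup percentage was computed. One common reading, assumed here, is the percentage of wall-clock time saved relative to the serial run. The sketch below shows that calculation alongside the plain speedup factor; the function names and the timings are hypothetical.

```python
import time

def percent_time_saved(serial_seconds, parallel_seconds):
    """Share of the serial run time that the parallel run eliminated, in percent."""
    return 100.0 * (serial_seconds - parallel_seconds) / serial_seconds

def speedup_factor(serial_seconds, parallel_seconds):
    """How many times faster the parallel run is than the serial run."""
    return serial_seconds / parallel_seconds

def timed(fn, *args):
    """Run fn(*args) and return (result, elapsed wall-clock seconds)."""
    start = time.perf_counter()
    result = fn(*args)
    return result, time.perf_counter() - start

# Hypothetical timings: 10 s serial, 2 s parallel
saved = percent_time_saved(10.0, 2.0)
factor = speedup_factor(10.0, 2.0)
total, elapsed = timed(sum, [1, 2, 3])
```

Under this reading, a saving of 88.69% would correspond to a speedup factor of roughly 8.8.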
Conclusion
• Achieved 80% faster processing, best accuracy at 59.38%.
• Worked well on old systems with low resource use.
Future Work:
• Test with larger data for better results.
• Explore advanced models for higher accuracy.
[Graphic: keywords: Machine Learning, AI, LLMs, parallel processing, performance, distributed computing]