Data Science

This document discusses machine learning fairness and automated machine learning (AutoML). It defines fairness in machine learning models and discusses causes of unfairness such as skewed data samples and limited features. It also discusses challenges in ensuring fairness and the benefits of interpretable machine learning models for fairness. The document then discusses AutoML and common tasks automated by AutoML systems like data preprocessing, feature engineering, model selection and hyperparameter optimization. Finally, it discusses some popular AutoML frameworks like TPOT, Auto-Sklearn, H2O and the ongoing role of data science experts with increased AutoML capabilities.

Uploaded by

Mohamed Harun

Self Introduction

Latest Happenings in Data Science

By
Manikandan
Outline

● ML Fairness
● AutoML
● Domain Experts in DS
Fairness of Machine Learning Model (FairML)
Mention of ML Fairness in Research Papers
Thoughts?

What if I told you computers can treat you unfairly?

Would you believe me?

Google Translation In Action
Commercial Gender Image Classification
An interesting paper, Gender Shades: Intersectional Accuracy Disparities in Commercial Gender Classification, reveals bias in commercial algorithms.
Microsoft's Twitter-based AI chatbot Tay
XING, a job platform similar to LinkedIn
The list goes on...
Bias in ML is almost inevitable when the application involves people.
It has already harmed people in minority groups or historically disadvantaged groups.
If no one cares, it is highly likely that the next person who suffers from biased treatment will be one of us.
Definition of Fairness

● Group Fairness
Partitions a population into groups defined by protected attributes (such as gender, caste, or religion) and requires some statistical measure to be equal across groups.

● Individual Fairness
Similar individuals should be treated similarly.
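To make group fairness concrete, here is a minimal sketch of one possible statistical measure: the gap in positive-prediction rate between groups, often called demographic parity difference. The slides do not name a specific measure, so this choice, the function names, and the toy data are all assumptions made for illustration.

```python
from collections import defaultdict

def positive_rate_by_group(predictions, groups):
    """Return the share of positive (1) predictions for each group."""
    totals = defaultdict(int)
    positives = defaultdict(int)
    for pred, group in zip(predictions, groups):
        totals[group] += 1
        positives[group] += pred
    return {g: positives[g] / totals[g] for g in totals}

def demographic_parity_difference(predictions, groups):
    """Largest gap in positive-prediction rate between any two groups."""
    rates = positive_rate_by_group(predictions, groups)
    return max(rates.values()) - min(rates.values())

# Hypothetical toy data: 1 = loan approved, 0 = rejected.
predictions = [1, 1, 1, 0, 1, 0, 0, 0]
groups = ["A", "A", "A", "A", "B", "B", "B", "B"]

rates = positive_rate_by_group(predictions, groups)
gap = demographic_parity_difference(predictions, groups)
print(rates)  # group A is approved 75% of the time, group B 25%
print(gap)    # 0.5
```

A gap of 0 would mean both groups receive positive predictions at the same rate. How large a gap is acceptable is a policy decision that the metric does not settle.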
ML Unfairness - Causes (Data)

● Skewed sample
● Tainted examples
● Limited features
● Sample size disparity
● Proxies
Difficulties in ensuring ML Algorithm is Fair
Interpretable Machine Learning
IML Benefits

● Fairness: Ensuring that predictions are unbiased and do not implicitly or explicitly discriminate against protected groups. An interpretable model can tell you why it has decided that a certain person should not get a loan, and it becomes easier for a human to judge whether the decision is based on a learned demographic (e.g. racial) bias.
● Privacy: Ensuring that sensitive information in the data is protected.
● Reliability or Robustness: Ensuring that small changes in the input do not lead to large changes in the prediction.
● Trust: It is easier for humans to trust a system that explains its decisions than a black box.
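The robustness point above can be checked empirically. The sketch below nudges each input feature by a small amount and records the largest change in the prediction. The helper name `max_prediction_shift` and the linear scoring function are hypothetical stand-ins for a real model, not part of any library.

```python
def max_prediction_shift(predict, x, epsilon=0.01):
    """Perturb each feature by +/- epsilon; return the largest prediction change."""
    baseline = predict(x)
    worst = 0.0
    for i in range(len(x)):
        for delta in (-epsilon, epsilon):
            perturbed = list(x)
            perturbed[i] += delta
            worst = max(worst, abs(predict(perturbed) - baseline))
    return worst

# Hypothetical scoring model: a fixed linear score over three features.
def linear_score(x):
    weights = [0.5, -2.0, 1.0]
    return sum(w * v for w, v in zip(weights, x))

shift = max_prediction_shift(linear_score, [1.0, 2.0, 3.0], epsilon=0.01)
print(shift)  # roughly 0.02, driven by the feature with weight -2.0
```

For a linear model the shift is bounded by the largest absolute weight times epsilon. For a black-box model, a check like this only samples the neighbourhood of one input and does not prove robustness.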
IML Architecture
Preferred Explaining - Model Interpretation
Way to go….
Explainability and Fairness - Just one `pip` away
● lime - https://github.com/marcotcr/lime
● shap - https://github.com/slundberg/shap
    import shap
    # model: an already-fitted tree-based model; X: its feature matrix
    explainer = shap.TreeExplainer(model)
    shap_values = explainer.shap_values(X)
● eli5 - https://github.com/TeamHG-Memex/eli5
● scikit-lego - https://github.com/koaning/scikit-lego
    from sklego.preprocessing import InformationFilter
    from sklego.linear_model import FairClassifier
● What-if Tool - https://pair-code.github.io/what-if-tool/
● Captum - https://github.com/pytorch/captum
Is only the organization holding the protected data responsible for ensuring digital fairness?
Of course not.

Governments also need to introduce proper data regulations to restrict the use of personal and protected data.
Do we have any data regulation in the world?

Yes, the GDPR in the European Union.

What changes did the GDPR enforce?
GDPR in Action
Automation of Machine Learning (AutoML)
Team Data Science Process lifecycle

The Team Data Science Process (TDSP) is an agile, iterative data science methodology to deliver predictive analytics solutions and intelligent applications efficiently.
Roles & Responsibilities associated with Lifecycle
Automated Machine Learning (AutoML)

What Wikipedia says…

● AutoML is the process of automating, end to end, the application of machine learning to real-world problems.

● In a typical machine learning application, practitioners must carry out:
    ○ Data pre-processing
    ○ Feature engineering
    ○ Feature extraction
    ○ Feature selection
    ○ Algorithm selection
    ○ Hyperparameter optimization
    ○ Validation

● As many of these steps are often beyond the abilities of non-experts, AutoML was proposed as an artificial intelligence-based solution to the ever-growing challenge of applying machine learning.
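As a rough illustration of what automating algorithm selection, hyperparameter optimization, and validation can look like, the sketch below tries every algorithm and setting in a small search space and keeps the one with the lowest hold-out error. The two toy models, the `SEARCH_SPACE` layout, and the synthetic data are assumptions chosen to keep the example dependency-free. Real AutoML systems search far larger spaces with smarter strategies.

```python
import random

def mean_model(train, _params):
    """Baseline: always predict the mean training target."""
    avg = sum(y for _, y in train) / len(train)
    return lambda x: avg

def knn_model(train, params):
    """k-nearest-neighbours regression on a single numeric feature."""
    k = params["k"]
    def predict(x):
        nearest = sorted(train, key=lambda row: abs(row[0] - x))[:k]
        return sum(y for _, y in nearest) / k
    return predict

def mse(model, rows):
    """Mean squared error of a model over (x, y) rows."""
    return sum((model(x) - y) ** 2 for x, y in rows) / len(rows)

# Hypothetical search space: algorithm name -> (builder, hyperparameter settings).
SEARCH_SPACE = {
    "mean": (mean_model, [{}]),
    "knn": (knn_model, [{"k": 1}, {"k": 3}, {"k": 5}]),
}

def auto_select(train, valid):
    """Try every algorithm/hyperparameter pair; keep the lowest validation error."""
    best = None
    for name, (builder, settings) in SEARCH_SPACE.items():
        for params in settings:
            score = mse(builder(train, params), valid)
            if best is None or score < best[0]:
                best = (score, name, params)
    return best

# Synthetic data: y is roughly 2 * x plus a little noise.
rng = random.Random(0)
data = [(x, 2.0 * x + rng.uniform(-0.5, 0.5)) for x in range(40)]
rng.shuffle(data)
train, valid = data[:30], data[30:]

best_score, best_name, best_params = auto_select(train, valid)
print(best_name, best_params, round(best_score, 3))
```

Scoring on a held-out validation split rather than on the training data is what keeps the search from simply rewarding the most flexible model.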
Targets of AutoML

1) Automated data preparation and ingestion (from raw data and miscellaneous formats)
    ● Automated column type detection; e.g., boolean, continuous, or text
    ● Automated column intent detection; e.g., target/label
    ● Automated task detection; e.g., binary classification, regression, clustering
2) Automated feature engineering
    ● Feature selection
    ● Feature extraction
    ● Detection and handling of skewed data and/or missing values
3) Automated model selection
4) Hyperparameter optimization of the learning algorithm and featurization
5) Automated selection of evaluation metrics / validation procedures
6) Automated analysis of results obtained
7) User interfaces and visualizations for automated machine learning
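The first target, automated column type and task detection, can be sketched with a few simple rules. The heuristics below (the set of boolean tokens, treating anything parseable as a float as continuous) and the function names are assumptions for illustration. Production systems use much richer inference.

```python
def detect_column_type(values):
    """Guess a column type from raw string values: boolean, continuous, or text."""
    cleaned = [v.strip().lower() for v in values if v.strip() != ""]
    if not cleaned:
        return "unknown"
    if all(v in {"true", "false", "yes", "no", "0", "1"} for v in cleaned):
        return "boolean"
    try:
        for v in cleaned:
            float(v)
        return "continuous"
    except ValueError:
        return "text"

def detect_task(target_type):
    """Map the detected type of the target column to a learning task."""
    tasks = {"boolean": "binary classification", "continuous": "regression"}
    return tasks.get(target_type, "unsupported")

# Hypothetical raw table, as strings read from a CSV; "" marks a missing value.
columns = {
    "age": ["34", "51", "", "27"],
    "subscribed": ["yes", "no", "yes", "no"],
    "comment": ["great service", "too slow", "ok", ""],
}

detected = {name: detect_column_type(vals) for name, vals in columns.items()}
print(detected)
print(detect_task(detected["subscribed"]))  # binary classification
```

Missing values are skipped before the type is guessed, so a numeric column with gaps is still detected as continuous.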

Advantages of AutoML

● Increases productivity by automating repetitive tasks. This enables a data scientist to focus more on the problem than on the models.
● Automating the ML pipeline also helps to avoid errors that might creep in when steps are done manually.
● Ultimately, AutoML is a step towards democratizing machine learning by making the power of ML accessible to everybody.
AutoML Frameworks

MLBox

MLBox is a powerful automated machine learning Python library.

According to the official documentation, this library provides the following features:

● Fast reading and distributed data preprocessing/cleaning/formatting
● Highly robust feature selection, leak detection, and accurate hyperparameter optimization
● State-of-the-art predictive models for classification and regression (Deep Learning, Stacking, LightGBM, …)
● Prediction with model interpretation
● It has already been tested on Kaggle and performs well.

Compatibilities:

● Operating systems: Linux, macOS and Windows
● Python versions: 3.5 to 3.7, 64-bit only (32-bit Python is not supported)
Auto-Sklearn

● Auto-Sklearn is an automated machine learning package built on top of Scikit-learn.
● Auto-sklearn frees a machine learning user from algorithm selection and hyperparameter tuning.
● It includes feature engineering methods such as one-hot encoding, numeric feature standardization, PCA, and more.
● Auto-sklearn performs well on small and medium-sized datasets, but it cannot be applied to modern deep learning systems that yield state-of-the-art performance on large datasets.
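Two of the feature engineering methods mentioned above, one-hot encoding and numeric standardization, are simple enough to sketch by hand. This is not Auto-sklearn's implementation. The function names and sample values are hypothetical, and the standardization assumes the population standard deviation.

```python
from statistics import mean, pstdev

def one_hot_encode(values):
    """Return the sorted categories and one 0/1 indicator row per value."""
    categories = sorted(set(values))
    rows = [[1 if v == c else 0 for c in categories] for v in values]
    return categories, rows

def standardize(values):
    """Rescale values to zero mean and unit (population) standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]  # constant column: nothing to scale
    return [(v - mu) / sigma for v in values]

colors = ["red", "green", "red", "blue"]
categories, encoded = one_hot_encode(colors)
print(categories)  # ['blue', 'green', 'red']
print(encoded)     # [[0, 0, 1], [0, 1, 0], [0, 0, 1], [1, 0, 0]]

heights = [150.0, 160.0, 170.0]
scaled = standardize(heights)
print(scaled)      # symmetric around 0.0
```

One-hot encoding avoids implying an order between categories, while standardization puts numeric features on a comparable scale before models that are sensitive to feature magnitude.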

Compatibilities:

● Operating systems: Linux
● Python (>=3.5)
● C++ compiler (with C++11 support)
● SWIG (version 3.0 or later)
Tree-Based Pipeline Optimization Tool (TPOT)

● TPOT is a Python automated machine learning tool that optimizes machine learning pipelines using genetic programming.
● TPOT is built on top of scikit-learn and extends it with its own regressor and classifier methods, so all of the code it generates should look familiar if you're familiar with scikit-learn.
● TPOT works by exploring thousands of possible pipelines and finding the best one for your data, so it can take a while to run on large datasets.
● TPOT cannot automatically process natural language inputs. It also cannot process categorical strings, which must be integer-encoded before being passed in as data.
● TPOT is built on top of several existing Python libraries, including NumPy, SciPy, scikit-learn, DEAP, update_checker, tqdm, stopit, pandas, and joblib.
● The TPOT documentation also strongly recommends Python 3 over Python 2 if you're given the choice.
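Since categorical strings must be integer-encoded before being handed to TPOT, a small preprocessing step is needed. The sketch below shows one simple way to do it in plain Python. The helper names and the sample column are hypothetical, and in practice an existing encoder from a library such as scikit-learn or pandas would normally be used instead.

```python
def fit_integer_encoder(values):
    """Map each distinct string to an integer, assigned in sorted order."""
    return {category: index for index, category in enumerate(sorted(set(values)))}

def transform(values, mapping):
    """Replace each string with its integer code (raises KeyError if unseen)."""
    return [mapping[v] for v in values]

# Hypothetical categorical column.
cities = ["Chennai", "Mumbai", "Delhi", "Chennai"]

mapping = fit_integer_encoder(cities)
encoded = transform(cities, mapping)
print(mapping)  # {'Chennai': 0, 'Delhi': 1, 'Mumbai': 2}
print(encoded)  # [0, 2, 1, 0]
```

The mapping should be fitted on the training data only and reused for new data. Integer codes imply an ordering that tree-based models usually tolerate, but linear models may misread.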
H2O AutoML

● H2O is a fully open source, distributed in-memory machine learning platform from the company H2O.ai.
● With support for both R and Python, H2O supports the most widely used statistical and machine learning algorithms, including gradient boosted machines, generalized linear models, deep learning models, and more.
● H2O includes an automatic machine learning module that uses its own algorithms to build a pipeline. It performs an exhaustive search over its feature engineering methods and model hyperparameters to optimize its pipelines.
● H2O automates some of the most difficult data science and machine learning workflows, such as feature engineering, model validation, model tuning, model selection and model deployment.
● In addition to this, it also offers automatic visualizations and machine learning interpretability (MLI).
If feature engineering, model building and model fine-tuning are all automated, then what is left for the data science expert?
“The need of the hour today is marrying academic
elegance with business domain knowledge. It is
the time for bilingual people who speak the
business lingo and have sound data science
concepts”

Great Learning’s Dr PK Vishawanathan in Cypher 2019

Inspiration
● https://in.pycon.org/cfp/2019/proposals/machine-learning-bias~e1Aje/
● https://towardsdatascience.com/interpretable-machine-learning-1dec0f2f3e6b
● https://heartbeat.fritz.ai/automl-the-next-wave-of-machine-learning-5494baac615f
● https://arxiv.org/pdf/1808.06492v1.pdf
● https://docs.microsoft.com/en-us/azure/machine-learning/team-data-science-process/overview
● https://automl.github.io/auto-sklearn/master/#
● https://towardsdatascience.com/a-tutorial-on-fairness-in-machine-learning-3ff8ba1040cb
Thank You

By
Manikandan
Gmail - [email protected]
LinkedIn - www.linkedin.com/in/manikandan1191
