Machine Learning - course
Things that computers couldn't do before, they can now do with machine learning. Put simply, the goal of machine learning is to make machines act more like humans, based on data and algorithms.
data science - overlaps with ML; analyzing data and doing something useful with it
machine learning facilitates decision making in business - machines are better at making business decisions and predictions based on massive amounts of data
Unsupervised -- clustering (we, or rather the machines, have to create the groups based on the similarity of the data), association rule learning (associating different things to make predictions, based on relationships between variables in the data) --> data without labels, e.g. a CSV without a label column
Reinforcement --> teaching machines through trial and error; e.g. a machine learns a game by playing it millions of times until it figures out how to win
ML - uses algorithms to learn patterns in data, then uses those patterns to predict something in the future
An ML algorithm starts with an input and an ideal output; it looks at the input and the output and tries to figure out the set of instructions in between the two
find patterns in data so that we can use them in the future to predict something
////////////////////////////////////////////////////////////////////////////////
1. Create a framework
2. Match the framework to viable ML/DS tools
3. Learn by doing
Supervised learning is called supervised because you have data and labels. A
machine learning algorithm tries to learn what patterns in the data lead to the
labels.
Unsupervised learning is when you have data but no labels. The data could be the
purchase history of your online video game store customers. Using this data, you
may want to group similar customers together so you can offer them specialised
deals. You could use a machine learning algorithm to group your customers by
purchase history.
After inspecting the groups, you provide the labels. There may be a group
interested in computer games, another group who prefer console games and another
which only buy discounted older games. This is called clustering.
What’s important to remember here is the algorithm did not provide these labels. It
found the patterns between similar customers and, using your domain knowledge, you provided the labels.
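A minimal sketch of this clustering idea, assuming scikit-learn's KMeans; the spending columns and customer rows are made up for illustration, they're not from the course:

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical purchase history: one row per customer, no labels provided.
# Columns: spend on console games, spend on computer games, spend on discounted older games.
purchase_history = np.array([
    [250.0,  10.0,   5.0],
    [300.0,   0.0,  20.0],
    [  5.0, 220.0,  15.0],
    [  0.0, 180.0,  30.0],
    [  2.0,   5.0,  90.0],
    [  0.0,  10.0, 120.0],
])

# Ask for 3 groups; the algorithm only returns group numbers (0, 1, 2),
# it doesn't know what the groups mean - you supply the human-readable labels.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
group_ids = kmeans.fit_predict(purchase_history)
print(group_ids)
```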
Transfer learning is when you take the information an existing machine learning
model has learned and adjust it to your own problem.
https://www.mrdbourke.com/a-6-step-field-guide-for-building-machine-learning-projects/
Unstructured data — Anything not immediately able to be put into row and column
format, images, audio files, natural language text.
Static data — Existing historical data which is unlikely to change. Your company's customer purchase history is a good example.
Streaming data — Data which is constantly updated, older records may be changed, newer records are constantly being added, e.g. news headlines
Supervised/Unsupervised/Transfer/Reinforcement Learning
data + labels - the ML algorithm uses the data to learn to match the labels
classification - predicting whether something is one thing or another; with more than two options it's multi-class classification
regression - trying to predict a number
data - no labels
purchase history of your customers: find patterns in the data and match similar things together
a clustering algorithm matches similar data based on patterns, and based on that you label the groups
reinforcement - through trial and error, until it wins, for example in a game
what features should you use? - only ones where the samples have similar information
Feature coverage - how many samples have the different features? Ideally, every sample has the same features
Modelling - 3 sets
based on our problem and data, which ML model should we use?
The most important concept in ML: the training, validation and test sets (3 splits - 1 for training, 1 for validation and tuning, 1 for testing and comparing)
Course materials (training set) -> Practice exam (validation set) -> Final exam
(test set)
Generalization - the ability of an ML model to perform well on data it hasn't seen before
we want to avoid memorization, i.e. evaluating our model on data it has already seen, so we split the data:
70-80% training split, 10-15% validation split (where we check results, tune and improve the model), 10-15% test split
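A minimal sketch of such a 70/15/15 split, assuming scikit-learn's train_test_split; X and y here are placeholder arrays, not course data:

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.random.rand(1000, 5)              # hypothetical features
y = np.random.randint(0, 2, size=1000)   # hypothetical labels

# First carve off 30% for validation + test, then split that 30% in half.
X_train, X_rest, y_train, y_rest = train_test_split(X, y, test_size=0.3, random_state=42)
X_valid, X_test, y_valid, y_test = train_test_split(X_rest, y_rest, test_size=0.5, random_state=42)

print(len(X_train), len(X_valid), len(X_test))  # 700 150 150
```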
Structured data ---> CatBoost, XGBoost, gradient boosting, random forests tend to work best
Unstructured data ---> deep learning, transfer learning, neural networks work best
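A minimal sketch of fitting one of these ensemble models (a random forest) on structured/tabular data; the data is synthetic just so the example runs:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X = np.random.rand(500, 8)                 # 500 rows, 8 tabular features
y = (X[:, 0] + X[:, 1] > 1.0).astype(int)  # made-up binary label

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)
print(model.score(X_test, y_test))  # mean accuracy on the test split
```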
tuning can happen on both the training set (if there's no validation set) and the validation set
ML models have hyperparameters we can adjust
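A minimal sketch of tuning one hyperparameter (n_estimators of a random forest) on a validation split - the general idea, not the course's exact workflow:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X = np.random.rand(600, 6)       # synthetic features
y = (X[:, 0] > 0.5).astype(int)  # synthetic labels

X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=0.2, random_state=42)

# Try a few settings, compare them on the validation split (never on the test split).
for n_estimators in [10, 50, 100]:
    model = RandomForestClassifier(n_estimators=n_estimators, random_state=42)
    model.fit(X_train, y_train)
    print(n_estimators, model.score(X_valid, y_valid))
```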
a model's first results aren't its last
improve the model
training accuracy 98%, test accuracy 86% --> that's OK
why under/overfitting?
data leakage - test data leaks into training data
testing always stays the same - test set!!!
data mismatch - when the data we test on is different from the data we train on, e.g. different features in the test and training data
overfitting - possible fixes:
collect more data
try a less advanced model
when comparing 2 models, we compare apples to apples and oranges to oranges - the same data for both
is the extra accuracy worth the longer training time? etc.
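A minimal sketch of an apples-to-apples comparison - two models, the same split, checking both accuracy and training time (all data here is synthetic, and the two model choices are just examples):

```python
import time
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X = np.random.rand(2000, 10)
y = (X[:, 0] + X[:, 1] > 1.0).astype(int)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

for model in [LogisticRegression(max_iter=1000), RandomForestClassifier(random_state=42)]:
    start = time.time()
    model.fit(X_train, y_train)        # same training data for both models
    elapsed = time.time() - start
    acc = model.score(X_test, y_test)  # same test data for both models
    print(type(model).__name__, f"accuracy={acc:.3f}", f"fit_time={elapsed:.2f}s")
```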
WHAT IS CONDA?
ANACONDA/MINICONDA/CONDA
HARDWARE STORE/WORKBENCH/PERSONAL ASSISTANT
ANACONDA AND MINICONDA ARE SOFTWARE DISTRIBUTIONS, THEY COME WITH CODE OTHER PEOPLE
HAVE WRITTEN, PACKAGES
ANACONDA ~ 3GB - it installs all of the major data science packages, many of which don't get used
MINICONDA - the most useful packages, minimum requirements ~ 200MB
CONDA - comes with Miniconda; it can help me set up my tools and my workbench etc. Conda is a package manager - it helps me download, install and manage packages, and it has the ability to create environments
environment ---- collection of tools and packages we might want to use in our
project
conda is what we use to create the environment and install and update the tools we
wanna use
conda allows us to share the same tools and packages with others - we can either send the whole project folder with the environment inside, or share a .yml file.
A .yml is basically a text file with instructions that tell Conda how to set up an environment.
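A minimal sketch of what such a .yml file can look like (the environment name and package list are illustrative, not from the course):

```yaml
name: my_ml_env
channels:
  - conda-forge
dependencies:
  - python=3.10
  - numpy
  - pandas
  - scikit-learn
  - jupyter
```

Conda can build an environment from it with "conda env create -f environment.yml", and an existing environment can be exported to a file with "conda env export > environment.yml".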
Jupyter notebook shortcuts: press Esc, then m turns a cell into markdown, y turns it into code; Enter goes back into the cell (edit mode)
when we exit a notebook and open it again, the kernel state is lost, so the cells have to be executed again
AI - area of CS, where the goal is to enable computers to act like humans & perform
human-like tasks
SL - uses labeled inputs (the input has a corresponding output label) to train
models and learn outputs
UL - uses unlabeled data to learn about patterns in the data and group similar data together (e.g. clustering)
RL - an agent learning in an interactive environment based on rewards and penalties
ordinal data (has an inherent order), e.g. bad, not so good, mediocre, good, great - we can give each a number 1 to 5, with 1 = bad and 5 = great
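A minimal sketch of encoding ordinal categories as numbers, assuming pandas; the column name "rating" is a made-up example:

```python
import pandas as pd

df = pd.DataFrame({"rating": ["bad", "good", "great", "mediocre", "not so good"]})

# We define the order ourselves, because we know bad < not so good < mediocre < good < great.
order = {"bad": 1, "not so good": 2, "mediocre": 3, "good": 4, "great": 5}
df["rating_encoded"] = df["rating"].map(order)
print(df)
```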
TYPES OF PREDICTIONS
SL:
classification - predict discrete classes (this is a dog, this is a cat etc.)
multi-class classification cat/dog/lizard etc.
binary - hot dog, not hot dog, cat/dog, positive/negative