Military AI-Week 02-Key Concept Machine Learning
Slide credit: Ray Mooney
History of Machine Learning (cont.)
• 1980s:
– Advanced decision tree and rule learning
– Explanation-based Learning (EBL)
– Learning and planning and problem solving
– Utility problem
– Analogy
– Cognitive architectures
– Resurgence of neural networks (connectionism, backpropagation)
– Valiant’s PAC Learning Theory
– Focus on experimental methodology
• 1990s
– Data mining
– Adaptive software agents and web applications
– Text learning
– Reinforcement learning (RL)
– Inductive Logic Programming (ILP)
– Ensembles: Bagging, Boosting, and Stacking
– Bayes Net learning
Slide credit: Ray Mooney
History of Machine Learning (cont.)
• 2000s
– Support vector machines & kernel methods
– Graphical models
– Statistical relational learning
– Transfer learning
– Sequence labeling
– Collective classification and structured outputs
– Computer Systems Applications (Compilers, Debugging, Graphics, Security)
– E-mail management
– Personalized assistants that learn
– Learning in robotics and vision
• 2010s
– Deep learning systems
– Learning for big data
– Bayesian methods
– Multi-task & lifelong learning
– Applications to vision, speech, social networks, learning to read, etc.
– ???
Based on slide by Ray Mooney
Machine Learning as a Scientific Field
A scientific field that:
⚫ researches the fundamental principles of learning
Traditional Programming: Data + Program → Computer → Output
Machine Learning: Data + Output → Computer → Program
Slide credit: Pedro Domingos
When Do We Use Machine Learning?
ML is used when:
• Human expertise does not exist (navigating on Mars)
• Humans can’t explain their expertise (speech recognition)
• Models must be customized (personalized medicine)
• Models are based on huge amounts of data (genomics)
Slide credit: Geoffrey Hinton
Some more examples of tasks that are best solved by using a learning algorithm
• Recognizing patterns:
– Facial identities or facial expressions
– Handwritten or spoken words
– Medical images
• Generating patterns:
– Generating images or motion sequences
• Recognizing anomalies:
– Unusual credit card transactions
– Unusual patterns of sensor readings in a nuclear power plant
• Prediction:
– Future stock prices or currency exchange rates
Slide credit: Geoffrey Hinton
Sample Applications
• Web search
• Computational biology
• Finance
• E-commerce
• Space exploration
• Robotics
• Information extraction
• Social networks
• Debugging software
• [Your favorite area]
Slide credit: Pedro Domingos
Samuel’s Checkers-Player
“Machine Learning: Field of study that gives computers the ability to learn without being explicitly programmed.” – Arthur Samuel (1959)
Defining the Learning Task
Improve on task T, with respect to
performance metric P, based on experience E
T: Playing checkers
P: Percentage of games won against an arbitrary opponent
E: Playing practice games against itself
Autonomous Cars
Autonomous Car Technology
[Figure: Stanley, the Stanford autonomous car developed under Sebastian Thrun, annotated with components such as path planning]
Types of Learning
Based on slide by Pedro Domingos
Machine Learning Types and Their Applications
Supervised Learning – Regression Questions
• Regression: discovers the dependency between attribute values of samples in a dataset by expressing the sample mapping as a function with a continuous output.
• How much will I profit from the stock next week?
• What will the temperature be on Tuesday?
Supervised Learning: Regression
• Given (x1, y1), (x2, y2), ..., (xn, yn)
• Learn a function f(x) to predict y given x
– y is real-valued == regression
[Figure: regression example – September Arctic sea ice extent (1,000,000 sq km) vs. year, 1970–2020, showing a downward trend]
Data from G. Witt. Journal of Statistics Education, Volume 21, Number 1 (2013)
Supervised Learning: Classification
Data features (Feature 1 ... Feature n) and a label (Goal) are fed to a supervised learning algorithm.

Weather | Temperature | Wind Speed | Enjoy Sports
Sunny   | Warm        | Strong     | Yes
Rainy   | Cold        | Fair       | No
Sunny   | Cold        | Weak       | Yes
Supervised Learning: Classification
• Given (x1, y1), (x2, y2), ..., (xn, yn)
• Learn a function f(x) to predict y given x
– y is categorical == classification
[Figure: breast cancer example – tumor size on the x-axis, label on the y-axis (1 = malignant, 0 = benign)]
Based on example by Andrew Ng
Supervised Learning: Classification (cont.)
[Figure: tumor size vs. label with a decision threshold separating "predict benign" from "predict malignant"; a second plot of age vs. tumor size; candidate features include clump thickness, uniformity of cell size, and uniformity of cell shape]
Based on example by Andrew Ng
Unsupervised Learning – Clustering Questions
• Clustering: groups samples in a dataset into several categories based on a clustering model, such that samples in the same category are highly similar.
• Which audiences like to watch movies of the same subject?
• Which of these components are damaged in a similar way?
Unsupervised Learning
• Given x1, x2, ..., xn (without labels)
• Output hidden structure behind the x’s
– E.g., clustering
Unsupervised Learning
Genomics application: group individuals by genetic similarity
[Figure: heat map of genetic data – individuals grouped by similarity across genes]
[Source: Daphne Koller]
Unsupervised Learning
• Independent component analysis – separate a
combined signal into its original sources
Image credit: statsoft.com Audio from http://www.ism.ac.jp/~shiro/research/blindsep.html
Semi-Supervised Learning
Data features (Feature 1 ... Feature n) with some labels unknown are fed to a semi-supervised learning algorithm.

Weather | Temperature | Wind Speed | Enjoy Sports
Sunny   | Warm        | Strong     | Yes
Rainy   | Cold        | Fair       | /
Sunny   | Cold        | Weak       | /
Reinforcement Learning
• Given a sequence of states and actions with
(delayed) rewards, output a policy
– Policy is a mapping from states → actions that
tells you what to do in a given state
• Examples:
– Credit assignment problem
– Game playing
– Robot in a maze
– Balance a pole on your hand
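The maze example above can be made concrete with tabular Q-learning on a toy corridor; the layout, reward values, and hyperparameters below are invented for illustration and are not from the slides.

```python
import random

# Toy corridor "maze": states 0..4, start at 0, goal at 4 (reward 1, episode ends).
N_STATES, GOAL = 5, 4
ACTIONS = (-1, 1)                      # move left / move right
alpha, gamma, epsilon = 0.5, 0.9, 0.1  # learning rate, discount, exploration
Q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}

random.seed(0)
for _ in range(500):                   # training episodes
    s = 0
    while s != GOAL:
        # Epsilon-greedy action selection.
        if random.random() < epsilon:
            a = random.choice(ACTIONS)
        else:
            a = max(ACTIONS, key=lambda act: Q[(s, act)])
        s2 = min(max(s + a, 0), N_STATES - 1)   # clip at the walls
        r = 1.0 if s2 == GOAL else 0.0          # delayed reward only at the goal
        # Q-learning update: move Q toward reward plus discounted best next value.
        Q[(s, a)] += alpha * (r + gamma * max(Q[(s2, b)] for b in ACTIONS) - Q[(s, a)])
        s = s2

# The learned policy is a mapping from states to actions, as the slide defines.
policy = {s: max(ACTIONS, key=lambda act: Q[(s, act)]) for s in range(N_STATES)}
print(policy)
```

After training, every non-goal state maps to +1 (move right), which is the optimal policy for this corridor.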
The Agent-Environment Interface
At each time step t, the agent observes state s_t and takes action a_t; the environment returns reward r_{t+1} and next state s_{t+1}, and the loop repeats: s_t, a_t → r_{t+1}, s_{t+1}, a_{t+1} → r_{t+2}, s_{t+2}, ...
Slide credit: Sutton & Barto
Reinforcement Learning
https://www.youtube.com/watch?v=4cgWya-wjgY
Inverse Reinforcement Learning
• Learn policy from user demonstrations
• Classification aims to assign each input to one of a finite number of categories (target values are discrete).
• Regression aims to assign each input to a value in a continuous set of possible target values.
• Probability estimation is a special case of regression where target values range between 0 and 1 and represent probabilities.
• Clustering aims to discover groups of similar examples within the input space.
• Density estimation aims to determine the distribution of data within the input space.
• Projection/dimensionality reduction aims to obtain a representation of the data in a dimension different from (typically lower than) its original dimension.
• Credit assignment aims to determine a way to reward (or punish) every action the algorithm takes so that, at the end of the action sequence, it arrives at the best/correct answer.
Machine Learning Process
Data collection → Data cleansing → Feature extraction and selection → Model training → Model evaluation → Model deployment and integration
Data Preprocessing
• Data cleansing: fill in missing values, and detect and eliminate causes of dataset exceptions.
• Data normalization: normalize data to reduce noise and improve model accuracy.
• Data dimension reduction: simplify data attributes to avoid dimension explosion.
Data Cleansing
Most machine learning models process features, which are usually numeric representations of input variables that can be used in the model.
In most cases, the collected data can be used by algorithms only after being preprocessed. The preprocessing operations include the following:
• Data filtering
• Processing of missing data
• Processing of possible exceptions, errors, or abnormal values
• Combination of data from multiple data sources
• Data consolidation
Dirty Data (1)
• Generally, real data may have some quality
problems.
• Incompleteness: contains missing values or lacks attributes.
• Noise: contains incorrect records or exceptions.
• Inconsistency: contains inconsistent records.
Dirty Data (2)
[Table: example records with columns #, Id, Name, Birthday, Gender, IsTeacher, #Students, Country, City, illustrating data quality problems]
Data Conversion
After being preprocessed, the data needs to be converted into a representation form suitable for the machine learning model. Common data conversion forms include the following:
• For classification, categorical data is encoded into a corresponding numerical representation.
• Numeric data is converted to categorical data to reduce the number of distinct values (e.g., age segmentation).
• Other data:
– For text, words are converted into word vectors through word embeddings (commonly word2vec, BERT, etc.).
– For images, processing includes color space, grayscale, geometric changes, Haar features, and image enhancement.
• Feature engineering:
– Normalize features to ensure the same value ranges for input variables of the same model.
– Feature expansion: combine or convert existing variables to generate new features, such as an average.
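The feature-normalization step above can be sketched as follows; this is a minimal NumPy example with invented sample values, not code from the course.

```python
import numpy as np

# Invented feature column whose raw scale differs from other inputs.
x = np.array([20.0, 30.0, 40.0, 50.0])

# Min-max normalization rescales the feature to the [0, 1] range.
x_minmax = (x - x.min()) / (x.max() - x.min())

# Z-score standardization gives the feature zero mean and unit variance.
x_std = (x - x.mean()) / x.std()

print(x_minmax)                    # values between 0 and 1
print(x_std.mean(), x_std.std())   # approximately 0 and 1
```

Either transform ensures input variables of the same model share comparable value ranges, which is the goal stated in the bullet above.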
Machine Learning Model
[Figure: taxonomy of machine learning algorithms, including GBDT, KNN, and Naive Bayes]
Linear Regression (1)
⚫ Linear regression: a statistical analysis method that determines the quantitative relationship between two or more variables through regression analysis.
⚫ Linear regression is a type of supervised learning. The model is:

h_w(x) = wᵀx + b

⚫ The relationship between the value predicted by the model and the actual value is as follows, where y indicates the actual value and ε indicates the error:

y = wᵀx + b + ε

⚫ The error ε is influenced by many independent factors, so according to the central limit theorem, ε follows a normal distribution. From the normal distribution function and maximum likelihood estimation, the loss function of linear regression over the m training samples is:

J(w) = (1/2m) Σ (h_w(x) − y)²

⚫ To make the predicted values close to the actual values, we minimize the loss. We can use gradient descent to find the weight parameters w that minimize the loss function, and then complete model building.
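The gradient-descent procedure described above can be sketched as follows; the synthetic one-feature data, learning rate, and iteration count are illustrative choices, not values from the slides.

```python
import numpy as np

# Synthetic data generated from y = 2x + 1 plus small Gaussian noise (the error ε).
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 100)
y = 2 * x + 1 + rng.normal(0, 0.1, 100)

w, b = 0.0, 0.0
lr = 0.01
for _ in range(2000):
    err = (w * x + b) - y          # h_w(x) - y for every sample
    # Gradients of J(w) = (1/2m) * sum((h_w(x) - y)^2) w.r.t. w and b.
    w -= lr * (err * x).mean()
    b -= lr * err.mean()

print(w, b)   # should approach the true slope 2 and intercept 1
```

Each iteration steps the parameters against the gradient of J(w), so the loss shrinks until the fitted line matches the data-generating line.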
Discussion Topic
The Essence of ML
Traditional Programming: Data + Program → Computer → Output
Machine Learning: Data + Output → Computer → Program
ML Analogy
Seeds (Algorithm) + Nutrients (Data) + Gardener (You) → Plants (Program)
General Approach in ML
https://towardsdatascience.com/4-machine-learning-approaches-that-every-data-scientist-should-know-e3a9350ec0b9
Review Types of ML
Nearest Neighbor
[Figure: scatter plot in the humidity–pressure plane; points labeled Rain, No Rain, and ? (unknown)]
k-Nearest-Neighbor (k-NN) Classification
k-Nearest-Neighbor is a classification method that assigns an input to the most common class (the majority vote) among the k data points closest to it.
[Figure: k = 3 example on the humidity–pressure plot – of the unknown point's three nearest neighbors, one belongs to one class and two to the other, so the majority class wins]
Defining Distance in k-NN
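The distance definitions on this slide are an image that did not survive extraction; a common choice, assumed here rather than taken from the slide, is Euclidean (L2) or Manhattan (L1) distance:

```python
import math

def euclidean(p, q):
    """Straight-line (L2) distance between two feature vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def manhattan(p, q):
    """City-block (L1) distance between two feature vectors."""
    return sum(abs(a - b) for a, b in zip(p, q))

# e.g., two (humidity, pressure) feature vectors
print(euclidean((0, 0), (3, 4)))   # 5.0
print(manhattan((0, 0), (3, 4)))   # 7
```

The choice of distance changes which neighbors count as "closest," so it directly affects k-NN's predictions.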
The Value of K in k-NN
Let’s Code K-NN
Task:
Given some characteristics, determine the fruit class
Let’s Code K-NN (2)
Let’s Code K-NN (3)
Let’s Code K-NN (4)
Let’s Code K-NN (5)
Let’s Code K-NN (6)
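The code on these slides is shown as screenshots that did not survive extraction. A minimal self-contained sketch of the fruit-classification task could look like the following; the feature values and fruit labels are invented for illustration, and plain Python is used in place of whatever library the slides used.

```python
import math
from collections import Counter

# Invented training data: (weight in grams, surface smoothness 0-10) -> fruit class.
train = [
    ((150, 8), "apple"), ((170, 9), "apple"), ((140, 7), "apple"),
    ((120, 2), "lemon"), ((110, 3), "lemon"), ((130, 2), "lemon"),
]

def knn_predict(x, k=3):
    # Sort training points by Euclidean distance to x and keep the k closest.
    nearest = sorted(train, key=lambda t: math.dist(x, t[0]))[:k]
    # Majority vote among the k neighbors' labels.
    return Counter(label for _, label in nearest).most_common(1)[0][0]

print(knn_predict((160, 8)))   # apple
print(knn_predict((115, 2)))   # lemon
```

Given some characteristics of an unseen fruit, the classifier returns the majority class of its k nearest training examples, exactly as defined on the k-NN slide.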
Linear Regression
• Link Dataset →
https://drive.google.com/file/d/1_r3huDi9I6Oj0BXfZBGJlqGjAxwDWRGy/
view?usp=sharing
Task:
Given area data, estimate the house price
Let’s Code Regression (2)
Let’s Code Regression (3)
Let’s Code Regression (4)
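The regression walkthrough is likewise screenshots. A minimal sketch of the area-to-price task using an ordinary least-squares fit could be the following; the sample areas and prices are invented, not taken from the linked dataset.

```python
import numpy as np

# Invented sample data: house area (m^2) vs. price, roughly price = 3000 * area.
area  = np.array([50, 70, 90, 110, 130], dtype=float)
price = np.array([155000, 208000, 268000, 330000, 392000], dtype=float)

# Fit price ≈ slope * area + intercept by ordinary least squares.
slope, intercept = np.polyfit(area, price, deg=1)

def estimate(a):
    """Estimated price for a house of area a."""
    return slope * a + intercept

print(slope, intercept)
print(estimate(100))   # estimated price for a 100 m^2 house
```

`np.polyfit` solves the same least-squares objective as the gradient-descent loop shown earlier; for a single feature the closed-form fit is the simpler choice.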
Clustering – K-Means
• K-means clustering aims to partition n observations into k clusters in which each observation belongs to the cluster with the nearest mean, which serves as a prototype of the cluster.
• For the k-means algorithm, specify the final number of clusters (k). Then divide the n data objects into k clusters so that: (1) objects in the same cluster are highly similar; (2) objects in different clusters have low similarity.
• The k-means algorithm picks k centroids and then allocates every data point to the nearest centroid, while keeping the within-cluster distances to the centroids as small as possible.
• The "means" in k-means refers to averaging the data, that is, finding the centroid.
Let’s Code K-Means
Let’s Code K-Means (2)
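A minimal sketch of the k-means loop described above; the points, cluster count, and initial centroids are illustrative assumptions rather than the deck's actual example.

```python
import numpy as np

rng = np.random.default_rng(42)
# Two invented blobs of 2-D points, centered near (0, 0) and (10, 10).
pts = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(10, 1, (50, 2))])

k = 2
centroids = pts[[0, 50]].copy()   # illustrative init: one seed point per region
for _ in range(10):               # Lloyd iterations
    # Assignment step: give each point the label of its nearest centroid.
    dists = np.linalg.norm(pts[:, None, :] - centroids[None, :, :], axis=2)
    labels = dists.argmin(axis=1)
    # Update step: move each centroid to the mean ("means") of its points.
    centroids = np.array([pts[labels == j].mean(axis=0) for j in range(k)])

print(np.sort(centroids[:, 0]))   # one centroid near 0, the other near 10
```

The two steps alternate until the centroids stop moving, satisfying the slide's conditions: points in the same cluster are close to their shared centroid, and the two centroids end up far apart.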
Thank You