Case Studies in Machine Learning and Data Mining
SUPENO MARDI
Class Logistics and Schedule
• 36 sessions
• Software used
– Python 3
– TensorFlow (TF) + Keras
• Final project + presentation
Contents
• AI, machine learning, and data mining terminology
• Learning models from data
• Types of learning tasks
• Defining the learning task
• Machine learning case examples
• Data mining
Terminology
• Synonyms and related fields
– Artificial Intelligence
– Machine Learning
– Data mining
– Pattern recognition
– Probability and Statistics
– Information theory
– Numerical optimization
– Computational complexity theory
– Control theory (adaptive)
Machine Learning, Statistics, and Data Mining
• Differences in terminology:
– Ridge regression = weight-decay
– Fitting = learning
– Held-out data = test data
• The emphasis is very different:
– A good piece of statistics: Clever proof that a relatively
simple estimation procedure is asymptotically unbiased.
– A good piece of machine learning: Demonstration that a
complicated algorithm produces impressive results on a
specific task.
• Data-mining: Using machine learning techniques
on very large databases.
DATA MINING
“Learning” Data
• Learning general models from data of particular examples
• Data is abundant and cheap (data warehouses, data
marts); knowledge is expensive and scarce.
• Example in retail: from customer transactions to consumer
behavior:
People who bought “Da Vinci Code” also bought “The Five People You
Meet in Heaven” (www.amazon.com)
• Build a model that is a good and useful approximation to
the data.
Types of Learning Tasks
• Association
• Supervised learning
– Learn to predict output when given an input vector
• Reinforcement learning
– Learn action to maximize payoff
• Payoff is often delayed
• Exploration vs. exploitation
• Online setting
• Unsupervised learning
– Create an internal representation of the input e.g. form clusters;
extract features
• How do we know if a representation is good?
– Big datasets do not come with labels.
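The exploration vs. exploitation trade-off listed under reinforcement learning can be illustrated with a small epsilon-greedy agent on a multi-armed bandit. This is only a sketch under stated assumptions: the payoff probabilities, the `epsilon` value, and the function name `epsilon_greedy` are hypothetical, and the payoff here is immediate rather than delayed.

```python
import random

def epsilon_greedy(true_means, steps=5000, epsilon=0.1, seed=0):
    """Run an epsilon-greedy agent on a Bernoulli multi-armed bandit."""
    rng = random.Random(seed)
    n_arms = len(true_means)
    counts = [0] * n_arms        # number of pulls per arm
    estimates = [0.0] * n_arms   # running mean payoff per arm
    total = 0.0
    for _ in range(steps):
        if rng.random() < epsilon:
            # Explore: try a random action
            arm = rng.randrange(n_arms)
        else:
            # Exploit: take the action that currently looks best
            arm = max(range(n_arms), key=lambda a: estimates[a])
        reward = 1.0 if rng.random() < true_means[arm] else 0.0
        counts[arm] += 1
        # Incremental update of the running mean for the chosen arm
        estimates[arm] += (reward - estimates[arm]) / counts[arm]
        total += reward
    return estimates, counts, total

# Hypothetical payoff probabilities for three actions
estimates, counts, total = epsilon_greedy([0.2, 0.5, 0.8])
best_arm = max(range(len(estimates)), key=lambda a: estimates[a])
print("best arm:", best_arm, "pulls per arm:", counts)
```

With `epsilon = 0` the agent never explores and can lock onto a poor action; with `epsilon = 1` it never uses what it has learned.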
Learning Associations
• Basket analysis:
P (Y | X ) probability that somebody who buys X also buys Y
where X and Y are products/services.
• ...
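As a minimal sketch of basket analysis, P(Y | X) can be estimated as the fraction of baskets containing X that also contain Y. The transactions below are hypothetical toy data, and `conditional_probability` is an illustrative helper, not part of any library.

```python
def conditional_probability(transactions, x, y):
    """Estimate P(y | x): the share of baskets containing x that also contain y."""
    with_x = [t for t in transactions if x in t]
    if not with_x:
        return 0.0  # x never bought, so there is no evidence to estimate from
    return sum(1 for t in with_x if y in t) / len(with_x)

# Hypothetical toy baskets (each set is one customer transaction)
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]

p_milk_given_bread = conditional_probability(transactions, "bread", "milk")
print(round(p_milk_given_bread, 3))
```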
Face Recognition
Test images
The Role of Learning
Uses of Supervised Learning
• Prediction of future cases: Use the rule to predict the output
for future inputs
• Knowledge extraction: The rule is easy to understand
• Compression: The rule is simpler than the data it explains
• Outlier detection: Exceptions that are not covered by the rule,
e.g., fraud
Uses of Unsupervised Learning
• Learning “what normally happens”
• Clustering: Grouping similar instances
• Example applications
– Customer segmentation in CRM (customer relationship management)
– Image compression: Color quantization
– Bioinformatics: Learning motifs
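Clustering as "grouping similar instances" can be sketched with a plain k-means loop. The slides do not name a specific algorithm, so k-means is an assumption here, and the 2-D points are hypothetical (think of two numeric attributes per customer).

```python
import random

def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans(points, k, iterations=20, seed=0):
    """Group points into k clusters by alternating assignment and centroid update."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # start from k randomly chosen data points
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its members
        for c, members in enumerate(clusters):
            if members:  # keep the old centroid if a cluster ends up empty
                centroids[c] = tuple(sum(dim) / len(members) for dim in zip(*members))
    labels = [min(range(k), key=lambda c: squared_distance(p, centroids[c]))
              for p in points]
    return centroids, labels

# Hypothetical data: two visibly separate groups
points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
centroids, labels = kmeans(points, k=2)
print(labels)
```

No labels are supplied; the grouping comes only from the distances between instances.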
Displaying the structure of a set of documents
Example: Cancer Diagnosis
• Application: automatic disease detection
• Importance: this is modern/future medical diagnosis.
• Prediction goal: Based on past patients, predict whether you
have the disease
• Data: Past patients with and without the disease
• Target: Cancer or no-cancer
• Features: Concentrations of various proteins in your blood
Example: Zipcodes
• Application: automatic zipcode recognition
• Importance: this is modern/future delivery of small goods.
• Goal: Based on your handwritten digits, predict what they are
and use them to route mail
• Data: Black-and-white pixel values
• Target: Which digit
• Features: ?
What makes a 2?
Example: Google
• Application: automatic ad selection
• Importance: this is modern/future advertising.
• Prediction goal: Based on your search query, predict which ads
you might be interested in
• Data: Past queries
• Target: Whether the ad was clicked
• Features: ?
Example: Call Centers
• Application: automatic call routing
• Importance: this is modern/future customer service.
• Prediction goal: Based on your speech recording, predict which
words you said
• Data: Past recordings of various people
• Target: Which word was intended
• Features: ?
Example: Stock Market
• Application: automatic program trading
• Importance: this is modern/future finance.
• Prediction goal: Based on past patterns, predict whether the
stock will go up
• Data: Past stock prices
• Target: Up or down
• Features: ?
Example: Web-based
• The web contains a lot of data. Tasks with very big
datasets often use machine learning
– especially if the data is noisy or non-stationary.
• Spam filtering, fraud detection:
– The enemy adapts so we must adapt too.
• Recommendation systems:
– Lots of noisy data. Million dollar prize!
• Information retrieval:
– Find documents or images with similar content.
What is a Learning Problem?
• Learning involves improving performance
– at some task T
– with experience E
• Develop methods, techniques and tools for building intelligent learning machines that
can solve the problem in combination with an available data set of training examples.
Defining the Learning Task
Improve on task, T, with respect to
performance metric, P, based on experience, E.
T: Playing checkers
P: Percentage of games won against an arbitrary opponent
E: Playing practice games against itself
T: Recognizing hand-written words
P: Percentage of words correctly classified
E: Database of human-labeled images of handwritten words
T: Driving on four-lane highways using vision sensors
P: Average distance traveled before a human-judged error
E: A sequence of images and steering commands recorded while
observing a human driver.
T: Categorize email messages as spam or legitimate.
P: Percentage of email messages correctly classified.
E: Database of emails, some with human-given labels
Designing a Learning System
• Choose the training experience.
• Choose what is to be learned, i.e. the target function.
• Choose how to represent the target function.
• Choose a learning algorithm to infer the target function from the
experience.
(Figure: block diagram with the components Environment/Experience, Learner, Knowledge, and Performance Element)
Components of a Learning Problem
• Task: the behavior or task that’s being improved, e.g. classification, object
recognition, acting in an environment.
• Data: the experiences that are being used to improve performance in the
task.
• Measure of improvements: How can the improvement be measured?
Examples:
– Provide more accurate solutions (e.g. increasing the accuracy in prediction)
– Cover a wider range of problems
– Obtain answers more economically (e.g. improved speed)
– Simplify codified knowledge
– New skills that were not presented initially
What Experience E to Use?
• Direct or indirect?
– Direct: feedback on individual moves
– Indirect: feedback on a sequence of moves
• e.g., whether win or not
• Teacher or not?
– Teacher selects board states
• Tailored learning
• Can be more efficient
– Learner selects board states
• No teacher
• Questions
– Is training experience representative of performance goal?
– Does training experience represent distribution of outcomes in world?
What Exactly Should be Learned?
• Playing checkers:
– Alternating moves with well-defined rules
– Choose moves using some function
– Call this function the Target Function
• Target function (TF): function to be learned during a learning process
– ChooseMove: Board → Move
– ChooseMove is difficult to learn, e.g., with indirect training examples
A key to successful learning is choosing an appropriate target function.
Strategy: reduce learning to a search for the TF
A Possible Target Function V For Checkers
• In checkers, know all legal moves
– From these, choose best move in any situation
• Possible V function for checkers:
– if b is a final board state that is a win, then V(b) = 100
– if b is a final board state that is a loss, then V(b) = -100
– if b is a final board state that is a draw, then V(b) = 0
– if b is not a final state in the game, then V(b) = V(b'), where b' is the best final board state that
can be achieved starting from b and playing optimally until the end of the game
• This gives correct values, but is not operational
– So may have to find good approximation to V
– Call this approximation $\hat{V}$
How Might Target Function be Represented?
– As collection of rules ?
– As neural network ?
– As polynomial function of board features ?
• Example of linear function of board features:
$\hat{V}(b) = w_0 + w_1\,bp(b) + w_2\,rp(b) + w_3\,bk(b) + w_4\,rk(b) + w_5\,bt(b) + w_6\,rt(b)$
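A linear function of board features can be evaluated as in the sketch below. The feature meanings are an assumption based on the usual textbook convention for this checkers example (bp/rp: black/red pieces, bk/rk: black/red kings, bt: red pieces threatened by black, rt: black pieces threatened by red), and the weight values are hypothetical, not learned.

```python
from dataclasses import dataclass

@dataclass
class BoardFeatures:
    bp: int  # number of black pieces (assumed meaning)
    rp: int  # number of red pieces (assumed meaning)
    bk: int  # number of black kings (assumed meaning)
    rk: int  # number of red kings (assumed meaning)
    bt: int  # red pieces threatened by black (assumed meaning)
    rt: int  # black pieces threatened by red (assumed meaning)

def v_hat(weights, f):
    """Linear evaluation: w0 + w1*bp + w2*rp + w3*bk + w4*rk + w5*bt + w6*rt."""
    w0, w1, w2, w3, w4, w5, w6 = weights
    return (w0 + w1 * f.bp + w2 * f.rp + w3 * f.bk
            + w4 * f.rk + w5 * f.bt + w6 * f.rt)

# Hypothetical weights from black's point of view
weights = (0.0, 1.0, -1.0, 3.0, -3.0, 0.5, -0.5)
board = BoardFeatures(bp=5, rp=3, bk=1, rk=0, bt=2, rt=1)
score = v_hat(weights, board)
print(score)
```

Learning then amounts to adjusting the seven weights so that the scores agree with the training experience.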
Inductive and Deductive Learning
• Inductive Learning: Reasoning from a set of examples to
produce general rules. The rules should be applicable to new
examples, but there is no guarantee that the result will be
correct.
• Deductive Learning: Reasoning from a set of known facts and
rules to produce additional rules that are guaranteed to be
true.
Assessment of Learning Algorithms
• The most common criteria for assessing learning algorithms
are:
– Accuracy (e.g. percentages of correctly classified +’s and –’s)
– Efficiency (e.g. examples needed, computational tractability)
– Robustness (e.g. against noise, against incompleteness)
– Special requirements (e.g. incrementality, concept drift)
– Concept complexity (e.g. representational issues – examples &
bookkeeping)
– Transparency (e.g. comprehensibility for the human user)
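The accuracy criterion above (percentages of correctly classified +'s and –'s) can be computed as in this minimal sketch. The labels and predictions are hypothetical toy values, and `class_accuracies` is an illustrative helper.

```python
def class_accuracies(y_true, y_pred):
    """Return overall accuracy and the share of +'s and -'s classified correctly."""
    pairs = list(zip(y_true, y_pred))

    def rate(label):
        # Restrict to examples whose true class is `label`
        relevant = [(t, p) for t, p in pairs if t == label]
        if not relevant:
            return float("nan")
        return sum(t == p for t, p in relevant) / len(relevant)

    overall = sum(t == p for t, p in pairs) / len(pairs)
    return {"overall": overall, "+": rate("+"), "-": rate("-")}

# Hypothetical true labels and classifier outputs
y_true = ["+", "+", "+", "-", "-", "-", "-", "-"]
y_pred = ["+", "+", "-", "-", "-", "-", "-", "+"]
result = class_accuracies(y_true, y_pred)
print(result)
```

Reporting the two classes separately matters when one class is rare, since overall accuracy alone can hide poor performance on it.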
DATA MINING
DATA SCIENCE
Data Mining and Algorithms
• Data Mining
– The desired outcome from data mining is to create a model from a given dataset
that can have its insights generalized to similar datasets. A real-world example of a
successful data mining application can be seen in automatic fraud detection from
banks and credit institutions.
– Data mining is the process of discovering predictive information from the analysis
of large databases. For a data scientist, data mining can be a vague and daunting
task – it requires a diverse set of skills and knowledge of many data mining
techniques to take raw data and successfully get insights from it. You’ll want to
understand the foundations of statistics, and different programming languages that
can help you with data mining at scale.
Data Mining Techniques
• Regression – Estimating the relationships between variables by minimizing the prediction error.
• Classification – Identifying what category an object belongs to. An example is classifying email as
spam or legitimate, or looking at a person’s credit score and approving or denying a loan request.
• Cluster Analysis – Finding natural groupings of data objects based upon the known characteristics
of that data. An example could be seen in marketing, where analysis can reveal customer groupings
with unique behavior – which could be applied in business strategy decisions.
• Association and Correlation Analysis – Looking to see if there are unique relationships between
variables that are not immediately obvious. An example would be the famous case of beer and
diapers: men who bought diapers at the end of the week were much more likely to buy beer, so
stores placed them close to each other to increase sales.
• Outlier analysis – Examining outliers to identify their potential causes. An
example is the use of outlier analysis in fraud detection, trying to determine whether a
pattern of behavior outside the norm is fraud or not.
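To make the regression technique concrete, the sketch below fits a straight line by ordinary least squares, i.e. it picks the slope and intercept that minimize the squared error. The data points are hypothetical, and `fit_line` is an illustrative helper rather than a library function.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Spread of x, and co-variation of x and y, around their means
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical toy data lying exactly on y = 2x + 1
xs = [1, 2, 3, 4, 5]
ys = [3, 5, 7, 9, 11]
slope, intercept = fit_line(xs, ys)
print(slope, intercept)

# Use the fitted relationship to predict a new case
prediction = slope * 6 + intercept
```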
Example: using pandas for a regression model in Python
https://www.springboard.com/blog/data-mining-python-tutorial/
Python script
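# Libraries used for loading, analyzing, and plotting the data
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import scipy.stats as stats
import seaborn as sns
from matplotlib import rcParams

# Load the house-price dataset; adjust the path to your local copy of the CSV
df = pd.read_csv('/Users/python/kc_house_data.csv')
# Show the first five rows (print is needed when run as a script instead of a notebook)
print(df.head())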
Output