
Midterm: Review

Lecturer: Dr. Nguyen Thi Ngoc Anh


Email: [email protected]

Seminar: Support Vector Machines


— Massive Data Mining via Support Vector
Machines
— Support Vector Machines for:
◦ classifying from large datasets
◦ single-class classification
◦ discriminant feature combination discovery

Data Mining: Classification Schemes
— General functionality
◦ Descriptive data mining
◦ Predictive data mining
— Different views, different classifications
◦ Kinds of data to be mined
◦ Kinds of knowledge to be discovered
◦ Kinds of techniques utilized
◦ Kinds of applications adapted

Knowledge Discovery in Databases: Process
— (figure) The KDD process: Data → Target Data (Selection) → Preprocessed Data (Preprocessing) → Patterns (Data Mining) → Knowledge (Interpretation/Evaluation)

adapted from: U. Fayyad, et al. (1995), “From Knowledge Discovery to Data Mining: An Overview,” Advances in Knowledge Discovery and Data Mining, U. Fayyad et al. (Eds.), AAAI/MIT Press

What Can Data Mining Do?
— Cluster
— Classify
◦ Categorical, Regression
— Summarize
◦ Summary statistics, Summary rules
— Link Analysis / Model Dependencies
◦ Association rules
— Sequence analysis
◦ Time-series analysis, Sequential associations
— Detect Deviations

What is a Data Warehouse?

— Defined in many different ways, but not rigorously.
◦ A decision support database that is maintained separately from the organization’s operational database
◦ Supports information processing by providing a solid platform of consolidated, historical data for analysis
— “A data warehouse is a subject-oriented, integrated, time-variant, and nonvolatile collection of data in support of management’s decision-making process.”—W. H. Inmon
— Data warehousing:
◦ The process of constructing and using data warehouses

Example of Star Schema
— Sales Fact Table: time_key, item_key, branch_key, location_key; measures: units_sold, dollars_sold, avg_sales
— Dimension table time: time_key, day, day_of_the_week, month, quarter, year
— Dimension table item: item_key, item_name, brand, type, supplier_type
— Dimension table branch: branch_key, branch_name, branch_type
— Dimension table location: location_key, street, city, state_or_province, country
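A minimal Python sketch (not from the original slides) of the star schema above, assuming pandas is available: a sales fact table is joined to two of its dimension tables and the measures are aggregated. Column names follow the slide; the data values and the brand names are invented for illustration.

import pandas as pd

time_dim = pd.DataFrame({"time_key": [1, 2], "quarter": ["Q1", "Q2"], "year": [2024, 2024]})
item_dim = pd.DataFrame({"item_key": [10, 11], "item_name": ["TV", "PC"], "brand": ["Acme", "Beta"]})
sales_fact = pd.DataFrame({
    "time_key": [1, 1, 2],
    "item_key": [10, 11, 10],
    "units_sold": [5, 3, 7],
    "dollars_sold": [2500.0, 1800.0, 3500.0],
})

# Star join: fact table -> dimension tables via their keys
cube = sales_fact.merge(time_dim, on="time_key").merge(item_dim, on="item_key")

# Aggregate the measures along two dimensions (brand x quarter)
print(cube.groupby(["brand", "quarter"])[["units_sold", "dollars_sold"]].sum())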

From Tables and Spreadsheets to Data Cubes
— A data warehouse is based on a multidimensional data model which views
data in the form of a data cube
— A data cube, such as sales, allows data to be modeled and viewed in
multiple dimensions
◦ Dimension tables, such as item (item_name, brand, type), or time(day, week,
month, quarter, year)
◦ Fact table contains measures (such as dollars_sold) and keys to each of the
related dimension tables
— In data warehousing literature, an n-D base cube is called a base cuboid. The topmost 0-D cuboid, which holds the highest level of summarization, is called the apex cuboid. The lattice of cuboids forms a data cube.

Cube: A Lattice of Cuboids
— 0-D (apex) cuboid: all
— 1-D cuboids: time, item, location, supplier
— 2-D cuboids: (time, item), (time, location), (time, supplier), (item, location), (item, supplier), (location, supplier)
— 3-D cuboids: (time, item, location), (time, item, supplier), (time, location, supplier), (item, location, supplier)
— 4-D (base) cuboid: (time, item, location, supplier)
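A minimal Python sketch (not from the slides) that enumerates the lattice above: every subset of the four dimensions is one cuboid, from the 0-D apex to the 4-D base.

from itertools import combinations

dimensions = ["time", "item", "location", "supplier"]
for k in range(len(dimensions) + 1):
    for cuboid in combinations(dimensions, k):
        label = ", ".join(cuboid) if cuboid else "all (apex)"
        print(f"{k}-D cuboid: {label}")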

A Sample Data Cube
— (figure) A 3-D sales cube with dimensions Date (1Qtr, 2Qtr, 3Qtr, 4Qtr), Product (TV, PC, VCR), and Country (U.S.A., Canada, Mexico), plus sum cells along each dimension; e.g., one aggregate cell holds the total annual sales of TVs in the U.S.A.

Warehouse Summary
— Data warehouse
— A multi-dimensional model of a data warehouse
◦ Star schema, snowflake schema, fact constellations
◦ A data cube consists of dimensions & measures
— OLAP operations: drilling, rolling, slicing, dicing and
pivoting
— OLAP servers: ROLAP, MOLAP, HOLAP
— Efficient computation of data cubes
◦ Partial vs. full vs. no materialization
◦ Multiway array aggregation
◦ Bitmap index and join index implementations
— Further development of data cube technology
◦ Discovery-driven and multi-feature cubes
◦ From OLAP to OLAM (on-line analytical mining)


Data Preprocessing
— Data in the real world is dirty
◦ incomplete: lacking attribute values, lacking certain
attributes of interest, or containing only aggregate
data
– e.g., occupation=“”
◦ noisy: containing errors or outliers
– e.g., Salary=“-10”
◦ inconsistent: containing discrepancies in codes or
names
– e.g., Age=“42” Birthday=“03/07/1997”
– e.g., Was rating “1,2,3”, now rating “A, B, C”
– e.g., discrepancy between duplicate records

Multi-Dimensional Measure of Data
Quality
— A well-accepted multidimensional view:
◦ Accuracy
◦ Completeness
◦ Consistency
◦ Timeliness
◦ Believability
◦ Value added
◦ Interpretability
◦ Accessibility
— Broad categories:
◦ intrinsic, contextual, representational, and accessibility.


Major Tasks in Data Preprocessing


— Data cleaning
◦ Fill in missing values, smooth noisy data, identify or remove
outliers, and resolve inconsistencies
— Data integration
◦ Integration of multiple databases, data cubes, or files
— Data transformation
◦ Normalization and aggregation
— Data reduction
◦ Obtains a reduced representation that is smaller in volume but produces the same or similar analytical results
— Data discretization
◦ Part of data reduction but with particular importance, especially
for numerical data

How to Handle Missing Data?
— Ignore the tuple: usually done when the class label is missing (assuming the task is classification); not effective when the percentage of missing values per attribute varies considerably.
— Fill in the missing value manually: tedious + infeasible?
— Fill it in automatically with
◦ a global constant : e.g., “unknown”, a new class?!
◦ the attribute mean
◦ the attribute mean for all samples belonging to the same class:
smarter
◦ the most probable value: inference-based such as Bayesian
formula or decision tree
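A minimal pandas sketch of the automatic fill-in strategies above (global constant, attribute mean, class-conditional mean); the toy DataFrame and its column names are invented for illustration.

import pandas as pd

df = pd.DataFrame({
    "class": ["A", "A", "B", "B"],
    "income": [50.0, None, 70.0, None],
})

# Global constant
df["income_const"] = df["income"].fillna(-1)
# Attribute mean
df["income_mean"] = df["income"].fillna(df["income"].mean())
# Attribute mean for all samples of the same class (usually smarter)
df["income_class_mean"] = df["income"].fillna(
    df.groupby("class")["income"].transform("mean"))
print(df)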


How to Handle Noisy Data?


— Binning method:
◦ first sort data and partition into (equi-depth) bins
◦ then one can smooth by bin means, smooth by bin
median, smooth by bin boundaries, etc.
— Clustering
◦ detect and remove outliers
— Combined computer and human inspection
◦ detect suspicious values and check by human (e.g.,
deal with possible outliers)
— Regression
◦ smooth by fitting the data into regression functions
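A minimal numpy sketch (not from the slides) of the binning method above: equi-depth partitioning followed by smoothing by bin means; the price values are invented sample data.

import numpy as np

prices = np.array([4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34])
n_bins = 3

# Equi-depth (equal-frequency) partition of the sorted data
sorted_prices = np.sort(prices)
bins = np.array_split(sorted_prices, n_bins)

# Smooth each bin by replacing its values with the bin mean
smoothed = np.concatenate([np.full(len(b), b.mean()) for b in bins])
print(smoothed)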

Data Transformation
— Smoothing: remove noise from data
— Aggregation: summarization, data cube
construction
— Generalization: concept hierarchy climbing
— Normalization: scaled to fall within a small,
specified range
◦ min-max normalization
◦ z-score normalization
◦ normalization by decimal scaling
— Attribute/feature construction
◦ New attributes constructed from the given ones
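A minimal numpy sketch of the three normalization methods listed above; the sample values are invented.

import numpy as np

x = np.array([200.0, 300.0, 400.0, 600.0, 1000.0])

# Min-max normalization to the range [0, 1]
min_max = (x - x.min()) / (x.max() - x.min())

# Z-score normalization
z_score = (x - x.mean()) / x.std()

# Decimal scaling: divide by 10^j so that the largest absolute value becomes < 1
j = int(np.floor(np.log10(np.abs(x).max()))) + 1
decimal_scaled = x / (10 ** j)

print(min_max, z_score, decimal_scaled, sep="\n")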


Data Reduction Strategies


— A data warehouse may store terabytes of data
◦ Complex data analysis/mining may take a very long time to run
on the complete data set
— Data reduction
◦ Obtain a reduced representation of the data set that is much smaller in volume, yet produces the same (or almost the same) analytical results
— Data reduction strategies
◦ Data cube aggregation
◦ Dimensionality reduction — remove unimportant attributes
◦ Data Compression
◦ Numerosity reduction — fit data into models
◦ Discretization and concept hierarchy generation

Principal Component Analysis
— Given N data vectors from k-dimensions, find c
≤ k orthogonal vectors that can be best used
to represent data
◦ The original data set is reduced to one consisting of
N data vectors on c principal components (reduced
dimensions)
— Each data vector is a linear combination of the c
principal component vectors
— Works for numeric data only
— Used when the number of dimensions is large
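A minimal numpy sketch (not from the slides) of PCA as described above: project N data vectors from k dimensions onto the top c principal components. The random data is only for illustration.

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))   # N = 100 data vectors, k = 5 dimensions
c = 2                           # keep c <= k principal components

# Center the data, then take the top-c eigenvectors of the covariance matrix
X_centered = X - X.mean(axis=0)
cov = np.cov(X_centered, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)            # eigenvalues in ascending order
components = eigvecs[:, np.argsort(eigvals)[::-1][:c]]

# Each reduced data vector is a linear combination of the c principal components
X_reduced = X_centered @ components
print(X_reduced.shape)   # (100, 2)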


Discretization
— Three types of attributes:
◦ Nominal — values from an unordered set
◦ Ordinal — values from an ordered set
◦ Continuous — real numbers
— Discretization:
◦ divide the range of a continuous attribute into
intervals
◦ Some classification algorithms only accept categorical
attributes.
◦ Reduce data size by discretization
◦ Prepare for further analysis
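A minimal pandas sketch of discretization: dividing the range of a continuous attribute into intervals, either equal-width or equal-frequency. The age values and interval labels are invented for illustration.

import pandas as pd

ages = pd.Series([13, 15, 16, 19, 20, 21, 25, 25, 30, 35, 36, 40, 45, 46, 52, 70])

equal_width = pd.cut(ages, bins=3, labels=["young", "middle", "old"])   # equal-width intervals
equal_freq = pd.qcut(ages, q=3, labels=["young", "middle", "old"])      # equal-frequency intervals
print(pd.DataFrame({"age": ages, "equal_width": equal_width, "equal_freq": equal_freq}))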

Data Preparation Summary
— Data preparation is a big issue for both
warehousing and mining
— Data preparation includes
◦ Data cleaning and data integration
◦ Data reduction and feature selection
◦ Discretization
— A lot of methods have been developed, but this is still an active area of research


Association Rule Mining


— Finding frequent patterns, associations, correlations, or
causal structures among sets of items or objects in
transaction databases, relational databases, and other
information repositories.
◦ Frequent pattern: pattern (set of items, sequence, etc.) that
occurs frequently in a database [AIS93]
— Motivation: finding regularities in data
◦ What products were often purchased together? — Beer and
diapers?!
◦ What are the subsequent purchases after buying a PC?
◦ What kinds of DNA are sensitive to this new drug?
◦ Can we automatically classify web documents?

Basic Concepts: Association Rules
— Itemset X = {x1, …, xk}
— Find all the rules X → Y with minimum confidence and support
◦ support, s: probability that a transaction contains X ∪ Y
◦ confidence, c: conditional probability that a transaction having X also contains Y

Transaction-id   Items bought
10               A, B, C
20               A, C
30               A, D
40               B, E, F

Let min_support = 50%, min_conf = 50%:
A → C (50%, 66.7%)
C → A (50%, 100%)
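A minimal Python sketch computing support and confidence for the toy transaction database above, reproducing the two rules A → C and C → A.

transactions = {10: {"A", "B", "C"}, 20: {"A", "C"}, 30: {"A", "D"}, 40: {"B", "E", "F"}}
n = len(transactions)

def support(itemset):
    # Fraction of transactions that contain the whole itemset
    return sum(itemset <= t for t in transactions.values()) / n

def confidence(lhs, rhs):
    # Conditional probability that a transaction having lhs also contains rhs
    return support(lhs | rhs) / support(lhs)

print(f"A -> C: support {support({'A', 'C'}):.0%}, confidence {confidence({'A'}, {'C'}):.1%}")
print(f"C -> A: support {support({'A', 'C'}):.0%}, confidence {confidence({'C'}, {'A'}):.1%}")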

The Apriori Algorithm—An Example (min_support = 50%)

Database TDB:
Tid   Items
10    A, C, D
20    B, C, E
30    A, B, C, E
40    B, E

— 1st scan: C1 = {A}:2, {B}:3, {C}:3, {D}:1, {E}:3  →  L1 = {A}:2, {B}:3, {C}:3, {E}:3
— 2nd scan: C2 = {A,B}:1, {A,C}:2, {A,E}:1, {B,C}:2, {B,E}:3, {C,E}:2  →  L2 = {A,C}:2, {B,C}:2, {B,E}:3, {C,E}:2
— 3rd scan: C3 = {B,C,E}  →  L3 = {B,C,E}:2
— Rules with frequency ≥ 50% and confidence 100%: A → C, B → E, BC → E, CE → B, BE → C
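A minimal level-wise Apriori sketch (not the lecture's own code) that reproduces L1, L2, and L3 for the toy database above with min_support = 2; the candidate subset-pruning step is omitted for brevity.

db = [{"A", "C", "D"}, {"B", "C", "E"}, {"A", "B", "C", "E"}, {"B", "E"}]
min_support = 2

def frequent(candidates):
    # Keep only candidates whose support count meets the threshold
    counts = {c: sum(c <= t for t in db) for c in candidates}
    return {c: n for c, n in counts.items() if n >= min_support}

items = {x for t in db for x in t}
L = frequent({frozenset([x]) for x in items})   # L1: frequent 1-itemsets
k = 1
while L:
    print(f"L{k}:", sorted((tuple(sorted(s)), n) for s, n in L.items()))
    # Candidate generation: join Lk with itself to form (k+1)-itemsets
    candidates = {a | b for a in L for b in L if len(a | b) == k + 1}
    L = frequent(candidates)
    k += 1
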
FP-Tree Algorithm (min_support = 3)

TID   Items bought                (ordered) frequent items
100   {f, a, c, d, g, i, m, p}    {f, c, a, m, p}
200   {a, b, c, f, l, m, o}       {f, c, a, b, m}
300   {b, f, h, j, o, w}          {f, b}
400   {b, c, k, s, p}             {c, b, p}
500   {a, f, c, e, l, p, m, n}    {f, c, a, m, p}

1. Scan DB once, find frequent 1-itemsets (single item patterns); header table: f:4, c:4, a:3, b:3, m:3, p:3
2. Sort frequent items in frequency descending order: F-list = f-c-a-b-m-p
3. Scan DB again, construct the FP-tree (figure): root {} with a main path f:4 → c:3 → a:3 → m:2 → p:2, side branches a:3 → b:1 → m:1 and f:4 → b:1, and a separate path c:1 → b:1 → p:1, with node links from the header table

Constrained Frequent Pattern Mining: A Mining Query Optimization Problem
— Given a frequent pattern mining query with a set of constraints C,
the algorithm should be
◦ sound: it only finds frequent sets that satisfy the given
constraints C
◦ complete: all frequent sets satisfying the given constraints C are
found
— A naïve solution
◦ First find all frequent sets, and then test them for constraint
satisfaction
— More efficient approaches:
◦ Analyze the properties of constraints comprehensively
◦ Push them as deeply as possible inside the frequent pattern
computation.

Classification: Model Construction
— (figure) The training data is fed to a classification algorithm, which produces a classifier (model)

NAME   RANK            YEARS   TENURED
Mike   Assistant Prof  3       no
Mary   Assistant Prof  7       yes
Bill   Professor       2       yes
Jim    Associate Prof  7       yes
Dave   Assistant Prof  6       no
Anne   Associate Prof  3       no

Learned model: IF rank = ‘professor’ OR years > 6 THEN tenured = ‘yes’

Classification: Use the Model in Prediction
— (figure) The classifier is applied to testing data (to estimate accuracy) and then to unseen data

NAME      RANK            YEARS   TENURED
Tom       Assistant Prof  2       no
Merlisa   Associate Prof  7       no
George    Professor       5       yes
Joseph    Assistant Prof  7       yes

Unseen data: (Jeff, Professor, 4) → Tenured?
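A minimal Python sketch applying the learned rule from the slides (IF rank = 'professor' OR years > 6 THEN tenured = 'yes') to the unseen tuple.

def predict_tenured(rank: str, years: int) -> str:
    # The rule induced from the training data above
    return "yes" if rank.lower() == "professor" or years > 6 else "no"

print(predict_tenured("Professor", 4))   # Jeff -> 'yes'
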
Naïve Bayes Classifier
— A simplified assumption: attributes are conditionally independent:
  $P(X \mid C_i) = \prod_{k=1}^{n} P(x_k \mid C_i)$
— The probability of, say, two attribute values x1 and x2 occurring together, given the current class C, is the product of the probabilities of each value taken separately given that class: P([x1, x2] | C) = P(x1 | C) * P(x2 | C)
— No dependence relation between attributes
— Greatly reduces the computation cost: only count the class distribution
— Once the probability P(X|Ci) is known, assign X to the class with maximum P(X|Ci) * P(Ci)
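A minimal Naive Bayes sketch (not the lecture's code): class priors multiplied by per-attribute conditional probabilities estimated from counts, with no smoothing. The toy (outlook, windy) → play dataset is invented for illustration.

from collections import Counter, defaultdict

data = [(("sunny", "no"), "yes"), (("sunny", "yes"), "no"),
        (("rain", "no"), "yes"), (("rain", "yes"), "no"),
        (("sunny", "no"), "yes")]

classes = Counter(c for _, c in data)
cond = defaultdict(Counter)          # counts of attribute value per (attribute index, class)
for x, c in data:
    for k, v in enumerate(x):
        cond[(k, c)][v] += 1

def posterior(x, c):
    p = classes[c] / len(data)                    # P(Ci)
    for k, v in enumerate(x):
        p *= cond[(k, c)][v] / classes[c]         # P(xk | Ci)
    return p

x_new = ("sunny", "no")
print(max(classes, key=lambda c: posterior(x_new, c)))   # -> 'yes'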


Bayesian Belief Network
— (figure) Example network over the variables FamilyHistory, Smoker, LungCancer, Emphysema, PositiveXRay, and Dyspnea; FamilyHistory and Smoker are the parents of LungCancer
— The conditional probability table (CPT) for the variable LungCancer shows the conditional probability for each possible combination of its parents:

        (FH, S)   (FH, ~S)   (~FH, S)   (~FH, ~S)
  LC    0.8       0.5        0.7        0.1
  ~LC   0.2       0.5        0.3        0.9

— Joint probability in a Bayesian belief network: $P(z_1, \ldots, z_n) = \prod_{i=1}^{n} P(z_i \mid Parents(Z_i))$

Decision Tree
— (figure) Example tree: the root tests age?
◦ age <= 30 → test student? (no → no, yes → yes)
◦ age 31..40 → yes
◦ age > 40 → test credit rating? (excellent → no, fair → yes)

Algorithm for Decision Tree Induction


— Basic algorithm (a greedy algorithm)
◦ Tree is constructed in a top-down recursive divide-and-conquer manner
◦ At start, all the training examples are at the root
◦ Attributes are categorical (if continuous-valued, they are discretized in
advance)
◦ Examples are partitioned recursively based on selected attributes
◦ Test attributes are selected on the basis of a heuristic or statistical
measure (e.g., information gain)
— Conditions for stopping partitioning
◦ All samples for a given node belong to the same class
◦ There are no remaining attributes for further partitioning – majority
voting is employed for classifying the leaf
◦ There are no samples left

Attribute Selection Measure: Information Gain (ID3/C4.5)
— Select the attribute with the highest information gain
— S contains $s_i$ tuples of class $C_i$ for $i = 1, \ldots, m$
— Information measure (info required to classify any arbitrary tuple):
  $I(s_1, s_2, \ldots, s_m) = -\sum_{i=1}^{m} \frac{s_i}{s} \log_2 \frac{s_i}{s}$
— Entropy of attribute A with values $\{a_1, a_2, \ldots, a_v\}$:
  $E(A) = \sum_{j=1}^{v} \frac{s_{1j} + \cdots + s_{mj}}{s}\, I(s_{1j}, \ldots, s_{mj})$
— Information gained by branching on attribute A:
  $Gain(A) = I(s_1, s_2, \ldots, s_m) - E(A)$

Definition of Entropy
— Entropy: $H(X) = \sum_{x \in A_X} -P(x) \log_2 P(x)$
— Example: coin flip
◦ $A_X$ = {heads, tails}
◦ P(heads) = P(tails) = ½
◦ ½ · log2(½) = ½ · (−1), so H(X) = −2 · (−½) = 1
— What about a two-headed coin? (H(X) = 0, since the outcome is certain)
— Conditional entropy: $H(X \mid Y) = \sum_{y \in A_Y} P(y)\, H(X \mid y)$

Attribute Selection by Information Gain Computation
— Class P: buys_computer = “yes”; Class N: buys_computer = “no”
— I(p, n) = I(9, 5) = 0.940
— Compute the entropy for age:

  age      pi   ni   I(pi, ni)
  <=30     2    3    0.971
  31…40    4    0    0
  >40      3    2    0.971

  $E(age) = \frac{5}{14} I(2,3) + \frac{4}{14} I(4,0) + \frac{5}{14} I(3,2) = 0.694$

  where $\frac{5}{14} I(2,3)$ means “age <= 30” has 5 out of 14 samples, with 2 yes’es and 3 no’s. Hence
  $Gain(age) = I(p, n) - E(age) = 0.246$
— Similarly, Gain(income) = 0.029, Gain(student) = 0.151, Gain(credit_rating) = 0.048

Training data (buys_computer):
  age      income   student   credit_rating   buys_computer
  <=30     high     no        fair            no
  <=30     high     no        excellent       no
  31…40    high     no        fair            yes
  >40      medium   no        fair            yes
  >40      low      yes       fair            yes
  >40      low      yes       excellent       no
  31…40    low      yes       excellent       yes
  <=30     medium   no        fair            no
  <=30     low      yes       fair            yes
  >40      medium   yes       fair            yes
  <=30     medium   yes       excellent       yes
  31…40    medium   no        excellent       yes
  31…40    high     yes       fair            yes
  >40      medium   no        excellent       no
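A minimal Python sketch (not the lecture's own code) that reproduces the gain values above (0.246, 0.029, 0.151, 0.048) from the buys_computer training data.

import math
from collections import Counter

data = [
    ("<=30", "high", "no", "fair", "no"),    ("<=30", "high", "no", "excellent", "no"),
    ("31..40", "high", "no", "fair", "yes"), (">40", "medium", "no", "fair", "yes"),
    (">40", "low", "yes", "fair", "yes"),    (">40", "low", "yes", "excellent", "no"),
    ("31..40", "low", "yes", "excellent", "yes"), ("<=30", "medium", "no", "fair", "no"),
    ("<=30", "low", "yes", "fair", "yes"),   (">40", "medium", "yes", "fair", "yes"),
    ("<=30", "medium", "yes", "excellent", "yes"), ("31..40", "medium", "no", "excellent", "yes"),
    ("31..40", "high", "yes", "fair", "yes"), (">40", "medium", "no", "excellent", "no"),
]
attributes = ["age", "income", "student", "credit_rating"]

def info(labels):
    # I(s1, ..., sm): entropy of the class distribution
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def gain(attr_index):
    labels = [row[-1] for row in data]
    e_attr = 0.0
    for value in {row[attr_index] for row in data}:
        subset = [row[-1] for row in data if row[attr_index] == value]
        e_attr += len(subset) / len(data) * info(subset)   # E(A)
    return info(labels) - e_attr                           # Gain(A) = I - E(A)

for i, a in enumerate(attributes):
    print(f"Gain({a}) = {gain(i):.3f}")   # age 0.246, income 0.029, student 0.151, credit_rating 0.048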

Overfitting in Decision Trees


— Overfitting: An induced tree may overfit the training
data
◦ Too many branches, some may reflect anomalies due to noise or
outliers
◦ Poor accuracy for unseen samples
— Two approaches to avoid overfitting
◦ Prepruning: Halt tree construction early—do not split a node if
this would result in the goodness measure falling below a
threshold
– Difficult to choose an appropriate threshold
◦ Postpruning: Remove branches from a “fully grown” tree—get a
sequence of progressively pruned trees
– Use a set of data different from the training data to decide
which is the “best pruned tree”

Artificial Neural Networks: A Neuron
— (figure) Inputs $x_0, x_1, \ldots, x_n$ (input vector x) with weights $w_0, w_1, \ldots, w_n$ (weight vector w) feed a weighted sum $\sum$ with bias $-\mu_k$, followed by an activation function f, giving the output y
— The n-dimensional input vector x is mapped into variable y by means of the scalar product and a nonlinear function mapping
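A minimal numpy sketch of the neuron above, computing y = f(w · x − μk); the sigmoid is an assumed choice of activation function, and the input values are invented.

import numpy as np

def neuron(x, w, mu_k):
    weighted_sum = np.dot(w, x) - mu_k          # scalar product minus the bias
    return 1.0 / (1.0 + np.exp(-weighted_sum))  # sigmoid activation f

x = np.array([0.5, -1.0, 2.0])
w = np.array([0.4, 0.3, 0.9])
print(neuron(x, w, mu_k=0.1))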

Artificial Neural Networks: Training


— The ultimate objective of training
◦ obtain a set of weights that makes almost all the tuples in the
training data classified correctly
— Steps
◦ Initialize weights with random values
◦ Feed the input tuples into the network one by one
◦ For each unit
– Compute the net input to the unit as a linear combination of all the
inputs to the unit
– Compute the output value using the activation function
– Compute the error
– Update the weights and the bias

SVM – Support Vector Machines
— (figure) Two linear separators of the same two-class data: one with a small margin and one with a large margin; the training points that lie on the margin are the support vectors

Non-separable Case
— When the data set is non-separable, as shown in the figure (several points marked X fall on the wrong side of the separating hyperplane $x^T\beta + \beta_0 = 0$), we assign a weight to each support vector, which appears in the constraint.

Non-separable Cont.
1. The constraint changes to the following:
   $y_i(x_i^T\beta + \beta_0) > C(1 - \xi_i)$, where $\forall i,\ \xi_i > 0,\ \sum_{i=1}^{N} \xi_i < \text{const}$
2. Thus the optimization problem changes to:
   $\min \lVert\beta\rVert$ subject to $\begin{cases} y_i(x_i^T\beta + \beta_0) > 1 - \xi_i, & i = 1, \ldots, N \\ \forall i,\ \xi_i > 0,\ \sum_{i=1}^{N}\xi_i < \text{const} \end{cases}$

General SVM
— This classification problem (figure) clearly does not have a good optimal linear classifier.
— Can we do better? A non-linear boundary, as shown, will do fine.

General SVM Cont.
— The idea is to map the feature space into
a much bigger space so that the boundary
is linear in the new space.
— Generally, linear boundaries in the enlarged space achieve better training-class separation, and this translates to non-linear boundaries in the original space.


Mapping
— Mapping: $\Phi : \mathbb{R}^d \mapsto H$
◦ Need distances in H: $\Phi(x_i) \cdot \Phi(x_j)$
— Kernel function: $K(x_i, x_j) = \Phi(x_i) \cdot \Phi(x_j)$
◦ Example: $K(x_i, x_j) = e^{-\lVert x_i - x_j \rVert^2 / 2\sigma^2}$
— In this example, H is infinite-dimensional
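A minimal numpy sketch of the RBF kernel defined above, $K(x_i, x_j) = e^{-\lVert x_i - x_j\rVert^2/2\sigma^2}$, computed for a whole kernel matrix; the sample points and σ are invented.

import numpy as np

def rbf_kernel(X, sigma=1.0):
    # Pairwise squared Euclidean distances via ||a-b||^2 = ||a||^2 + ||b||^2 - 2 a.b
    sq_norms = (X ** 2).sum(axis=1)
    sq_dists = np.maximum(sq_norms[:, None] + sq_norms[None, :] - 2 * X @ X.T, 0.0)
    return np.exp(-sq_dists / (2 * sigma ** 2))

X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 2.0]])
print(rbf_kernel(X, sigma=1.0))
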
The k-Nearest Neighbor Algorithm
— All instances correspond to points in the n-D space.
— The nearest neighbors are defined in terms of Euclidean distance.
— The target function could be discrete- or real-valued.
— For discrete-valued targets, k-NN returns the most common value among the k training examples nearest to xq.
— Voronoi diagram: the decision surface induced by 1-NN for a typical set of training examples.
— (figure) A query point xq among training examples labeled + and −
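A minimal numpy sketch of k-NN for a discrete-valued target: return the most common class among the k training examples nearest (in Euclidean distance) to xq. The training points and labels are invented.

import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, xq, k=3):
    dists = np.linalg.norm(X_train - xq, axis=1)   # Euclidean distances to xq
    nearest = np.argsort(dists)[:k]                # indices of the k nearest examples
    return Counter(y_train[nearest]).most_common(1)[0][0]

X_train = np.array([[1.0, 1.0], [1.2, 0.8], [3.0, 3.2], [3.1, 2.9], [0.9, 1.1]])
y_train = np.array(["-", "-", "+", "+", "-"])
print(knn_predict(X_train, y_train, xq=np.array([1.0, 0.9]), k=3))   # -> '-'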

Case-Based Reasoning
— Also uses: lazy evaluation + analyze similar instances
— Difference: Instances are not “points in a Euclidean
space”
— Example: Water faucet problem in CADET (Sycara et al. '92)
— Methodology
◦ Instances represented by rich symbolic descriptions (e.g.,
function graphs)
◦ Multiple retrieved cases may be combined
◦ Tight coupling between case retrieval, knowledge-based
reasoning, and problem solving
— Research issues
◦ Indexing based on syntactic similarity measure, and when failure,
backtracking, and adapting to additional cases

Regression Analysis and Log-Linear Models in Prediction
— Linear regression: Y = a + b X
◦ Two parameters, a and b, specify the line and are to be estimated by using the data at hand
◦ Fit using the least squares criterion on the known values of Y1, Y2, …, X1, X2, …
— Multiple regression: Y = b0 + b1 X1 + b2 X2
◦ Many nonlinear functions can be transformed into the above
— Log-linear models:
◦ The multi-way table of joint probabilities is approximated by a product of lower-order tables
◦ Probability: $p(a, b, c, d) = \alpha_{ab}\,\beta_{ac}\,\chi_{ad}\,\delta_{bcd}$
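A minimal numpy sketch of fitting Y = a + b X by the least squares criterion; the sample data points are invented.

import numpy as np

X = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
Y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

# Closed-form least squares estimates of slope b and intercept a
b = ((X - X.mean()) * (Y - Y.mean())).sum() / ((X - X.mean()) ** 2).sum()
a = Y.mean() - b * X.mean()
print(f"Y ≈ {a:.3f} + {b:.3f} X")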


Bagging and Boosting
— General idea (figure): the training data is fed to a classification method (CM) to produce classifier C; altered versions of the training data are fed to the same CM to produce classifiers C1, C2, …; the individual classifiers are then aggregated into a combined classifier C*
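A minimal bagging sketch (not the lecture's code): train the same simple base classifier on bootstrap-resampled ("altered") training sets and aggregate their predictions by majority vote. The base classifier here is a trivial 1-nearest-neighbor rule, and the data is invented.

import numpy as np
from collections import Counter

rng = np.random.default_rng(0)
X = np.array([[0.0], [1.0], [2.0], [3.0], [4.0], [5.0]])
y = np.array(["no", "no", "no", "yes", "yes", "yes"])

def nn_classifier(X_train, y_train):
    # Returns a classifier that predicts the label of the nearest training point
    return lambda x: y_train[np.argmin(np.abs(X_train - x).sum(axis=1))]

# Build classifiers C1..Ck on bootstrap samples of the training data
classifiers = []
for _ in range(5):
    idx = rng.integers(0, len(X), size=len(X))      # altered training data
    classifiers.append(nn_classifier(X[idx], y[idx]))

def aggregated(x):
    # Combined classifier C*: majority vote over the individual classifiers
    return Counter(c(x) for c in classifiers).most_common(1)[0][0]

print(aggregated(np.array([2.6])))
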
Test Taking Hints
— Open book/notes
◦ Pretty much any non-electronic aid allowed
— See old copies of my exams (and
solutions) at my web site
◦ CS 526
◦ CS 541
◦ CS 603
— Time will be tight
◦ Suggested “time on question” provided


Seminar Thursday:
Support Vector Machines
— Massive Data Mining via Support Vector
Machines
— Support Vector Machines for:
◦ classifying from large datasets
◦ single-class classification
◦ discriminant feature combination discovery
