Intro To Data Science Summary
Conclusion: After getting to know your variables, handle the missing values and check the variable types, and then use visualizations to analyze your variables.
Lecture 5 - kNN
Lazy Learning – Classification Using Nearest Neighbors
NN classifiers: classify unlabeled examples by assigning them the class of the most similar
labeled examples.
Strengths:
● Simple and effective
● Makes no assumptions about the underlying data distribution
● Fast Training Phase
Weaknesses:
● Does not produce a model, which limits the ability to find novel insights in the relationships among features
● Slow Classification Phase
● Requires a large amount of memory
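A minimal sketch (not from the lecture) of a kNN classifier in scikit-learn, assuming the iris dataset and k = 3 purely for illustration. It shows why the training phase is fast (the data is simply stored) while classification does the distance work and needs the whole training set in memory.

```python
# Illustrative kNN sketch with scikit-learn; dataset and k=3 are assumptions.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# kNN is distance-based, so features are usually rescaled first.
scaler = StandardScaler().fit(X_train)

knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(scaler.transform(X_train), y_train)          # "training" just stores the examples
print(knn.score(scaler.transform(X_test), y_test))   # the real work happens at prediction time
```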
Decision Trees
Strengths:
● An all-purpose classifier that does well on most problems
● Can handle numeric or nominal features, missing data
● Uses only the most important features
● Can be used on data with relatively few training examples
Weaknesses:
● Often biased toward splits on features having a large number of levels
● It is easy to overfit or underfit the model
● Can have trouble modeling some relationships due to the reliance on axis-parallel splits
● Small changes in training data can result in large changes to decision logic
● Large trees can be difficult to interpret and the decisions they make may seem
counterintuitive.
Choosing the best split
C5.0 and many other decision tree algorithms use entropy to measure purity; the best split is the one that results in the purest partitions (if a segment of data contains only a single class, it is considered pure).
The entropy of a sample of data indicates how mixed the class values are: the lowest value, 0, means the data is completely homogeneous, while a value of 1 (for a two-class problem) indicates the maximum amount of disorder.
For a given segment of data (S):
Entropy(S) = Σ −p(i) · log2(p(i)) for i = 1 … c
where c is the number of different class levels and p(i) is the proportion of values falling into class level i.
The higher the information gain, the better a feature is at creating homogeneous groups after a split on that feature.
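As a rough illustration of this calculation, the sketch below computes entropy and information gain (entropy before the split minus the weighted entropy after it) in Python; the class labels and the split are made up.

```python
import numpy as np

def entropy(labels):
    """Entropy of a segment: sum over classes of -p(i) * log2(p(i))."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

def information_gain(parent, partitions):
    """Entropy before the split minus the weighted entropy after it."""
    n = sum(len(part) for part in partitions)
    weighted = sum(len(part) / n * entropy(part) for part in partitions)
    return entropy(parent) - weighted

parent = ["yes"] * 9 + ["no"] * 5        # illustrative labels
left, right = parent[:8], parent[8:]     # a hypothetical split
print(entropy(parent), information_gain(parent, [left, right]))
```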
A decision tree can continue to grow indefinitely, choosing splitting features and dividing into
smaller and smaller partitions until each example is perfectly classified or the algorithm runs out
of features to split on.
If the tree grows overly large, many of the decisions it makes will be overly specific and the
model will have been overfitted to the training data.
Pruning a decision tree reduces its size so that it generalizes better to unseen data.
Pre-pruning: stop the tree from growing and doing needless work once it reaches a certain
number of decisions or if the decision nodes contain only a small number of examples.
Post-pruning: grow a tree that is intentionally too large, then prune it back using criteria based on the error rates at the nodes to reduce the tree to a more appropriate size.
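A hedged sketch of these two ideas using scikit-learn's DecisionTreeClassifier (not C5.0, which the notes reference): pre-pruning through constructor parameters and post-pruning through cost-complexity pruning. The dataset and parameter values are illustrative only.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

pre_pruned = DecisionTreeClassifier(
    max_depth=5,           # stop after a certain number of decisions
    min_samples_leaf=10,   # stop when nodes contain only a small number of examples
    random_state=0,
).fit(X, y)

# Post-pruning: grow fully, then prune back by cost-complexity (ccp_alpha).
post_pruned = DecisionTreeClassifier(ccp_alpha=0.01, random_state=0).fit(X, y)
print(pre_pruned.get_depth(), post_pruned.get_depth())
```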
Defining a cost matrix: a matrix with two rows and two columns that specifies the cost of each kind of decision, so that some classification errors are penalized more heavily than others.
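Rough sketch only: the 2x2 cost matrix described here comes from C5.0; scikit-learn has no direct cost-matrix argument, so class_weight is shown below as a common stand-in that makes errors on one class more expensive. The costs and weights are illustrative.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.tree import DecisionTreeClassifier

# Conceptual 2x2 cost matrix: rows = predicted class, columns = actual class.
cost_matrix = [[0, 4],   # predicting 0 when the truth is 1 costs 4 (false negative)
               [1, 0]]   # predicting 1 when the truth is 0 costs 1 (false positive)

X, y = load_breast_cancer(return_X_y=True)
# Approximation of the cost matrix: weight class 1 four times as heavily as class 0.
clf = DecisionTreeClassifier(class_weight={0: 1, 1: 4}, random_state=0).fit(X, y)
print(clf.score(X, y))
```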
Rule-based learners should serve as the baseline for our model (read more about them in the slides).
Data Cleaning
Missing Values
1. Mark invalid or corrupt values as missing in your dataset.
2. Confirm that the presence of marked missing values causes problems for learning
algorithms.
3. You may choose to remove rows with missing data from your dataset and evaluate a
learning algorithm on the transformed dataset.
4. Removing rows can be too limiting on some predictive modeling problems, for example if more than 30% of the data is missing.
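A small pandas sketch of these steps; the column names and the sentinel value 9999 are hypothetical.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"age": [25.0, 9999.0, 40.0], "income": [50000, 62000, np.nan]})

# 1. Mark invalid/corrupt values as missing.
df["age"] = df["age"].replace(9999.0, np.nan)

# 2. Inspect how much is missing per column.
print(df.isna().mean())

# 3. Either drop rows with missing data ...
dropped = df.dropna()
# 4. ... or impute instead, which is less limiting when a lot of data is missing.
imputed = df.fillna(df.median(numeric_only=True))
```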
Feature Selection
Feature selection is the process of reducing the number of input variables when developing a
predictive model. It is desirable to reduce the number of input variables to both reduce the
computational cost of modeling and, in many cases, to improve the performance of the model.
Statistical-based feature selection methods involve evaluating the relationship between each
input variable and the target variable using statistics and selecting those input variables that
have the strongest relationship with the target variable.
Unsupervised Selection: Do not use the target variable (e.g. remove redundant variables).
Supervised Selection: Use the target variable (e.g. remove irrelevant variables).
● Intrinsic: Algorithms that perform automatic feature selection during training.
● Filter: Select subsets of features based on their relationship with the target.
● Wrapper: Search subsets of features that perform according to a predictive model.
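A sketch of these three flavours in scikit-learn, with illustrative estimators and feature counts: SelectKBest as a filter, RFE as a wrapper, and Lasso as an intrinsic method (it zeroes out coefficients on its own during training).

```python
from sklearn.datasets import load_diabetes
from sklearn.feature_selection import RFE, SelectKBest, f_regression
from sklearn.linear_model import Lasso, LinearRegression

X, y = load_diabetes(return_X_y=True)

filtered = SelectKBest(score_func=f_regression, k=5).fit(X, y)        # filter
wrapped = RFE(LinearRegression(), n_features_to_select=5).fit(X, y)   # wrapper
intrinsic = Lasso(alpha=0.5).fit(X, y)                                # intrinsic
print(filtered.get_support(), wrapped.support_, intrinsic.coef_)
```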
Feature selection is also related to dimensionality reduction techniques in that both methods
seek fewer input variables to a predictive model. The difference is that feature selection selects
features to keep or remove from the dataset, whereas dimensionality reduction creates a
projection of the data resulting in entirely new input features. As such, dimensionality reduction
is an alternative to feature selection rather than a type of feature selection.
The choice of statistical measure depends on the data types of the input and output variables:
A. This is a regression predictive modeling problem with numerical input variables. The
most common techniques are to use a correlation coefficient, such as Pearson’s for a
linear correlation, or rank-based methods for a nonlinear correlation.
B. This is a classification predictive modeling problem with numerical input variables. This might be the most common example of a classification problem. Again, the most common techniques are correlation-based, although in this case they must take the categorical target into account.
C. This is a regression predictive modeling problem with categorical input variables.
Nevertheless, you can use the same Numerical Input, Categorical Output methods
(described above), but in reverse.
D. This is a classification predictive modeling problem with categorical input variables. The most common correlation measure for categorical data is the chi-squared test. You can also use mutual information (information gain) from the field of information theory.
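A sketch matching the cases above, on synthetic data with illustrative choices: a correlation-based score (f_regression) for numerical inputs with a numerical target (case A), and chi-squared or mutual information for categorical inputs with a categorical target (case D).

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, chi2, f_regression, mutual_info_classif

rng = np.random.default_rng(0)
X_num, y_num = rng.normal(size=(100, 5)), rng.normal(size=100)                    # case A
X_cat, y_cat = rng.integers(0, 3, size=(100, 5)), rng.integers(0, 2, size=100)    # case D

print(SelectKBest(f_regression, k=2).fit(X_num, y_num).scores_)   # correlation-based scores
print(SelectKBest(chi2, k=2).fit(X_cat, y_cat).scores_)           # chi-squared scores
print(mutual_info_classif(X_cat, y_cat, discrete_features=True))  # information gain
```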
Feature Importance
Feature importance refers to a class of techniques that assign scores to the input features of a predictive model, indicating the relative importance of each feature when making a prediction.
The scores can be used to better understand the data and the model, and to reduce the number of input features.
Coefficients as Feature Importance: Linear machine learning algorithms fit a model where the
prediction is the weighted sum of the input values. Examples include linear regression, logistic
regression, and extensions that add regularization, such as ridge regression, LASSO, and the
elastic net.
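A minimal sketch of coefficients as importance scores, assuming linear regression on a standard scikit-learn dataset; standardizing the inputs first makes the coefficient magnitudes comparable.

```python
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

X, y = load_diabetes(return_X_y=True)
model = LinearRegression().fit(StandardScaler().fit_transform(X), y)

for name, coef in zip(load_diabetes().feature_names, model.coef_):
    print(f"{name}: {coef:.1f}")   # coefficient magnitude acts as a relative importance score
```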
Data transforms
Data transforms change the type or distribution of data variables and may be applied to input as well as output variables.
Types of transforms:
● Discretization Transform: Encode a numeric variable as an ordinal variable
● Ordinal Transform: Encode a categorical variable into an integer variable
● One Hot Transform: Encode a categorical variable into binary variables
● Binary Hot Transform (a space-efficient adaptation of One Hot Transform)
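A small sketch of the ordinal and one-hot transforms with scikit-learn (the binary transform would need a third-party encoder, so it is only noted in a comment); the toy colour column is made up.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder
# A binary encoding would typically come from a third-party package such as
# category_encoders; it is not shown here.

colors = np.array([["red"], ["green"], ["blue"], ["green"]])
print(OrdinalEncoder().fit_transform(colors))            # ordinal transform -> one integer column
print(OneHotEncoder().fit_transform(colors).toarray())   # one hot transform -> one binary column per category
```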
A power transform will make the probability distribution of a variable more Gaussian.
We can apply a power transform directly by calculating the log or square root of the variable, although this may or may not be the best power transform for a given variable.
Instead, we can use a generalized version of the transform that finds a parameter (lambda or λ)
that best transforms a variable to a Gaussian probability distribution.
Uniform: Each bin has the same width in the span of possible values for the variable.
Quantile: Each bin has the same number of values, split based on percentiles.
Clustered: Clusters are identified and examples are assigned to each group.
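A sketch of a generalized power transform and the three binning strategies with scikit-learn; the skewed sample data and bin counts are synthetic, and the 'kmeans' strategy plays the role of clustered binning.

```python
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer, PowerTransformer

rng = np.random.default_rng(0)
x = rng.exponential(scale=2.0, size=(200, 1))     # a skewed, non-Gaussian variable

# Yeo-Johnson searches for the lambda that makes the variable most Gaussian.
x_gauss = PowerTransformer(method="yeo-johnson").fit_transform(x)

for strategy in ("uniform", "quantile", "kmeans"):    # kmeans ~ "clustered" bins
    binned = KBinsDiscretizer(n_bins=4, encode="ordinal", strategy=strategy).fit_transform(x)
    print(strategy, np.bincount(binned.astype(int).ravel()))   # examples per bin
```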
Feature Engineering
Creating new input variables from the available data in order to expose clearer trends in the input data, by adding broader context to a single observation or by decomposing a complex variable.
Some common techniques:
● Adding a Boolean flag variable for some state.
● Adding a group or global summary statistic, such as a mean.
● Adding new variables for each component of a compound variable, such as a date-time.
● Polynomial Transform: Create copies of numerical input variables that are raised to a
power
● Feature Crossing
Feature Crossing
A feature cross is a synthetic feature that encodes nonlinearity in the feature space by
multiplying two or more input features together.
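A minimal sketch of a feature cross via scikit-learn's PolynomialFeatures with interaction terms only; the two toy columns are illustrative.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

X = np.array([[2.0, 10.0], [3.0, 20.0], [5.0, 30.0]])
crossed = PolynomialFeatures(degree=2, interaction_only=True, include_bias=False)
print(crossed.fit_transform(X))   # columns: x1, x2, and the cross x1*x2
```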
Dimensionality Reduction
Dimensionality reduction techniques create a projection of the data into a lower-dimensional
space that still preserves the most important properties of the original data.
Techniques from linear algebra can be used for dimensionality reduction. Specifically,
matrix factorization methods can be used to reduce a dataset matrix into its constituent
parts.
Manifold Learning: Techniques from high-dimensional statistics are used to create a low-dimensional projection of high-dimensional data, often for the purpose of data visualization.
Many statistical models suffer from high correlation between covariates. PCA can be used to produce linear combinations of the covariates that are uncorrelated with each other.
In PCA, we simplify a dataset with many variables by turning the original variables into a
smaller number of "Principal Components".
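A short PCA sketch on the iris dataset (an assumed example); it keeps two components and checks that they are uncorrelated.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X, _ = load_iris(return_X_y=True)
X_scaled = StandardScaler().fit_transform(X)      # PCA is sensitive to feature scale

pca = PCA(n_components=2)
components = pca.fit_transform(X_scaled)
print(pca.explained_variance_ratio_)              # variance kept by each principal component
print(np.corrcoef(components, rowvar=False))      # components are (near) uncorrelated
```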
Clustering with k-means
The k-means algorithm can be sensitive to the randomly chosen initial cluster centers. Choosing the number of clusters requires a delicate balance: setting k to be very large will improve the homogeneity of the clusters, but risks overfitting the data.
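A minimal k-means sketch; k = 3 is an illustrative choice, and n_init reruns the algorithm from several random centers to reduce the sensitivity mentioned above.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris

X, _ = load_iris(return_X_y=True)
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print(km.inertia_)       # within-cluster sum of squares; drops as k grows
print(km.labels_[:10])   # cluster assignment for the first ten examples
```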
Regression
Regression is commonly used for modeling complex relationships among data elements, estimating the impact of a treatment on an outcome, and extrapolating into the future.
Examples of regression models: simple linear regression, multiple regression, logistic regression, Poisson regression.
For simple linear regression, the slope is b = Cov(x, y) / Var(x): the denominator for b is the variance of x and the numerator is the covariance of x and y.
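A tiny NumPy sketch of the slope formula on made-up data; the covariance matrix gives both the numerator and the denominator with a consistent scaling.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

cov_matrix = np.cov(x, y)            # [[Var(x), Cov(x, y)], [Cov(x, y), Var(y)]]
b = cov_matrix[0, 1] / cov_matrix[0, 0]   # slope = Cov(x, y) / Var(x)
a = y.mean() - b * x.mean()               # intercept follows from the means
print(a, b)
```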
Correlations
The correlation between two variables indicates how closely their relationship follows a straight
line. The correlation ranges between -1 and +1. The extreme values indicate a perfectly linear
relationship. A correlation close to zero indicates the absence of a linear relationship. The
following formula defines Pearson's correlation:
ρ(x, y) = Cov(x, y) / (σx · σy)
where σx and σy are the standard deviations of x and y.
Multiple Regression
Multiple regression is an extension of simple linear regression: it finds the values of the beta coefficients that minimize the prediction error of a linear equation.
Minimizing error:
How to solve for the vector β that minimizes the sum of the squared errors between the
predicted and actual y values?
It has been shown in the literature that the best estimate of the vector β can be computed as:
β̂ = (XᵀX)⁻¹ Xᵀ y
where X is the matrix of input values (with a column of 1s for the intercept) and y is the vector of target values.
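A short NumPy sketch of the normal-equation estimate on synthetic data; the linear system is solved directly instead of forming an explicit matrix inverse, for numerical stability.

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.column_stack([np.ones(50), rng.normal(size=(50, 2))])   # intercept column + 2 features
true_beta = np.array([1.0, 2.0, -3.0])
y = X @ true_beta + rng.normal(scale=0.1, size=50)

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)   # normal equation: (X^T X) beta = X^T y
print(beta_hat)                                # close to [1, 2, -3]
```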