Big Data Mining and Analytics Notes
Key Concepts
1. Random Variables and Probability Distributions
A random variable represents a data attribute whose values result from some
probabilistic process.
The probability distribution defines the likelihood of different outcomes.
Common distributions:
- Bernoulli (binary outcomes)
- Gaussian/Normal (continuous, bell-shaped)
Statistical classification models predict the class label of an instance based on estimated
probabilities.
For an instance with features X = (x1, x2, ..., xn), the goal is to compute the posterior
probability of class C:
P(C|X) = P(X|C) P(C) / P(X)
Under the naive assumption that the attributes are conditionally independent given the class,
P(X|C) = P(x1|C) × P(x2|C) × ... × P(xn|C)
Dataset:
Email ID | Contains "buy" | Contains "free" | Contains "click" | Class
1        | Yes            | No              | Yes              | Spam
2        | No             | Yes             | No               | Not Spam
3        | Yes            | Yes             | Yes              | Spam
4        | No             | No              | Yes              | Not Spam
P(Spam)=2/4=0.5
P(Not Spam)=2/4=0.5
Suppose a new email contains "buy" = Yes, "free" = No, "click" = Yes. We want to predict if it's
spam.
Compute:
P(Spam|X) ∝ P(Spam) × P(buy=Yes|Spam) × P(free=No|Spam) × P(click=Yes|Spam)
= 0.5 × 1.0 × 0.5 × 1.0 = 0.25
Similarly,
P(Not Spam|X) ∝ P(Not Spam) × P(buy=Yes|Not Spam) × P(free=No|Not Spam) × P(click=Yes|Not Spam)
= 0.5 × 0 × 0.5 × 0.5 = 0
Since 0.25 > 0, the new email is classified as Spam.
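The hand calculation above can be reproduced with a short Python sketch; the dataset literal mirrors the toy table, and no smoothing is applied, so zero counts produce a zero posterior exactly as in the manual computation.

```python
from collections import Counter, defaultdict

# Toy spam dataset from the table above: (buy, free, click, label)
data = [
    ("Yes", "No",  "Yes", "Spam"),
    ("No",  "Yes", "No",  "Not Spam"),
    ("Yes", "Yes", "Yes", "Spam"),
    ("No",  "No",  "Yes", "Not Spam"),
]
features = ["buy", "free", "click"]

# Priors: P(class)
class_counts = Counter(row[-1] for row in data)
priors = {c: n / len(data) for c, n in class_counts.items()}

# Conditionals: P(feature=value | class), estimated by counting
cond = defaultdict(lambda: defaultdict(float))
for *values, label in data:
    for feat, val in zip(features, values):
        cond[label][(feat, val)] += 1 / class_counts[label]

def classify(instance):
    """Return unnormalized posterior scores P(C) * prod_i P(x_i | C)."""
    scores = {}
    for c in priors:
        score = priors[c]
        for feat, val in zip(features, instance):
            score *= cond[c][(feat, val)]
        scores[c] = score
    return scores

# New email: buy=Yes, free=No, click=Yes
print(classify(("Yes", "No", "Yes")))   # {'Spam': 0.25, 'Not Spam': 0.0}
```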
Summary
Statistical modeling provides a probabilistic framework for data classification and
prediction.
Naive Bayes is a foundational statistical model that is widely used due to its simplicity
and surprisingly good performance.
Estimation of prior and conditional probabilities is key.
Model evaluation is necessary to ensure accuracy.
The following sections summarize Statistical Modeling as presented in Chapter 8 of Data Mining:
Concepts and Techniques by Han, Kamber, and Pei. This chapter provides an extensive exploration
of classification methods, including decision trees, Bayes classifiers, and rule-based systems,
among others.
Classification is a fundamental task in data mining that involves predicting the categorical label
of a given data instance based on its attributes. The process typically follows these steps (see
the sketch after this list):
1. Model Construction: Using a training dataset to construct a model that can classify data
instances.
2. Model Evaluation: Assessing the model's performance using a separate test dataset.
3. Model Usage: Applying the model to classify new, unseen data instances.
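A minimal end-to-end sketch of these three steps with scikit-learn; the Iris dataset and the Gaussian Naive Bayes model are stand-ins chosen for illustration, not part of the original notes.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score

# 1. Model construction: fit a classifier on the training split
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
model = GaussianNB().fit(X_train, y_train)

# 2. Model evaluation: measure accuracy on the held-out test split
print("Test accuracy:", accuracy_score(y_test, model.predict(X_test)))

# 3. Model usage: classify a new, unseen instance
new_instance = [[5.1, 3.5, 1.4, 0.2]]
print("Predicted class:", model.predict(new_instance)[0])
```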
Decision trees are a popular classification method that partition the data into subsets based on
feature values, leading to a tree-like structure. Each internal node represents a decision on an
attribute, each branch represents an outcome of the decision, and each leaf node represents a
class label.
Key Concepts:
Splitting Criteria: Measures like Information Gain and Gini Index are used to select the
best attribute to split the data (a small sketch of these measures follows this list).
Overfitting: Trees that are too deep may overfit the training data. Pruning techniques are
applied to remove unnecessary branches.
Handling Continuous Attributes: Continuous attributes are handled by selecting a
threshold value to split the data.
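A small sketch of the two splitting measures; the parent/child label counts below are hypothetical and exist only to show the computation.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy of a list of class labels (the basis of Information Gain)."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def gini(labels):
    """Gini index: probability of misclassifying a randomly drawn label."""
    n = len(labels)
    return 1 - sum((c / n) ** 2 for c in Counter(labels).values())

def information_gain(parent, children):
    """Entropy reduction obtained by splitting `parent` into `children` subsets."""
    n = len(parent)
    return entropy(parent) - sum(len(ch) / n * entropy(ch) for ch in children)

# Hypothetical split of 10 instances on some candidate attribute
parent = ["Yes"] * 6 + ["No"] * 4
children = [["Yes"] * 5 + ["No"], ["Yes"] + ["No"] * 3]
print("Information gain:", round(information_gain(parent, children), 3))
print("Gini of parent:", round(gini(parent), 3))
```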
Example:
Consider a dataset with attributes like Age, Income, and Student Status, and a target variable
"Buys Computer". A decision tree might first split on "Student Status", then on "Income", and
finally on "Age", leading to a classification of "Yes" or "No" for each instance.
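A minimal sketch of inducing such a tree with scikit-learn; the tiny "Buys Computer"-style dataset is invented for illustration, with the categorical Student attribute encoded numerically.

```python
from sklearn.tree import DecisionTreeClassifier, export_text

# Hypothetical data: [Age, Income (in K), Student (1=yes, 0=no)]
X = [[25, 30, 1], [35, 60, 0], [45, 80, 0], [22, 20, 1], [50, 90, 0], [30, 40, 1]]
y = ["Yes", "No", "Yes", "Yes", "No", "Yes"]   # target: Buys Computer

tree = DecisionTreeClassifier(criterion="entropy", max_depth=3).fit(X, y)

# Inspect the learned splits: each internal node tests one attribute against a threshold
print(export_text(tree, feature_names=["Age", "Income", "Student"]))

# Classify a new instance: a 28-year-old student earning 35K
print(tree.predict([[28, 35, 1]]))
```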
Bayesian classifiers apply Bayes' Theorem to predict the probability of each class given the
attributes of a data instance. The Naive Bayes classifier:
- assumes that all attributes are conditionally independent given the class label;
- computes the posterior probability for each class and assigns the class with the highest
probability.
Example:
Given a dataset of emails labeled as "Spam" or "Not Spam", attributes might include the
presence of words like "buy", "free", and "click". The Naive Bayes classifier calculates the
probability of each class based on the frequency of these words in the emails.
Example (continuous attributes):
For a dataset with attributes like Age and Income, the Gaussian Naive Bayes classifier would
model the distribution of these attributes for each class and use them to compute the likelihood of
a new instance belonging to each class.
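A minimal sketch of this idea, assuming a small hypothetical dataset with Age and Income: per-class means and standard deviations are estimated, and the normal density serves as the class-conditional likelihood.

```python
from statistics import mean, stdev
from math import exp, pi, sqrt

def gaussian_pdf(x, mu, sigma):
    """Normal density used as the class-conditional likelihood P(x | C)."""
    return exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * sqrt(2 * pi))

# Hypothetical training data: (Age, Income in K) per class
buys = [(25, 40), (30, 55), (35, 60)]
not_buys = [(45, 30), (50, 35), (55, 25)]

def class_likelihood(instance, rows):
    """Product of per-attribute Gaussian likelihoods (naive independence assumption)."""
    score = 1.0
    for i, x in enumerate(instance):
        values = [r[i] for r in rows]
        score *= gaussian_pdf(x, mean(values), stdev(values))
    return score

# Priors are equal here (3 instances per class), so comparing likelihoods suffices
new = (28, 50)   # new instance: Age=28, Income=50K
for name, rows in [("Buys", buys), ("Not Buys", not_buys)]:
    print(name, class_likelihood(new, rows))
```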
Rule-based classifiers use a collection of IF-THEN rules to assign class labels to instances.
Key Concepts:
Rule Generation: Rules are generated from the training data using algorithms like
RIPPER or CN2.
Rule Pruning: Irrelevant or redundant rules are removed to improve model performance.
Rule Evaluation: Rules are evaluated based on metrics like coverage and accuracy.
Example:
A rule might state: "IF Age > 30 AND Income > 50K THEN Buys Computer = Yes". The
classifier applies these rules to classify new instances.
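A minimal sketch of applying such IF-THEN rules; the rules and the default class below are hypothetical, whereas a real learner such as RIPPER or CN2 would induce them from training data.

```python
# Each rule: (condition on an instance dict, predicted class), evaluated in order
rules = [
    (lambda r: r["Age"] > 30 and r["Income"] > 50, "Yes"),
    (lambda r: r["Student"] == "No" and r["Income"] <= 30, "No"),
]
DEFAULT_CLASS = "No"   # used when no rule covers the instance

def classify(instance):
    """Fire the first rule whose condition covers the instance (ordered rule list)."""
    for condition, label in rules:
        if condition(instance):
            return label
    return DEFAULT_CLASS

print(classify({"Age": 42, "Income": 65, "Student": "No"}))   # covered by rule 1 -> "Yes"
print(classify({"Age": 23, "Income": 20, "Student": "Yes"}))  # no rule fires -> default class
```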
Evaluation Metrics:
Common metrics include accuracy, error rate, precision, recall, and F-measure, computed on data
held out from training.
Cross-Validation:
k-Fold Cross-Validation: The dataset is divided into k subsets. The model is trained on
k-1 subsets and tested on the remaining subset. This process is repeated k times, and the
average performance is computed (see the sketch after this list).
Leave-One-Out Cross-Validation (LOOCV): A special case of k-fold cross-validation where
k equals the number of data instances.
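A minimal cross-validation sketch with scikit-learn; the Iris dataset and the decision tree model are stand-ins for illustration, and LOOCV is obtained by using one fold per instance.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score, LeaveOneOut
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
model = DecisionTreeClassifier(random_state=0)

# k-fold CV: train on k-1 folds, test on the remaining fold, repeat k times, average the scores
scores = cross_val_score(model, X, y, cv=10)
print("10-fold mean accuracy:", scores.mean())

# LOOCV: the special case where k equals the number of instances
loo_scores = cross_val_score(model, X, y, cv=LeaveOneOut())
print("LOOCV mean accuracy:", loo_scores.mean())
```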