12/10/23(Thursday)
Big Data
L-1
1. What is big data?
[Venn diagram: the matured domain of statistics overlapping the young domain of computer science]
Big Data is the combination of the very mature domain of statistics with the relatively
young domain of computer science. As such, it builds upon the collective knowledge of
mathematics, statistics and data analysis techniques in general.
L-2
2. Data operations
Statistical operations:
- mean,
- min,
- max,
- probability distribution, and
- regression.
Machine learning operations:
- linear regression,
- logistic regression,
- classification, and
- clustering.
Universal data processing operations are as follows:
* Data cleaning: This operation cleans massive datasets.
* Data exploration: This operation explores all the possible values of the dataset.
* Data analysis: This operation performs descriptive and predictive analytics on the data.
* Data visualization: This is the visualization of the analysis output.
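The four operations above can be sketched in R on a toy dataset (the data frame and its column names here are made up for illustration, not taken from the notes):

```r
# Toy dataset with a missing value (hypothetical data)
d <- data.frame(height = c(151, 174, NA, 186, 128),
                weight = c(63, 81, 56, 91, 47))

# Data cleaning: drop rows with missing values
clean <- na.omit(d)

# Data exploration: inspect structure and summary statistics
str(clean)
summary(clean)

# Data analysis: a descriptive statistic and a simple predictive model
mean(clean$weight)
fit <- lm(weight ~ height, data = clean)

# Data visualization: plot the data with the fitted regression line
plot(clean$height, clean$weight)
abline(fit)
```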
3. Data analytics project cycle
1. Identifying the problem
2. Designing data requirement
3. Preprocessing data
4. Performing analytics over data

4. Exploring web pages categorization
Web pages: Home Page, Services, Support, Products, Contact Us, About Us

Popularity categorization:
High   - Home Page
Medium - Services, Products
Low    - Support, Contact Us, About Us
L-4
1. Statistics
Population
In statistics, population is the entire set of items from which you draw data for
a statistical study. It can be a group of individuals, a set of items, etc. It
makes up the data pool for a study.
Sample
A sample is defined as a smaller and more manageable representation of a
larger group: a subset of a larger population that contains the characteristics of
that population. A sample is used in statistical testing when the population size is too large for all members or observations to be included in the test.
The sample is an unbiased subset of the population that best represents the
whole data.
Parameter
A population parameter is a numerical value that describes a characteristic of a
population, such as the mean or standard deviation. It is usually unknown and is
estimated from sample data. For example, the population mean height of all students
in a school is a population parameter.
Statistic
A sample statistic, on the other hand, is a numerical value that describes a
characteristic of a sample, such as the sample mean or sample standard deviation. It
is calculated from sample data and used to make inferences about the population.
For example, the sample mean height of a group of randomly selected students is a
sample statistic.
Difference between parameter and statistic
The key difference between a population parameter and a sample statistic is that the
former describes the entire population, while the latter describes only a sample from
the population. In general, population parameters are more precise and accurate, as
they are calculated using all available data. However, they are usually unknown and
can only be estimated from sample data, which is where sample statistics come into
play.
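The parameter/statistic distinction can be illustrated in R. The population here is simulated, so its true mean is known by construction; the numbers (population size, mean, sd, sample size) are arbitrary choices for the sketch:

```r
set.seed(1)
# Simulated population of 100,000 heights; its mean is a population parameter
population <- rnorm(100000, mean = 165, sd = 10)
mean(population)   # the parameter (normally unknown in practice)

# A random sample of 50; its mean is a sample statistic
s <- sample(population, 50)
mean(s)            # the statistic, which varies from sample to sample
```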
Random Variable
A random variable is a variable whose value is unknown or a function that assigns
values to each of an experiment's outcomes. The use of random variables is most
common in probability and statistics, where they are used to quantify outcomes of
random occurrences.
Probability Distribution
A probability distribution is an idealized frequency distribution. A frequency
distribution describes a specific sample or dataset. It's the number of times each
possible value of a variable occurs in the dataset. The number of times a value occurs
in a sample is determined by its probability of occurrence.
2. Statistics (function)

Example - Normal distribution
Normal distribution, also known as the Gaussian distribution, is a probability
distribution that is symmetric about the mean, showing that data near the
mean are more frequent in occurrence than data far from the mean. In
graphical form, the normal distribution appears as a "bell curve".
f(x) = (1 / (σ√(2π))) e^(−(1/2)((x − μ)/σ)²)

f(x) = probability density function
σ = standard deviation
μ = mean
DRAWING SAMPLES FROM PROBABILITY
DISTRIBUTION
Uniform random numbers
Syntax: runif(n, min=0, max=1)
> runif(1)
> runif(10)
Normally distributed random numbers
Syntax: rnorm(n, mean = 0, sd = 1)
> rnorm(1)
> rnorm(15, mean = 5.4, sd = 0.5)
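Drawing a large sample with rnorm and plotting its histogram shows the bell curve described above. The sample size, mean, and sd below are arbitrary (the mean and sd match the rnorm example in the notes):

```r
set.seed(42)
# Draw 1,000 normally distributed values with mean 5.4 and sd 0.5
x <- rnorm(1000, mean = 5.4, sd = 0.5)

# Histogram of the sample approximates the bell-shaped density
hist(x, probability = TRUE, main = "Normal sample")

# Overlay the theoretical density f(x) for comparison
curve(dnorm(x, mean = 5.4, sd = 0.5), add = TRUE)
```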
L-5
1. Machine learning (types of machine learning)

Machine learning is a branch of artificial intelligence (AI) and computer
science which focuses on the use of data and algorithms to imitate the way
that humans learn, gradually improving its accuracy.
Types of Machine Learning
- Supervised Learning
- Unsupervised Learning
- Semi-supervised Learning
- Active Learning
- Reinforcement Learning
2. Supervised learning regression (lm function)

- Supervised learning, also known as supervised machine learning, is a subcategory of machine learning and artificial intelligence.
- It is defined by its use of labeled datasets to train algorithms that classify data or predict outcomes accurately.
- As input data is fed into the model, it adjusts its weights until the model has been fitted appropriately, which occurs as part of the cross-validation process.
# Values of height
x <- c(151, 174, 138, 186, 128, 136, 179, 163, 152, 131)
# Values of weight
y <- c(63, 81, 56, 91, 47, 57, 76, 72, 62, 48)
lm() Function
This function creates the relationship model between the predictor and the
response variable.
relation <- lm(y ~ x)
# Find the weight of a person with height 170.
a <- data.frame(x = 170)
result <- predict(relation, a)
print(result)
3. Active learning
- Active learning is a special case of machine learning in which a learning algorithm can interactively query a user (or some other information source) to label new data points with the desired outputs.
- In the statistics literature, it is sometimes also called optimal experimental design.
- The information source is called the teacher or oracle.
L-6
1. Multiple linear regression
- Multiple independent variables have an impact on a single dependent variable:

Y = β0 + β1X1 + β2X2 + ... + βnXn

Example
Independent variables:
X1 - transaction date
X2 - house age
X3 - distance to the nearest MRT station
X4 - number of convenience stores
X5 - latitude
X6 - longitude
Dependent variable:
Y - house price of unit area
EXAMPLE WITH R
library(ggplot2)
library(caret)
houseData = read.csv(file="RealEstate.csv", header=TRUE, sep=",")
str(houseData)
colnames(houseData)[1] <- "SL"
colnames(houseData)[2] <- "date"
colnames(houseData)[3] <- "age"
colnames(houseData)[4] <- "dist"
colnames(houseData)[5] <- "store"
colnames(houseData)[6] <- "lat"
colnames(houseData)[7] <- "long"
colnames(houseData)[8] <- "price"
hD = subset(houseData, select = -c(SL, date, lat, long))
print(colnames(hD))
fit0 = lm(price~age, data=hD)
fit1 = lm(price~dist, data=hD)
fit2 = lm(price~store, data=hD)
fit = lm(price~., data=hD)
fit$coeff
residuals(fit)
hist(residuals(fit))
PREDICTION
d <- data.frame(age = 0, dist=3, store=15)
result <- predict(fit,d)
print(result)
print(colnames(hD))
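Since caret is loaded in the example above, the fitted model can also be evaluated on held-out data. This is a sketch, not part of the original notes: the data frame below is a hypothetical stand-in for hD (the coefficients and split proportion are arbitrary), and createDataPartition is caret's stratified-split helper.

```r
library(caret)

set.seed(7)
# Hypothetical stand-in for hD, with the same column names as above
hD <- data.frame(age = runif(100, 0, 40),
                 dist = runif(100, 20, 6500),
                 store = sample(0:10, 100, replace = TRUE))
hD$price <- 45 - 0.2 * hD$age - 0.005 * hD$dist + 1.2 * hD$store + rnorm(100)

# Hold out 20% of the rows as a test set
idx <- createDataPartition(hD$price, p = 0.8, list = FALSE)
train <- hD[idx, ]
test <- hD[-idx, ]

fit <- lm(price ~ ., data = train)
pred <- predict(fit, test)

# Root-mean-squared error on the held-out data
sqrt(mean((test$price - pred)^2))
```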
2. Decision tree
- A decision tree is a machine learning method for classification and prediction and for facilitating decision-making in sequential decision problems.
- Supervised learning
- Three types of Decision Trees:
1. Decision Trees to recommend a course of action based on a sequence of information nodes,
2. Classification and Regression Trees, and
3. Survival Trees.
3. Example of decision tree
Suppose we want to build a decision tree to predict whether a person is likely to buy a new car based on their demographic and behavior data.
- The decision tree starts with the root node, which represents the entire dataset.
- The root node splits the dataset based on the "income" attribute. If the person's income is less than or equal to $50,000, the decision tree follows the left branch, and if the income is greater than $50,000, the decision tree follows the right branch.
- The left branch leads to a node that represents the "age" attribute. If the person's age is less than or equal to 30, the decision tree follows the left branch, and if the age is greater than 30, the decision tree follows the right branch.
- The right branch leads to a leaf node that predicts that the person is unlikely to buy a new car.
EXAMPLE DECISION TREE
[Decision tree diagram: root split on Income <= $50,000]

Let us learn from an example - Dataset: Golf playing decision
Day  Outlook   Temp  Humidity  Wind    Decision
1    Sunny     Hot   High      Weak    No
2    Sunny     Hot   High      Strong  No
3    Overcast  Hot   High      Weak    Yes
4    Rain      Mild  High      Weak    Yes
5    Rain      Cool  Normal    Weak    Yes
6    Rain      Cool  Normal    Strong  No
7    Overcast  Cool  Normal    Strong  Yes
8    Sunny     Mild  High      Weak    No
9    Sunny     Cool  Normal    Weak    Yes
10   Rain      Mild  Normal    Weak    Yes
11   Sunny     Mild  Normal    Strong  Yes
12   Overcast  Mild  High      Strong  Yes
13   Overcast  Hot   Normal    Weak    Yes
14   Rain      Mild  High      Strong  No
EXAMPLE
Outlook: Sunny = 2, Overcast = 1, Rain = 0
Temp: Hot = 2, Mild = 1, Cool = 0
Humidity: High = 2, Normal = 0
Wind: Strong = 2, Weak = 0
Decision: Yes = 1, No = 0
golfData =read.csv(file="golfCSV.csv", header=TRUE, sep=",")
gD = subset(golfData, select = -c(Day))
4. Decision tree in R (syntax)
Package 'rpart' - Recursive Partitioning and Regression Trees
rpart is a powerful machine learning library in R that is used for building
classification and regression trees. This library implements recursive partitioning and is
very easy to use.
The R function rpart is an implementation of the CART (Classification and Regression
Tree) supervised machine learning algorithm used to generate a decision tree.
Function rpart
Syntax:
rpart(formula, data, weights, subset, na.action = na.rpart, method, model = FALSE,
x = FALSE, y = TRUE, parms, control, cost, ...)
5. Rpart example
library(rpart)
golfData =read.csv(file="golfCSV.csv", header=TRUE, sep=",")
gD = subset(golfData, select = -c(Day))
fit = rpart(Decision~Outlook+Temp+Humidity+Wind, gD, method="class",
minsplit = 2, minbucket = 1)
d <- data.frame(Outlook=1, Temp=0 , Humidity=0, Wind = 0)
result <- predict(fit, d)
print(result)
minsplit is the minimum number of observations that must exist in a node in
order for a split to be attempted
minbucket is the minimum number of observations in any terminal node.
library(rpart.plot)
rpart.plot(fit)
L-9
1. DFS (Distributed File System)
This is an area of active research interest today.
Clients should view a DFS the same way they would a centralized
FS; the distribution is hidden at a lower level.
A DFS provides high-throughput data access and fault tolerance.
2. Hadoop stack
Hadoop Stack
Hadoop Internal Software Architecture
[Diagram: Hadoop ecosystem layers - languages and tools on top, row/column data and file data stores, distributed data processing underneath]