Data Analytics Notes

The document provides an overview of descriptive statistics, including measures of central tendency (mean, mode, median) and measures of variability (range, variance, standard deviation). It also explains probability distributions, random variables, and their types (discrete and continuous), along with formulas for calculating probabilities and expectations. Examples illustrate the application of these concepts in analyzing data sets and understanding statistical behavior.

Unit-1

UNIT-I: DESCRIPTIVE STATISTICS: Probability Distributions, Inferential Statistics, Inferential Statistics through Hypothesis Tests, Regression & ANOVA (Analysis of Variance)

Descriptive Statistics:-
Whenever we deal with data, whether it is a small set or stored in huge databases, statistics is the key that helps us analyze it and draw insightful conclusions about the whole data set without going through each individual data point. In this unit, we will learn about Descriptive Statistics and how we can actually use it as a tool to explore the data we have.

What are Descriptive Statistics?


In descriptive statistics, we describe our data with the help of various representative methods such as charts, graphs, tables, Excel files, etc. We summarize the data and present it in a meaningful way so that it can be easily understood. Most of the time it is performed on small data sets, and this analysis also helps us predict future trends based on the current findings. The measures used to describe a data set are measures of central tendency and measures of variability or dispersion.

Types of Descriptive Statistics



 Measures of Central Tendency
 Measure of Variability
 Measures of Frequency Distribution

Measures of Central Tendency


It represents the whole set of data by a single value. It gives us the location of
the central points. There are three main measures of central tendency:
 Mean
 Mode
 Median

Mean
It is the sum of observations divided by the total number of observations. It is
also defined as average which is the sum divided by count.

x̄ = Σx / n
where,
 x = Observations
 n = number of terms

Mode
It is the value that has the highest frequency in the given data set. The data set may have
no mode if the frequency of all data points is the same. Also, we can have more than one
mode if we encounter two or more data points having the same frequency.

Median
It is the middle value of the data set. It splits the data into two halves. If the number of
elements in the data set is odd then the center element is the median and if it is even
then the median would be the average of two central elements.
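As a small illustration (not part of the original notes), the three measures of central tendency can be computed in Python with the built-in statistics module; the sample list below is a made-up example.

import statistics

data = [2, 3, 3, 5, 7, 10]           # hypothetical sample observations

mean = statistics.mean(data)         # sum of observations / number of observations
median = statistics.median(data)     # middle value (average of the two middle values here)
mode = statistics.mode(data)         # value with the highest frequency

print(mean, median, mode)            # 5.0 4.0 3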



Measure of Variability
Measures of variability are also termed measures of dispersion, as they help us gain insight into the spread of the observations at hand. Some of the measures used to quantify the dispersion in the observations are as follows:

 Range
 Variance
 Standard deviation

Range
The range describes the difference between the largest and smallest data point in our
data set. The bigger the range, the more the spread of data and vice versa.
Range = Largest data value – smallest data value

Variance
It is defined as the average squared deviation from the mean. It is calculated by finding the difference between every data point and the mean, squaring these differences, adding them all up, and then dividing by the number of data points in the data set.
σ² = Σ(x − μ)² / N
(For a sample rather than the whole population, the sum is divided by N − 1 instead of N.)
where,
 x -> observation under consideration
 N -> number of terms
 μ -> mean

Standard Deviation
It is defined as the square root of the variance. It is calculated by finding the mean, subtracting the mean from each observation and squaring the result, adding all these squared deviations, dividing by the number of terms, and finally taking the square root.
σ = √( Σ(x − μ)² / N )
where,
 x = observation under consideration
 N = number of terms
 μ = mean
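A short Python sketch of the range, population variance, and standard deviation formulas above (the sample data is made up for illustration):

import math

data = [2, 3, 3, 5, 7, 10]                       # hypothetical observations
N = len(data)
mu = sum(data) / N                               # mean

data_range = max(data) - min(data)               # largest value - smallest value
variance = sum((x - mu) ** 2 for x in data) / N  # population variance, sigma^2
std_dev = math.sqrt(variance)                    # sigma = square root of variance

print(data_range, variance, round(std_dev, 3))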



Descriptive Statistics Examples

Example 1:

Exam Scores Suppose you have the following scores of 20 students on an exam:

85, 90, 75, 92, 88, 79, 83, 95, 87, 91, 78, 86, 89, 94, 82, 80, 84, 93, 88, 81

To calculate descriptive statistics:

Mean: Add up all the scores and divide by the number of scores. Mean = (85 + 90 + 75 + 92 + 88 + 79 + 83 + 95 + 87 + 91 + 78 + 86 + 89 + 94 + 82 + 80 + 84 + 93 + 88 + 81) / 20 = 1720 / 20 = 86

Median: Arrange the scores in ascending order and find the middle value. With 20 scores, the median is the average of the 10th and 11th values. Median = (86 + 87) / 2 = 86.5

Mode: Identify the score(s) that appear(s) most frequently. Mode = 88

Range: Calculate the difference between the highest and lowest scores. Range = 95 - 75 = 20

Variance: Calculate the average of the squared differences from the mean. Variance = [(85-86)^2 + (90-86)^2 + ... + (81-86)^2] / 20 = 614 / 20 = 30.7

Standard Deviation: Take the square root of the variance. Standard Deviation = √30.7 ≈ 5.54

Probability Distribution

A probability distribution is an idealized frequency distribution. In statistics, a


frequency distribution represents the number of occurrences of different outcomes in a
dataset. It shows how often each different value appears within a dataset.
Probability distribution represents an abstract representation of the frequency
distribution. While a frequency distribution pertains to a particular sample or dataset,
detailing how often each potential value of a variable appears within it, the occurrence
of each value in the sample is dictated by its probability.



A probability distribution not only shows the frequencies of different outcomes but also assigns probabilities to each outcome. These probabilities indicate the likelihood of each outcome occurring.

What is Probability Distribution?
In Probability Distribution, A Random Variable’s outcome is uncertain. Here, the
outcome’s observation is known as Realization. It is a Function that maps Sample
Space into a Real number space, known as State Space. They can be Discrete or
Continuous.
Probability Distribution Definition
The probability Distribution of a Random Variable (X) shows how the Probabilities of
the events are distributed over different values of the Random Variable. When all
values of a Random Variable are aligned on a graph, the values of its probabilities
generate a shape. The Probability distribution has several properties (for example:
Expected value and Variance) that can be measured. It should be kept in mind that the
Probability of a favorable outcome is always greater than zero and the sum of all the
probabilities of all the events is equal to 1.
Probability Distribution is basically the set of all possible outcomes of any random
experiment or event.

Random Variables
Random Variable is a real-valued function whose domain is the sample space of the
random experiment. It is represented as X(sample space) = Real number.
We need to learn the concept of random variables because sometimes we are not only interested in the probability of an event but also in the number of outcomes associated with the random experiment. The importance of random variables can be better understood by the following example:
Why do we need Random Variables?
Let’s take an example of the coin flips. We’ll start with flipping a coin and finding out
the probability. We’ll use H for ‘heads’ and T for ‘tails’.
So now we flip our coin 3 times, and we want to answer some questions.
1. What is the probability of getting exactly 3 heads?
2. What is the probability of getting less than 3 heads?
3. What is the probability of getting more than 1 head?
Then our general way of writing would be:
1. P(Probability of getting exactly 3 heads when we flip a coin 3 times)



2. P(Probability of getting less than 3 heads when we flip a coin 3 times)
3. P(Probability of getting more than 1 head when we flip a coin 3 times)

In a different scenario, suppose we are tossing two dice, and we are interested in
knowing the probability of getting two numbers such that their sum is 6.
So, in both of these cases, we first need to know the number of times the desired event
is obtained i.e. Random Variable X in sample space which would be then further used
to compute the Probability P(X) of the event. Hence, Random Variables come to our
rescue. First, let’s define what is random variable mathematically.

Random Variable Definition


Random Variable is a function that associates a real number with an event. It means
assigning a value (real number) to every possible outcome. In more mathematical
terms, it is a function from the sample space Ω to the real numbers. We can choose our
random variable according to our needs.

A random variable is a real valued function whose domain is the sample space of a
random experiment.



To understand this concept in a lucid manner, let us consider the experiment of tossing
a coin two times in succession.
The sample space of the experiment is S = {HH, HT, TH, TT}. Let’s define a random
variable to count events of head or tails according to our need, let X is a random
variable that denotes the number of heads obtained. For each outcome, its values are as
given below:
X(HH) = 2, X (HT) = 1, X (TH) = 1, X (TT) = 0.
More than one random variable can be defined in the same sample space. For example,
let Y is a random variable denoting the number of heads minus the number of tails for
each outcome of the above sample space S.

Y(HH) = 2-0 = 2; Y (HT) = 1-1 = 0; Y (TH) = 1-1= 0; Y (TT) = 0-2 = – 2


Thus, X and Y are two different random variables defined on the same sample space.
Note: More than one outcome can map to the same value of the random variable.

Types of Random Variables in Probability Distribution


There are two types of random variables:
 Discrete Random Variables
 Continuous Random Variables

1) Discrete Random Variables in Probability Distribution

A Discrete Random Variable can only take a finite number of values. To further
understand this, let’s see some examples of discrete random variables:
1. X = {sum of the outcomes when two dice are rolled}. Here, X can only take
values like {2, 3, 4, 5, 6….10, 11, 12}.
2. X = {Number of Heads in 100 coin tosses}. Here, X can take only integer values
from [0,100].

2) Continuous Random Variable in Probability Distribution

A Continuous Random Variable can take infinite values in a continuous domain. Let’s
see an example of a dart game.
Suppose, we have a dart game in which we throw a dart where the dart can fall
anywhere between [-1,1] on the x-axis. So if we define our random variable as the x-
coordinate of the position of the dart, X can take any value from [-1,1]. There are
infinitely many possible values that X can take (for example, X = 0.1, 0.001, 0.01,
0.2112121, and so on).

Probability Distribution of a Random Variable


Now the question comes, how to describe the behavior of a random variable?



Suppose that our random variable only takes finitely many values x1, x2, x3, …, xn, i.e. the range of X is the set of n values {x1, x2, x3, …, xn}.
Thus, the behavior of X is completely described by giving the probabilities of all the values of the random variable X:

Event        Probability
x1           P(X = x1)
x2           P(X = x2)
x3           P(X = x3)
…            …
xn           P(X = xn)

The probability function of a discrete random variable X is the function p(x) satisfying
p(x) = P(X = x)



Let’s look at an example:
Example: We draw two cards successively with replacement from a well-shuffled
deck of 52 cards. Find the probability distribution of finding aces.
Answer:
Let’s define a random variable “X”, which means the number of aces. Since we are
only drawing two cards from the deck, X can only take three values: 0, 1 and 2. We
also know that we are drawing cards with replacement, which means that the two
draws can be considered independent experiments.
P(X = 0) = P(both cards are non-aces)
= P(non-ace) x P(non-ace)
= (48/52) x (48/52) = 144/169
P(X = 1) = P(one of the cards is an ace)
= P(non-ace and then ace) + P(ace and then non-ace)
= P(non-ace) x P(ace) + P(ace) x P(non-ace)
= (48/52) x (4/52) + (4/52) x (48/52) = 24/169
P(X = 2) = P(both the cards are aces)
= P(ace) x P(ace)
= (4/52) x (4/52) = 1/169
Now we have probabilities for each value of the random variable. Since it is discrete, we
can make a table to represent this distribution. The table is given below.

X          0          1          2
P(X = x)   144/169    24/169     1/169

It should be noted here that each value of P(X=x) is greater than zero and the sum of
all P(X=x) is equal to 1.
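The same distribution can be checked with a few lines of Python (a sketch using exact fractions; the numbers follow directly from the example above):

from fractions import Fraction

p_ace = Fraction(4, 52)                  # probability of drawing an ace
p_non = Fraction(48, 52)                 # probability of drawing a non-ace

p0 = p_non * p_non                       # P(X = 0): both cards are non-aces
p1 = p_non * p_ace + p_ace * p_non       # P(X = 1): exactly one ace
p2 = p_ace * p_ace                       # P(X = 2): both cards are aces

print(p0, p1, p2)                        # 144/169 24/169 1/169
print(p0 + p1 + p2)                      # 1, as required for a probability distribution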

Probability Distribution Formulas



The various formulas under Probability Distribution are tabulated below:

Type of Distribution                 Formula
Binomial Distribution                P(X = x) = nCx · a^x · b^(n-x), where a = probability of success, b = probability of failure, n = number of trials, x = random variable denoting success
Cumulative Distribution Function     F_X(x) = ∫ f(t) dt, integrated from −∞ to x
Discrete Probability Distribution    P(x) = n!/(r!(n-r)!) · p^r (1-p)^(n-r) = C(n, r) · p^r (1-p)^(n-r)
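As an illustration of the binomial formula in the table, here is a minimal Python sketch (using math.comb, available in Python 3.8+; the values of n, p and x are assumptions chosen for the example):

from math import comb

def binomial_pmf(x, n, p):
    # P(X = x) = C(n, x) * p**x * (1 - p)**(n - x)
    return comb(n, x) * p ** x * (1 - p) ** (n - x)

# Example: probability of exactly 3 successes in 5 trials with p = 0.5
print(binomial_pmf(3, 5, 0.5))                            # 0.3125

# The probabilities over all x sum to 1
print(sum(binomial_pmf(x, 5, 0.5) for x in range(6)))     # 1.0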

Expectation (Mean) and Variance of a Random Variable


Suppose we have a probability experiment we are performing, and we have defined
some random variable (R.V.) according to our needs (like we did in some previous
examples). Now, each time the experiment is performed, our R.V. takes on a different
value. But we want to know: if we keep on doing the experiment a thousand times
or an infinite number of times, what will be the average value of the random variable?
Expectation
The mean, expected value, or expectation of a random variable X is written as E(X)
or μ. If we observe N random values of X, then the mean of the N values will be
approximately equal to E(X) for large N.
For a random variable X which takes on values x1, x2, x3, …, xn with probabilities
p1, p2, p3, …, pn, the expectation of X is defined as:

E(X) = x1·p1 + x2·p2 + … + xn·pn = Σ xi·pi

i.e. it is the weighted average of all values which X can take, weighted by the probability of
each value.
Two random variables can have almost the same mean and yet behave very differently:
one distribution may be concentrated near a single value, while the other is very spread
out. To fully describe the properties/behavior of a random variable, we therefore also need
a metric to measure the dispersion of the probability distribution.



Variance
In statistics, we have studied that the variance is a measure of the spread or scatter in
the data. Likewise, the variability or spread in the values of a random variable may be
measured by variance.
For a random variable X which takes on values x1, x2, x3, …, xn with probabilities
p1, p2, p3, …, pn, and whose expectation is E[X], the variance of X, denoted Var(X), is:

Var(X) = E[(X − E[X])²] = E[X²] − (E[X])²
Let’s calculate the mean and variance of a random variable probability
distribution through an example:
Example: Find the variance and mean of the number obtained on a throw of an
unbiased die.
Answer:
We know that the sample space of this experiment is {1,2,3,4,5,6}
Let’s define our random variable X, which represents the number obtained on a
throw.
So, the probabilities of the values which our random variable can take are,
P(1) = P(2) = P(3) = P(4) = P(5) = P(6) = 1/6
Therefore, the probability distribution of the random variable is,

X               1     2     3     4     5     6
Probabilities   1/6   1/6   1/6   1/6   1/6   1/6

E[X] = (1 + 2 + 3 + 4 + 5 + 6) × 1/6 = 21/6 = 7/2
Also, E[X²] = (1 + 4 + 9 + 16 + 25 + 36) × 1/6 = 91/6
Thus, Var(X) = E[X²] – (E[X])²
= 91/6 − (7/2)² = 91/6 − 49/4 = 35/12
So, therefore, the mean is 7/2 = 3.5 and the variance is 35/12 ≈ 2.92.
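The same mean and variance can be verified with a few lines of Python (using exact fractions; a sketch only):

from fractions import Fraction

values = [1, 2, 3, 4, 5, 6]
p = Fraction(1, 6)                        # each face is equally likely

E_X  = sum(x * p for x in values)         # expectation E[X]
E_X2 = sum(x * x * p for x in values)     # E[X^2]
Var  = E_X2 - E_X ** 2                    # Var(X) = E[X^2] - (E[X])^2

print(E_X, E_X2, Var)                     # 7/2 91/6 35/12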
**********************************************************************



Different Types of Probability Distributions
We have seen what Probability Distributions are, now we will see different types of
Probability Distributions. The Probability Distribution’s type is determined by the type
of random variable. There are two types of Probability Distributions:
 Discrete Probability Distributions for discrete variables
 Cumulative Probability Distribution for continuous variables
We will study two types of discrete probability distributions in detail; the others are
beyond the scope of these notes.

Discrete Probability Distributions


Discrete probability distributions (such as the Binomial Distribution) assume a discrete
set of values. For example, coin tosses and counts of events are described by discrete functions.
These are discrete distributions because there are no in-between values. We can either
have heads or tails in a coin toss.
For discrete probability distribution functions, each possible value has a non-zero
probability. Moreover, the sum of all the values of probabilities must be one. For
example, the probability of rolling a specific number on a die is 1/6. The total
probability for all six values equals one. When we roll a die, we only get either one of
these values.
Bernoulli Trials and Binomial Distributions
When we perform a random experiment either we get the desired event or we don’t. If
we get the desired event then we call it a success and if we don’t it is a failure. Let’s
say in the coin-tossing experiment if the occurrence of the head is considered a
success, then the occurrence of the tail is a failure.
Each time we toss a coin or roll a die or perform any other experiment, we call it a
trial. Now we know that in our coin-tossing experiment, the outcome of any trial
is independent of the outcome of any other trial. In each of such trials, the probability
of success or failure remains constant. Such independent trials that have only two
outcomes, usually referred to as ‘success’ or ‘failure’, are called Bernoulli Trials.
Definition:
Trials of the random experiment are known as Bernoulli Trials, if they are satisfying
below given conditions :
 Finite number of trials are required.
 All trials must be independent.
 Every trial has two outcomes : success or failure.
 Probability of success remains same in every trial.
Let’s take the example of an experiment in which we throw a die; throwing a die 50
times can be considered as a case of 50 Bernoulli trials, where the result of each trial is
either success(let’s assume that getting an even number is a success) or



failure( similarly, getting an odd number is failure) and the probability of success (p) is
the same for all 50 throws. Obviously, the successive throws of the die are independent
trials. If the die is fair and has six numbers 1 to 6 written on six faces, then p = 1/2 is
the probability of success, and q = 1 – p =1/2 is the probability of failure.

Example: An urn contains 8 red balls and 10 black balls. We draw six balls from the
urn successively. You have to tell whether or not the trials of drawing balls are
Bernoulli trials when after each draw, the ball drawn is:
1. replaced
2. not replaced in the urn.
Answer:
1. We know that the number of trials are finite. When drawing is done with
replacement, probability of success (say, red ball) is p =8/18 which will be same
for all of the six trials. So, drawing of balls with replacements are Bernoulli trials.
2. If drawing is done without replacement, the probability of success (i.e., red ball) in
the first trial is 8/18, in the 2nd trial it is 7/17 if the first ball drawn is red or 8/17 if the
first ball drawn is black, and so on. Clearly, the probabilities of success are not the same
for all the trials. Therefore, the trials are not Bernoulli trials.
Binomial Distribution
It is a random variable that represents the number of successes in “N” successive
independent trials of Bernoulli’s experiment. It is used in a plethora of instances
including the number of heads in “N” coin flips, and so on.
Let P and Q denote the success and failure of the Bernoulli Trial respectively. Let’s
assume we are interested in finding different ways in which we have 1 success in all
six trials.
Clearly, six cases are available as listed below:
PQQQQQ, QPQQQQ, QQPQQQ, QQQPQQ, QQQQPQ, QQQQQP

Likewise, 2 successes and 4 failures will have 6C2 = 15 combinations, making it
difficult to list so many combinations. Hence, calculating probabilities of 0, 1, 2,
…, n successes can be long and time-consuming. To avoid such lengthy
calculations along with a listing of all possible cases, a formula is used for the
probabilities of the number of successes in n Bernoulli trials, which is given as:
If Y is a Binomial Random Variable, we denote this Y ∼ Bin(n, p), where p is the
probability of success in a given trial and q = 1 − p is the probability of failure. Let ‘n’
be the total number of trials and ‘x’ be the number of successes; the probability
function P(Y) for the Binomial Distribution is given as:
P(Y = x) = nCx · p^x · q^(n−x)
where x = 0, 1, 2, …, n
Example: When a fair coin is tossed 10 times, find the probability of getting:
1. Exactly Six Heads
2. At least Six Heads
Answer:



Every coin toss can be considered as a Bernoulli trial. Suppose X is the number of
heads in this experiment:

We already know, n = 10
p = 1/2
So, P(X = x) = nCx p^x (1-p)^(n-x), x = 0, 1, 2, 3, …, n
P(X = x) = 10Cx (1/2)^x (1/2)^(10-x) = 10Cx (1/2)^10
(i) When x = 6,
P(X = 6) = 10C6 (1/2)^10 = 210/1024 = 105/512 ≈ 0.205
(ii) P(at least 6 heads) = P(X >= 6) = P(X = 6) + P(X = 7) + P(X = 8) + P(X = 9) +
P(X = 10)
= (10C6 + 10C7 + 10C8 + 10C9 + 10C10) (1/2)^10 = (210 + 120 + 45 + 10 + 1)/1024 = 386/1024 = 193/512 ≈ 0.377
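The two answers can be checked numerically with a short Python sketch:

from math import comb

n, p = 10, 0.5

def pmf(x):
    # P(X = x) for a fair coin tossed 10 times
    return comb(n, x) * p ** x * (1 - p) ** (n - x)

print(pmf(6))                                   # 0.205078125  (= 105/512)
print(sum(pmf(x) for x in range(6, 11)))        # 0.376953125  (= 193/512)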

Negative Binomial Distribution


In a random experiment with a discrete range, it is not necessary that we get a success in
every trial. If we perform ‘n’ trials and get success ‘r’ times, where n > r, then we have
(n − r) failures. The probability distribution of the number of failures in this case is
called the negative binomial distribution. For example, suppose getting a 6 on a die is a
success and we want one 6; if a 6 is not obtained in the first trial, we keep throwing the
die until we get a 6. If we get a 6 on the sixth throw, the first 5 trials are failures, and if
we plot the probability distribution of these failures, the plot so obtained is called a
negative binomial distribution.
Poisson Probability Distribution
The Probability Distribution of the frequency of occurrence of an event over a
specific period is called Poisson Distribution . It tells how many times the event
occurred over a specific period. It basically counts the number of successes and takes a
value of the whole number i.e. (0,1,2…). It is expressed as
f(x; λ) = P(X = x) = (λ^x · e^(−λ)) / x!
where,
 x is the number of times the event occurred
 e = 2.718…
 λ is the mean value
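A minimal Python sketch of the Poisson formula above (λ = 3 is an assumed example value):

from math import exp, factorial

def poisson_pmf(x, lam):
    # P(X = x) = (lam**x * e**(-lam)) / x!
    return lam ** x * exp(-lam) / factorial(x)

# Probability of exactly 2 events when the mean rate is 3 per period
print(round(poisson_pmf(2, 3), 4))    # 0.224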



Binomial Distribution Examples
Binomial Distribution is used for the outcomes that are discrete in nature. Some of
the examples where Binomial Distribution can be used are mentioned below:
 To find the number of good and defective items produced by a factory.
 To find the number of girls and boys studying in a school.
 To find out the negative or positive feedback on something

Cumulative Probability Distribution


The Cumulative Probability Distribution for continuous variables is a function that
gives the probability that a random variable takes on a value less than or equal to a
specified point. It’s denoted as F(x), where x represents a specific value of the random
variable. For continuous variables, F(x) is found by integrating the probability density
function (pdf) from negative infinity to x. The function ranges from 0 to 1, is non-
decreasing, and right-continuous. It’s essential for computing probabilities,
determining percentiles, and understanding the behavior of continuous random
variables in various fields.
Cumulative Probability Distribution takes value in a continuous range; for example,
the range may consist of a set of real numbers. In this case, Cumulative Probability
Distribution will take any value from the continuum of real numbers unlike the discrete
or some finite value taken in the case of Discrete Probability distribution.
Cumulative Probability Distribution is of two types, Continuous Uniform
Distribution, and Normal Distribution.
Continuous Uniform Distribution
Continuous Uniform Distribution is described by a density function that is flat and
assumes value in a closed interval let’s say [P, Q] such that the probability is uniform
in this closed interval. It is represented as f(x; P, Q)
f(x; P, Q) = 1/(Q-P) for P≤x≤Q
f(x; P, Q) = 0; elsewhere
Normal Distribution
The normal distribution of a continuous random variable results in a bell-shaped curve. It
is often referred to as the Gaussian Distribution, after Carl Friedrich Gauss, who derived
its equation. This curve is frequently used by the meteorological department for rainfall
studies. The Normal Distribution of a random variable X is given by
n(x; μ, σ) = (1/(σ√(2π))) · e^(−(x−μ)²/(2σ²)) for −∞ < x < ∞
where
 μ is the mean
 σ is the standard deviation (σ² is the variance)
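The density formula can be evaluated directly in Python (a sketch; the values of x, μ and σ below are assumptions chosen for illustration):

from math import sqrt, pi, exp

def normal_pdf(x, mu, sigma):
    # n(x; mu, sigma) = 1/(sigma*sqrt(2*pi)) * exp(-(x - mu)**2 / (2*sigma**2))
    return exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * sqrt(2 * pi))

# Density of a standard normal distribution at its mean
print(round(normal_pdf(0, 0, 1), 4))    # 0.3989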



Normal Distribution Examples
The Normal Distribution Curve can be used to show the distribution of natural events
very well. Over the period it has become a favorite choice of statisticians to study
natural events. Some of the examples where the Normal Distribution Curve can be
used are mentioned below
 Salary of Working Class
 Life Expectancy of human in a Country
 Heights of Male or Female
 The IQ Level of children
 Expenditure of households

Probability Distribution Function
A PROBABILITY DISTRIBUTION FUNCTION is defined as the function that is used to express the
distribution of probability. Different types of probability distributions are expressed
differently; for continuous variables these functions are linked to probability density
functions.
The Probability Distribution Function (cumulative distribution function) of a random
variable X is given by F_X(x) = P(X ≤ x), where X is the random variable and P is the
probability.
The cumulative probability for a closed interval a ⇢ b is given as P(a < X ≤
b) = F_X(b) – F_X(a).
In terms of integrals, the cumulative probability function is given as
F_X(x) = ∫ f_X(t) dt, integrated from −∞ to x
For the random variable X at the value x = p, the cumulative probability is given as
F_X(p) = P(X ≤ p) = ∫ f_X(t) dt, integrated from −∞ to p
The Binomial Probability Distribution gives exact values and is often called
a Probability Mass Function. For a random variable X on a sample space S, where
X: S ⇢ A and A is the set of values of the discrete random variable, the probability
mass function can be defined as
f_X(x) = Pr(X = x) = P({s ∈ S: X(s) = x}).
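For the normal case, F_X(x) = P(X ≤ x) has a closed form in terms of the error function, so P(a < X ≤ b) = F_X(b) − F_X(a) can be computed as in this sketch (the interval values are assumptions):

from math import erf, sqrt

def normal_cdf(x, mu=0.0, sigma=1.0):
    # F_X(x) = P(X <= x) for a normal distribution
    return 0.5 * (1 + erf((x - mu) / (sigma * sqrt(2))))

# P(a < X <= b) = F_X(b) - F_X(a) for a standard normal variable
a, b = -1.0, 1.0
print(round(normal_cdf(b) - normal_cdf(a), 4))    # 0.6827, the familiar one-sigma probability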

Probability Distribution Table


When the values of a random variable and their corresponding probabilities are tabulated,
the result is called a Probability Distribution Table. The following table represents a
Probability Distribution Table:

X      X1    X2    X3    X4    ….    Xn
P(X)   P1    P2    P3    P4    ….    Pn

It should be noted that the sum of all probabilities is equal to 1.



UNIT-2

What is Big Data


Data which is very large in size is called Big Data. Normally we work on data of MB size (Word documents, Excel files) or at most GB size (movies, code), but data in petabytes, i.e. 10^15 bytes in size, is called Big Data. It is stated that almost 90% of today's data has been generated in the past 3 years.



Sources of Big Data

These data come from many sources like

o Social networking sites: Facebook, Google and LinkedIn all generate a huge
amount of data on a day-to-day basis as they have billions of users
worldwide.
o E-commerce sites: Sites like Amazon, Flipkart and Alibaba generate huge amounts
of logs from which users' buying trends can be traced.
o Weather stations: All weather stations and satellites give very large volumes of data,
which are stored and manipulated to forecast the weather.
o Telecom companies: Telecom giants like Airtel and Vodafone study user trends
and accordingly publish their plans, and for this they store the data of their
millions of users.
o Share market: Stock exchanges across the world generate huge amounts of
data through their daily transactions.

In recent years, Big Data was defined by the “3Vs”, but now there are “6Vs” of
Big Data, which are also termed the characteristics of Big Data, as follows:
1. Volume:
 The name ‘Big Data’ itself is related to a size which is enormous.
 Volume is a huge amount of data.
 To determine the value of data, size of data plays a very crucial role. If
the volume of data is very large, then it is actually considered as a ‘Big
Data’. This means whether a particular data can actually be considered as a
Big Data or not, is dependent upon the volume of data.
 Hence while dealing with Big Data it is necessary to consider a
characteristic ‘Volume’.
 Example: In the year 2016, the estimated global mobile traffic was 6.2
Exabytes (6.2 billion GB) per month. Also, by the year 2020 we will have
almost 40000 Exabytes of data.
2. Velocity:
 Velocity refers to the high speed of accumulation of data.
 In Big Data velocity data flows in from sources like machines, networks,
social media, mobile phones etc.
 There is a massive and continuous flow of data. This determines the
potential of data that how fast the data is generated and processed to meet
the demands.
 Sampling data can help in dealing with the issue like ‘velocity’.
 Example: More than 3.5 billion searches per day are made on
Google. Also, Facebook users are increasing by approximately 22% year on year.



3. Variety:
 It refers to nature of data that is structured, semi-structured and
unstructured data.
 It also refers to heterogeneous sources.
 Variety is basically the arrival of data from new sources that are both
inside and outside of an enterprise. It can be structured, semi-structured
and unstructured.
o Structured data: This data is basically an organized data. It
generally refers to data that has defined the length and format of
data.
o Semi-structured data: This data is basically semi-organised data. It is
generally a form of data that does not conform to the formal
structure of data. Log files are examples of this type of data.
o Unstructured data: This data basically refers to unorganized data.
It generally refers to data that doesn’t fit neatly into the traditional
row and column structure of the relational database. Texts,
pictures, videos etc. are the examples of unstructured data which
can’t be stored in the form of rows and columns.
4. Veracity:
 It refers to inconsistencies and uncertainty in data, that is data which is
available can sometimes get messy and quality and accuracy are difficult to
control.
 Big Data is also variable because of the multitude of data dimensions
resulting from multiple disparate data types and sources.
 Example: Data in bulk could create confusion whereas less amount of
data could convey half or Incomplete Information.
5. Value:
 After having the 4 V’s into account there comes one more V which stands
for Value! The bulk of Data having no Value is of no good to the company,
unless you turn it into something useful.
 Data in itself is of no use or importance but it needs to be converted into
something valuable to extract Information. Hence, you can state that Value!
is the most important V of all the 6V’s.
6. Variability:
 How often does the structure of your data change?
 How often does the meaning or shape of your data change?
 Example: it is as if you are eating the same ice cream daily and the taste just
keeps changing.

 Big Data Drivers

The term big data drivers refers to the key factors that contribute to the growth and adoption of big
data technologies and practices. These drivers are the forces or trends that cause the volume,



variety, and velocity of data to increase, pushing organizations to adopt tools and strategies to
manage, analyze, and leverage large data sets.

Big data drivers include:

1. Technological Advances: Improvements in computing power (e.g., cloud infrastructure,


advanced data processing tools) make it possible to store, manage, and analyze vast amounts
of data efficiently.
2. Data Generation: The proliferation of digital devices (e.g., IoT devices, smartphones,
sensors) and platforms (e.g., social media) generates huge volumes of data continuously.

3. Business Demands: Companies are increasingly using data-driven insights to improve


decision-making, enhance customer experience, and gain competitive advantage, driving the
need for big data technologies.

4. Real-Time and Predictive Analytics: The demand for real-time data processing and
predictive insights pushes the need for big data solutions that can handle rapid data flow and
complex analysis.

5. Cloud Computing: The availability of cloud services enables organizations to store, scale,
and analyze large datasets without heavy upfront infrastructure investment.

6. Social and Economic Factors: Changing consumer behaviors and regulatory requirements
(e.g., GDPR) also contribute to the growing importance of big data in businesses,
governments, and industries.

What is Big-Data Analytics?


Big Data Analytics is all about crunching massive amounts of information to uncover
hidden trends, patterns, and relationships. It’s like sifting through a giant mountain of
data to find the gold nuggets of insight.
Here’s a breakdown of what it involves:
 Collecting Data: Data comes from various sources such as social
media, web traffic, sensors and customer reviews.
 Cleaning the Data: Imagine having to assess a pile of rocks that included some
gold pieces in it. You would have to clean the dirt and the debris first. When data is
being cleaned, mistakes must be fixed, duplicates must be removed and the data
must be formatted properly.
 Analyzing the Data: It is here that the wizardry takes place. Data analysts
employ powerful tools and techniques to discover patterns and trends. It is the same
thing as looking for a specific pattern in all those rocks that you sorted through.



How does big data analytics work?
Big Data Analytics is a powerful tool which helps to find the potential of large and
complex datasets. To get better understanding, let’s break it down into key steps:
 Data Collection: Data is the core of Big Data Analytics. It is the gathering of
data from different sources such as the customers’ comments, surveys, sensors,
social media, and so on. The primary aim of data collection is to compile as much
accurate data as possible. The more data, the more insights.
 Data Cleaning (Data Preprocessing): The next step is to process this
information. It often requires some cleaning. This entails the replacement of
missing data, the correction of inaccuracies, and the removal of duplicates. It is like
sifting through a treasure trove, separating the rocks and debris and leaving only the
valuable gems behind.
 Data Processing: After that we will be working on the data processing. This
process contains such important stages as writing, structuring, and formatting of
data in a way it will be usable for the analysis. It is like a chef who is gathering the
ingredients before cooking. Data processing turns the data into a format suited for
analytics tools to process.
 Data Analysis: Data analysis is being done by means of statistical,
mathematical, and machine learning methods to get out the most important findings
from the processed data. For example, it can uncover customer preferences, market
trends, or patterns in healthcare data.
 Data Visualization: The results of data analysis are usually presented in visual form,
for example charts, graphs and interactive dashboards. Visualizations simplify the
large amounts of data and allow decision makers to quickly detect patterns and
trends.
 Data Storage and Management: Storing and managing the analyzed data is of
utmost importance, because you may want to come back to these insights later,
so how you store them matters greatly. Moreover, data protection and adherence
to regulations are key issues to be addressed at this crucial stage.
 Continuous Learning and Improvement: Big data analytics is a continuous
process of collecting, cleaning, and analyzing data to uncover hidden insights. It
helps businesses make better decisions and gain a competitive edge.
Types of Big Data Analytics
Big Data Analytics comes in many different types, each serving a different purpose:
1. Descriptive Analytics: This type helps us understand past events. In social
media, it shows performance metrics, like the number of likes on a post.
2. Diagnostic Analytics: Diagnostic analytics delves deeper to uncover the
reasons behind past events. In healthcare, it identifies the causes of high patient re-
admissions.



3. Predictive Analytics: Predictive analytics forecasts future events based on past
data. Weather forecasting, for example, predicts tomorrow’s weather by analyzing
historical patterns.
4. Prescriptive Analytics : However, this category not only predicts results but
also offers recommendations for action to achieve the best results. In e-commerce,
it may suggest the best price for a product to achieve the highest possible profit.
5. Real-time Analytics: The key function of real-time analytics is data processing
in real time. It swiftly allows traders to make decisions based on real-time market
events.
6. Spatial Analytics: Spatial analytics is about location data. In urban
management, it optimizes traffic flow using data from sensors and cameras
to minimize traffic jams.
7. Text Analytics: Text analytics delves into the unstructured data of text. In the
hotel business, it can use the guest reviews to enhance services and guest
satisfaction.

Big Data Analytics Technologies and Tools


Big Data Analytics relies on various technologies and tools that might sound complex,
let’s simplify them:
 Hadoop: Imagine Hadoop as an enormous digital warehouse. It’s used by
companies like Amazon to store tons of data efficiently. For instance, when
Amazon suggests products you might like, it’s because Hadoop helps manage your
shopping history.
 Spark: Think of Spark as the super-fast data chef. Netflix uses it to quickly
analyze what you watch and recommend your next binge-worthy show.
 NoSQL Databases: NoSQL databases, like MongoDB, are like digital filing
cabinets that Airbnb uses to store your booking details and user data. These
databases are popular because they are quick and flexible, so the platform can
provide you with the right information when you need it.
 Tableau: Tableau is like an artist that turns data into beautiful pictures. The
World Bank uses it to create interactive charts and graphs that help people
understand complex economic data.
 Python and R: Python and R are like magic tools for data scientists. They use
these languages to solve tricky problems. For example, Kaggle uses them to predict
things like house prices based on past data.
 Machine Learning Frameworks (e.g., TensorFlow): Machine
learning frameworks are the tools that make predictions. Airbnb
uses TensorFlow to predict which properties are most likely to be booked in certain
areas. It helps hosts make smart decisions about pricing and availability.
These tools and technologies are the building blocks of Big Data Analytics. They help
organizations gather, process, understand, and visualize data, making it easier for them
to make decisions based on information.



Benefits of Big Data Analytics
Big Data Analytics offers a host of real-world advantages, and let’s understand with
examples:
1. Informed Decisions: Imagine a store like Walmart. Big Data Analytics helps
them make smart choices about what products to stock. This not only reduces waste
but also keeps customers happy and profits high.
2. Enhanced Customer Experiences: Think about Amazon. Big Data Analytics is
what makes those product suggestions so accurate. It’s like having a personal
shopper who knows your taste and helps you find what you want.
3. Fraud Detection: Credit card companies, like MasterCard, use Big Data
Analytics to catch and stop fraudulent transactions. It’s like having a guardian that
watches over your money and keeps it safe.
4. Optimized Logistics: FedEx, for example, uses Big Data Analytics to deliver
your packages faster and with less impact on the environment. It’s like taking the
fastest route to your destination while also being kind to the planet.

Challenges of Big data analytics


While Big Data Analytics offers incredible benefits, it also comes with its set of
challenges:
 Data Overload: Consider Twitter, where approximately 6,000 tweets are posted
every second. The challenge is sifting through this avalanche of data to find
valuable insights.
 Data Quality: If the input data is inaccurate or incomplete, the insights
generated by Big Data Analytics can be flawed. For example, incorrect sensor
readings could lead to wrong conclusions in weather forecasting.
 Privacy Concerns: With the vast amount of personal data used, like in
Facebook’s ad targeting, there’s a fine line between providing personalized
experiences and infringing on privacy.
 Security Risks: With cyber threats increasing, safeguarding sensitive data
becomes crucial. For instance, banks use Big Data Analytics to detect fraudulent
activities, but they must also protect this information from breaches.
 Costs: Implementing and maintaining Big Data Analytics systems can be
expensive. Airlines like Delta use analytics to optimize flight schedules, but they
need to ensure that the benefits outweigh the costs.

Applications of Big Data Analytics


Big Data Analytics has a significant impact in various sectors:
 Healthcare: It aids in precise diagnoses and disease prediction, elevating patient
care.



 Retail: Amazon’s use of Big Data Analytics offers personalized product
recommendations based on your shopping history, creating a more tailored and
enjoyable shopping experience.
 Finance: Credit card companies such as Visa rely on Big Data Analytics to
swiftly identify and prevent fraudulent transactions, ensuring the safety of your
financial assets.
 Transportation: Companies like Uber use Big Data Analytics to optimize
drivers’ routes and predict demand, reducing wait times and improving overall
transportation experiences.
 Agriculture: Farmers make informed decisions, boosting crop yields while
conserving resources.
 Manufacturing: Companies like General Electric (GE) use Big Data Analytics
to predict machinery maintenance needs, reducing downtime and enhancing
operational efficiency.

Open Source Big data analytics tools

1. Helical Insight: Your Easy-to-Use Open Source BI Tool!

Helical Insight is like a magic wand for your data. It helps you turn your messy
numbers into clear, easy-to-understand insights. Below highlighted are some of the
prominent features of Helical Insight BI product
 self service interface for creating reports, dashboards, info-graphs and map
based analytics
 Plenty of visualization options with drill down, drill through and inter panel
communication options
 NLP (GenAI) based data analysis under development

2. Apache Spark

Apache Spark is a unified analytics engine known for its speed and ease of use. It
extends the MapReduce model to efficiently use more types of computations,
including interactive queries and stream processing. Key features include:



 In-memory computation: Enhances the processing speed of applications.
 Advanced analytics: Includes support for SQL queries, machine learning, and
graph processing.
 Flexible deployment: Can run on Hadoop, Apache Mesos, Kubernetes,
standalone, or in the cloud.
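As a small, hedged illustration of what working with Spark looks like from Python (assuming the pyspark package is installed; the file name and column names are hypothetical):

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# Start a local Spark session
spark = SparkSession.builder.appName("big-data-demo").getOrCreate()

# Read a (hypothetical) CSV of sales records and aggregate it in memory
sales = spark.read.csv("sales.csv", header=True, inferSchema=True)
totals = sales.groupBy("region").agg(F.sum("amount").alias("total_amount"))

totals.show()        # prints the aggregated result
spark.stop()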

3. Druid

Apache Druid is a high-performance, column-oriented, distributed data store. It is


well-suited for real-time analytics on large datasets. Druid’s key attributes are:

 Real-time ingestion: Supports real-time data ingestion and querying.


 Fast query performance: Optimized for OLAP (Online Analytical Processing)
queries.
 Scalability and fault tolerance: Designed to handle high throughput and scale
horizontally.
4. Elasticsearch

Elasticsearch is a distributed, RESTful search and analytics engine capable of solving


a growing number of use cases. It is part of the Elastic Stack, which includes tools
like Kibana, Logstash, and Beats. Features include:

 Full-text search: Advanced search capabilities on large volumes of data.


 Real-time indexing: Allows for the near real-time search and analytics.
 Scalability: Easily scales to hundreds of servers and petabytes of data.
5. Presto

Presto, originally developed by Facebook, is an open-source distributed SQL query


engine. It allows running interactive analytic queries against data sources of all sizes.
Key aspects of Presto are:

 Interactive querying: Executes queries with low latency, even over large
datasets.



 Pluggable architecture: Can query data from multiple sources like HDFS,
Amazon S3, and traditional databases.
 High performance: Optimized for running fast, ad-hoc queries at scale.

Mobile Business Intelligence (Mobile BI) refers to the delivery and access of business intelligence
(BI) data on mobile devices like smartphones and tablets. It enables users to monitor business
performance, view reports, dashboards, and analytics, and make data-driven decisions on the go.
Mobile BI enhances accessibility and real-time decision-making by providing data insights
wherever and whenever needed.

Tools of Mobile BI

Several tools and platforms are designed to support Mobile BI, offering capabilities like data
visualization, reporting, and analytics. Some popular Mobile BI tools include:

Tableau Mobile:



Provides interactive data visualizations and dashboards on mobile devices.

Offers native apps for iOS and Android.

Power BI (Microsoft):

Mobile version allows for creating and viewing reports and dashboards on mobile.

Integrates seamlessly with other Microsoft services like Azure and Excel.

Qlik Sense Mobile:

Provides self-service data visualization and discovery with offline access.

Interactive dashboards can be viewed on mobile devices.

SAP BusinessObjects Mobile:

Allows access to enterprise BI solutions and reports on mobile.

Optimized for mobile interaction with enterprise-level security.

Domo:

A cloud-based BI platform offering real-time data visualization on mobile.

Includes collaboration features for team-based decision-making.

Sisense Mobile:

Delivers interactive dashboards on mobile devices.

Focuses on embedding analytics directly into business applications.

Challenges of Mobile BI

While Mobile BI enhances accessibility, it also brings unique challenges:

Data Security:



Sensitive data is at risk if devices are lost or compromised.

Ensuring proper encryption, secure access, and authentication is critical.

Limited Screen Size:

Presenting complex data and reports on small mobile screens can be challenging.

BI tools need to optimize dashboards and visualizations for mobile layouts.

Offline Access:

Users may need access to BI data without internet connectivity.

Ensuring data synchronization and caching for offline use is a significant challenge.

Performance:

Mobile devices have less processing power compared to desktops.

Ensuring fast and smooth interaction with large datasets requires optimized mobile applications.

User Experience (UX):

Designing an intuitive and user-friendly interface for mobile devices is essential.

Complex BI features may need simplification for mobile users without compromising functionality.

Data Integration and Compatibility:

Mobile BI platforms must integrate with various data sources and enterprise systems.

Ensuring compatibility across different mobile operating systems (iOS, Android) can add
complexity.

Battery Life and Connectivity:

Mobile devices have limited battery life and may face connectivity issues.



Efficient apps that minimize battery usage and perform well under low bandwidth are essential for
Mobile BI success.

Crowdsourcing Analytics in Big Data refers to the process of leveraging a large group of people,
often through an open call on the internet, to analyze large datasets or solve complex analytical
problems. It combines the principles of crowdsourcing—where tasks are distributed to a wide
audience—and analytics, which involves interpreting and deriving insights from data. In the context
of Big Data, this approach is particularly useful for handling the vast, complex, and unstructured
datasets that require significant human input and diverse perspectives to process effectively.

Key Aspects of Crowdsourcing Analytics in Big Data

Human-In-The-Loop:

In many cases, automated algorithms struggle with certain tasks (e.g., image recognition, sentiment
analysis, and data labeling). Crowdsourcing allows humans to step in where AI and machine
learning models fall short.

Distributed Workforces:

Tasks like data cleaning, labeling, or even feature identification can be distributed to a large,
diverse group of people across the globe, enabling parallel processing of large datasets.

Collective Intelligence:

Crowdsourcing takes advantage of the diverse knowledge and expertise of a large pool of
individuals. This can lead to more creative, accurate, or comprehensive insights compared to
relying solely on algorithms or a small team of data scientists.



Scale and Speed:

Since the workload is distributed among many participants, it allows for faster processing of large
datasets, often much quicker than what a single team could achieve.

Cost-Effective:

Crowdsourcing can be a cost-effective method, as tasks can be distributed to freelancers,


volunteers, or participants in online platforms that specialize in such tasks (e.g., Amazon
Mechanical Turk, Kaggle).

Applications of Crowdsourcing Analytics in Big Data

Data Labeling and Classification:

Datasets, especially for machine learning models, often need to be labeled (e.g., images, texts).
Crowdsourcing platforms can assign these labeling tasks to many individuals to create high-quality,
labeled data.

Sentiment Analysis:

For companies analyzing social media data or customer feedback, crowdsourcing can help interpret
sentiments in a more nuanced way than automated sentiment analysis tools.

Pattern Recognition:

In cases where visual pattern recognition (such as identifying objects in satellite images) is
required, crowdsourcing can leverage human perception, which is often better at identifying
patterns in complex, unstructured data.

Problem Solving and Innovation:

Crowdsourcing analytics platforms (like Kaggle) allow individuals and teams to tackle complex big
data problems, often in competitions, leading to innovative approaches and insights.

Data Cleaning:



In large datasets, cleaning and verifying the accuracy of data can be tedious. Crowdsourced workers
can help clean, normalize, or correct errors in datasets.

Challenges of Crowdsourcing Analytics in Big Data

Quality Control:

Since the work is distributed to many individuals with varying levels of expertise, ensuring the
accuracy and quality of results can be challenging. Multiple checks or consensus mechanisms are
often required to validate the output.

Bias and Inconsistency:

Crowd participants may introduce their biases or may interpret data inconsistently. Careful
instruction and sample tasks are necessary to standardize how tasks are performed.

Data Security and Privacy:

When sensitive or proprietary data is involved, sharing data with a large group of participants raises
privacy and security concerns. Anonymizing data or restricting access can mitigate some risks.

Task Complexity:

Some analytical tasks might be too complex to easily distribute to non-experts. In such cases,
crowdsourcing might not be an effective solution compared to more specialized teams or AI-driven
approaches.

Motivation and Engagement:

Maintaining engagement and ensuring that participants remain motivated can be difficult, especially
for repetitive or tedious tasks. Incentive structures are crucial for sustained participation.

Popular Platforms for Crowdsourcing Analytics

Kaggle:

A platform for data science competitions where crowdsourcing analytics is applied to solve
complex data problems.



Amazon Mechanical Turk (MTurk):

A crowdsourcing marketplace that allows businesses to distribute small tasks (like data labeling and
classification) to a large number of workers.

Zooniverse:

A platform that enables crowdsourcing for scientific research, such as identifying galaxies,
classifying animals in images, and analyzing historical documents.

Conclusion

Crowdsourcing analytics is a powerful tool in the Big Data space, offering a way to distribute
complex tasks to a broad audience, resulting in faster and often more innovative solutions.
However, challenges such as ensuring data quality and managing biases must be carefully
addressed for effective implementation.



UNIT-3

What is Data Processing?

Data processing means the processing of data, i.e. converting it from one format into another. As we all know, data is very useful, and when it is well presented it becomes informative. A data processing system is also referred to as an information system. It is also fair to say that data processing is the process of converting data into information and vice versa.

Processing Data vs Processed Data

Processing data definition involves defining and managing the structure, characteristics, and
specifications of data within an organization.

Processed data definition typically refers to the refined and finalized specifications and attributes
associated with data after it has undergone various processing steps.

In simple words, the processing of data can be expressed as:

The process of converting data into a computer-understandable format.

The sorting or processing of data by a computer.

Stages of Data Processing Process

Data processing process involves a series of stages to transform raw data into meaningful
information. Here are the six fundamental stages of data processing process:

1. Collection

The process begins with the collection of raw data from various sources. This stage establishes the
foundation for subsequent processing, ensuring a comprehensive pool of data relevant to the
intended analysis. Sources could include surveys, sensors, databases, or any other means of gathering
relevant information.



2. Preparation
Data preparation focuses on organizing, data cleaning, and formatting raw data. Irrelevant
information is filtered out, errors are corrected, and the data is structured in a way that facilitates
efficient analysis during subsequent stages of processing.

3. Input

During the data input stage, the prepared data is entered into a computer system. This can be
achieved through manual entry or automated methods, depending on the nature of the data and the
systems in place.

4. Data Processing

The core of data processing involves manipulating and analyzing the prepared data. Operations
such as sorting, summarizing, calculating, and aggregating are performed to extract meaningful
insights and patterns.

5. Data Output

The results of data processing are presented in a comprehensible format during the data output
stage. This could include reports, charts, graphs, or other visual representations that facilitate
understanding and decision-making based on the analyzed data.

6. Data Storage

The final stage entails storing the processed data for future reference and analysis. This is crucial
for maintaining a historical record, enabling efficient retrieval, and supporting ongoing or future
data-related initiatives. Proper data storage ensures the longevity and accessibility of valuable
information.
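
The six stages can be tied together in a minimal Python sketch. This is only an illustration of the flow, not a standard recipe: the file names, the column names, and the cleaning rule are assumptions made for the example.

import csv
import statistics

# 1. Collection: read raw records from a (hypothetical) survey file
with open("survey_raw.csv", newline="") as f:
    raw_rows = list(csv.DictReader(f))

# 2. Preparation: keep only rows with a usable 'age' field and convert types
prepared = [
    {"name": (r.get("name") or "").strip(), "age": int(r["age"])}
    for r in raw_rows
    if (r.get("age") or "").strip().isdigit()
]

# 3. Input: load the prepared records into the structure used for analysis
ages = [r["age"] for r in prepared]

# 4. Processing: summarise the data
summary = {
    "count": len(ages),
    "mean_age": statistics.mean(ages) if ages else None,
}

# 5. Output: present the result in a readable form
print("Respondents:", summary["count"], "| Average age:", summary["mean_age"])

# 6. Storage: persist the processed result for future reference
with open("survey_summary.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=list(summary.keys()))
    writer.writeheader()
    writer.writerow(summary)
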

Data Processing Process


There are three main data processing methods – manual, mechanical and electronic.

Manual Data Processing

Manual data processing relies on human effort to manage and manipulate data. It involves tasks
such as sorting, calculating, and recording information without the use of machines or electronic
devices. While it is prone to errors and time-consuming, manual processing remains relevant in
situations where human judgment, intuition, or a personal touch is necessary.

Mechanical Data Processing

Mechanical data processing involves the use of machines, like punch cards or mechanical
calculators, to handle data. It represents an intermediate stage between manual and electronic
processing, offering increased efficiency over manual methods but lacking the speed and
sophistication of electronic systems. This method was prominent before the widespread adoption of
computers.

Electronic Data Processing

Electronic data processing leverages computers and digital technology to perform data-related
tasks. It has revolutionized the field by significantly enhancing processing speed, accuracy, and
capacity. Electronic data processing encompasses various techniques, including batch processing,
real-time processing, and online processing, making it a cornerstone of modern information
management and analysis.

Types of Data Processing


There are 7 types of Data Processing, mentioned below:

1. Manual Data Processing

In this type, data is processed by humans without the use of machines or electronic devices. It
involves tasks such as manual calculations, sorting, and recording, making it a time-consuming
process.

2. Mechanical Data Processing

This type utilizes mechanical devices, such as punch cards or mechanical calculators, to process
data. While more efficient than manual processing, it lacks the speed and capabilities of electronic
methods.

3. Electronic Data Processing

Electronic Data Processing (EDP) involves the use of computers to process and analyze data. It
significantly enhances speed and accuracy compared to manual and mechanical methods, making it
a fundamental shift in data processing.

4. Batch Data Processing

Batch processing involves grouping data into batches and processing them together at a scheduled time. It is suitable for non-time-sensitive tasks and is efficient for large-scale data processing; a short sketch contrasting batch and real-time handling is given after this list of types.

5. Real-time Data Processing

Real-time processing deals with data immediately as it is generated. It is crucial for time-sensitive
applications, providing instant responses and updates, often seen in applications like financial
transactions and monitoring systems.

6. Online Data Processing

Online data processing, closely related to online transaction processing (OLTP), involves processing data directly as it is collected. It is interactive and supports concurrent transactions, making it suitable for applications that require simultaneous user interaction and continuous data updates.

7. Automatic Data Processing

Automatic Data Processing (ADP) refers to the use of computers and software to automate data
processing tasks. It encompasses various methods, including batch processing and real-time
processing, to efficiently handle large volumes of data with minimal human intervention.
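
The difference between batch and real-time processing (types 4 and 5 above) can be sketched in a few lines of Python. The event values below are invented purely for illustration.

# A toy illustration of batch versus real-time handling.
events = [5, 12, 7, 20, 3]  # pretend these values arrive one by one

# Batch processing: accumulate everything first, process at a scheduled time
batch = list(events)            # collect the whole batch
batch_total = sum(batch)        # process the batch in one go
print("Batch total:", batch_total)

# Real-time processing: react to each event the moment it arrives
running_total = 0
for value in events:
    running_total += value      # update immediately, event by event
    print("Running total after event", value, "is", running_total)
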

Advantages of Data Processing

Highly efficient

Time-saving

High speed

Reduces errors

Disadvantages of Data Processing

Large power consumption

Occupies large memory.

The cost of installation is high

Wastage of memory

What is Data Integration?
Data integration is the process of combining data from multiple sources into a cohesive and
consistent view. This process involves identifying and accessing the different data sources,
mapping the data to a common format, and reconciling any inconsistencies or discrepancies
between the sources. The goal of data integration is to make it easier to access and analyze data
that is spread across multiple systems or platforms, in order to gain a more complete and accurate
understanding of the data.

There are mainly 2 major approaches for data integration – one is the “tight coupling approach”
and another is the “loose coupling approach”.
Tight Coupling:
This approach involves creating a centralized repository or data warehouse to store the integrated
data. The data is extracted from various sources, transformed and loaded into a data warehouse.
Data is integrated in a tightly coupled manner, meaning that the data is integrated at a high level,
such as at the level of the entire dataset or schema. This approach is also known as data
warehousing, and it enables data consistency and integrity, but it can be inflexible and difficult to
change or update.
 Here, a data warehouse is treated as an information retrieval component.
 In this coupling, data is combined from different sources into a single physical location
through the process of ETL – Extraction, Transformation, and Loading.
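
A minimal sketch of the tight-coupling (ETL) idea is given below, using Python's built-in sqlite3 module as a stand-in for the data warehouse. The two source datasets, their formats, and the sales table are assumptions made for the example.

import sqlite3

# Source A exports ISO dates; source B exports day/month/year dates and comma decimals.
source_a = [("2024-01-05", 120.0), ("2024-01-06", 80.5)]
source_b = [("05/01/2024", "95,00")]

# Extraction + Transformation: bring both sources to one common format
def transform_b(row):
    day, month, year = row[0].split("/")
    amount = float(row[1].replace(",", "."))
    return (f"{year}-{month}-{day}", amount)

unified = source_a + [transform_b(r) for r in source_b]

# Loading: store the integrated data in a central repository (the "warehouse")
warehouse = sqlite3.connect(":memory:")
warehouse.execute("CREATE TABLE sales (sale_date TEXT, amount REAL)")
warehouse.executemany("INSERT INTO sales VALUES (?, ?)", unified)

print(warehouse.execute("SELECT COUNT(*), SUM(amount) FROM sales").fetchone())
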
Loose Coupling:
This approach integrates data at a lower level, such as at the level of individual data elements or records, without creating a central repository or data warehouse. Data is integrated in a loosely coupled manner: the data stays in its original sources and is combined on demand. This approach is also known as data federation; it offers flexibility and easy updates, but it can be difficult to maintain consistency and integrity across multiple data sources.
 Here, an interface is provided that takes the query from the user, transforms it in a way the
source database can understand, and then sends the query directly to the source databases to
obtain the result.
 And the data only remains in the actual source databases.
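
A toy sketch of the loose-coupling (data federation) idea follows. The two "source databases" are plain dictionaries here, and the customer fields are invented; in practice each source would be a separate live system queried through its own interface.

source_crm = {"C001": {"name": "Asha", "city": "Indore"}}
source_billing = {"C001": {"balance": 2500.0}}

def federated_lookup(customer_id):
    # Translate the user's request into a query each source understands,
    # send it to the sources, and combine the answers on the fly.
    profile = source_crm.get(customer_id, {})
    account = source_billing.get(customer_id, {})
    return {**profile, **account}   # combined view; nothing is copied to a central store

print(federated_lookup("C001"))
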
Issues in Data Integration:

There are several issues that can arise when integrating data from multiple sources, including:
1. Data Quality: Inconsistencies and errors in the data can make it difficult to combine and
analyze.
2. Data Semantics: Different sources may use different terms or definitions for the same
data, making it difficult to combine and understand the data.
3. Data Heterogeneity: Different sources may use different data formats, structures, or
schemas, making it difficult to combine and analyze the data.
4. Data Privacy and Security: Protecting sensitive information and maintaining security
can be difficult when integrating data from multiple sources.
5. Scalability: Integrating large amounts of data from multiple sources can be
computationally expensive and time-consuming.
6. Data Governance: Managing and maintaining the integration of data from multiple
sources can be difficult, especially when it comes to ensuring data accuracy, consistency, and
timeliness.
7. Performance: Integrating data from multiple sources can also affect the performance of
the system.
8. Integration with existing systems: Integrating new data sources with existing systems
can be a complex task, requiring significant effort and resources.
9. Complexity: The complexity of integrating data from multiple sources can be high,
requiring specialized skills and knowledge.

What is data extraction?

Data extraction is defined as the process of retrieving data from various sources. This step in the
data handling process involves gathering and converting different forms of data into a more usable
or accessible format. The primary goal of data extraction is to collect data from disparate sources
for further processing, analysis, or storage in a centralized location.

The importance of data extraction cannot be overstated, especially in today's data-centric environment. It enables businesses and organizations to harness valuable insights from their data,
driving strategic decisions and operational improvements. Effective data extraction practices allow
for the seamless integration of data into business intelligence (BI) tools, facilitating comprehensive
analysis and reporting.

In essence, data extraction is a critical first step in the data journey, setting the stage for value
creation through data analysis and interpretation. By efficiently extracting relevant data,
organizations can unlock a wealth of opportunities for innovation, efficiency, and competitive
advantage.

Types of data extraction

Depending on the nature and format of the data, different extraction methods are employed to
efficiently retrieve valuable insights.

The main types of data extraction include structured, unstructured, and semi-structured data
extraction.

Structured data extraction


Structured data extraction focuses on retrieving data from highly organized sources where the
format and schema are predefined, such as databases, spreadsheets, and other structured formats.
This type of extraction is characterized by its high level of accuracy and efficiency, as the
structured nature of the source data simplifies the identification and collection of specific data
elements. Tools designed for structured data extraction are adept at navigating these environments,
enabling users to specify the exact data needed for their purposes.

Unstructured data extraction


Unstructured data extraction, on the other hand, deals with data that lacks a predefined format or
organization, such as text documents, emails, videos, and social media posts. Extracting data from
these sources is inherently more complex, requiring advanced techniques such as natural language
processing (NLP) and machine learning to interpret and organize the data.

Semi-structured data extraction


Semi-structured data extraction occupies the middle ground between structured and unstructured
data. Sources like Extensible Markup Language (XML) files, JavaScript Object Notation (JSON)
documents, and certain web pages, while not as rigidly structured as databases, still contain markers
or tags that provide some level of organization. Extraction from these sources often involves
parsing the semi-structured format to identify and extract the relevant data.
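
As a small illustration, the snippet below parses a semi-structured JSON document and extracts only the relevant elements. The document structure (an order with nested items) is an assumption made for the example.

import json

raw = '{"order": {"id": 42, "items": [{"sku": "A1", "qty": 2}, {"sku": "B7", "qty": 1}]}}'

doc = json.loads(raw)                                  # parse the semi-structured source
rows = [
    (doc["order"]["id"], item["sku"], item["qty"])     # keep only the fields of interest
    for item in doc["order"]["items"]
]
print(rows)   # [(42, 'A1', 2), (42, 'B7', 1)]
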

Each type of data extraction presents its own set of challenges and opportunities.

The role of data extraction in ETL

The extract, transform, and load (ETL) process is a cornerstone of data warehousing and business
intelligence. It involves extracting data from various sources, transforming it into a format suitable
for analysis, and loading it into a destination system, such as a data warehouse. Data extraction is
the first and arguably most critical step in this process, as it involves identifying and retrieving
relevant data from internal and external sources.

Data extraction fits into the ETL process as the foundational phase that determines the quality and
usability of the data being fed into the subsequent stages. Without effective data extraction, the
transform and load phases cannot perform optimally, potentially compromising the integrity and
value of the final dataset. This stage sets the tone for the efficiency of the entire ETL pipeline,
highlighting the importance of employing robust data extraction techniques and tools.

The benefits of using data extraction in ETL


The benefits of using data extraction in the ETL process are manifold:
 Data extraction enables businesses to consolidate data from disparate sources, providing a
unified view that is necessary for comprehensive analysis
 Efficient data extraction processes can significantly reduce the time and effort required to
gather and prepare data for analysis, accelerating time to insight

 By automating the data extraction phase, organizations can minimize errors and
inconsistencies, ensuring that the data loaded into their analytical systems is accurate and reliable

Data extraction techniques

Both logical and physical considerations shape how data is extracted: the projected amount of data to be extracted and the stage of the ETL process (initial load or ongoing data maintenance) influence the choice of method. Essentially, you must decide how to extract data logically and how to extract it physically.

Methods of Logical Extraction

Logical extraction can be divided into two types −

Full Extraction

The data is pulled from the source system in its entirety. There is no need to keep track of changes in the data source, because a full extraction reflects all of the information currently stored on the source system.

The source data is delivered as-is, with no additional logical information (such as timestamps) required on the source site. An export file of a specific table and a remote SQL query scanning the entire source table are two examples of full extraction.

Incremental Extraction

Only data that has changed since a particular occurrence in the past will be extracted at a given
time. This event could be the end of the extraction process or a more complex business event such
as the last day of a fiscal period's bookings. To detect this delta change, there must be a way to
identify all the changed information since this precise time event.

This information can be provided by the source data itself, for example an application column holding the last-changed timestamp, or by a change table in which a separate mechanism keeps track of modifications alongside the originating transactions. In most situations, using the latter option means adding extraction logic to the source system.
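
The two logical extraction methods can be sketched with Python's built-in sqlite3 module standing in for the source system. The orders table and its last_modified column are assumptions made for the example.

import sqlite3

# A toy source system with a table that records when each row last changed.
src = sqlite3.connect(":memory:")
src.execute("CREATE TABLE orders (id INTEGER, amount REAL, last_modified TEXT)")
src.executemany("INSERT INTO orders VALUES (?, ?, ?)", [
    (1, 100.0, "2024-01-01"),
    (2, 250.0, "2024-02-15"),
])

# Full extraction: pull everything, no change tracking needed
full = src.execute("SELECT id, amount FROM orders").fetchall()

# Incremental extraction: pull only rows changed since the last successful run,
# using the last-changed timestamp maintained by the source application
last_run = "2024-02-01"
delta = src.execute(
    "SELECT id, amount FROM orders WHERE last_modified > ?", (last_run,)
).fetchall()

print("Full:", full)
print("Delta since", last_run, ":", delta)
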

As part of the extraction process, many data warehouses do not apply any change-capture
algorithms. Instead, full tables from source systems are extracted to the data warehouse or staging
area, and these tables are compared to a previous source system extract to detect the changed data.

Although this strategy may have little influence on the source systems, it puts a strain on the data warehouse procedures, especially if the data volumes are substantial.

Methods of Physical Extraction

Physically extracting the data can be done in two ways, depending on the chosen logical extraction method and on the capabilities and limits of the source site. The data can be extracted online, directly from the source system, or offline, from a structure outside it. Such an offline structure might already exist or might be created by an extraction routine.

Physical Extraction can be done in the following ways −

Online Extraction

The information is taken directly from the source system. The extraction procedure can link directly
to the source system to access the source tables or connect to an intermediate system to store the
data in a predefined format (for example, snapshot logs or change tables). It's worth noting that the
intermediary system doesn't have to be physically distinct from the source system.

When using online extraction, you should consider whether the distributed transactions use the original source objects or prepared source objects.

Offline Extraction

The data is staged deliberately outside the source system rather than extracted directly from it. The staged data either already has an existing structure (for example, redo logs, archive logs, or transportable tablespaces) or is created by an extraction routine.

The following structures should be considered −

 Flat files: data in a defined, generic format. Additional information about the source object is needed for further processing.
 Dump files: an Oracle-specific format in which information about the containing objects may be included.
 Redo and archive logs: the information is held in a special, additional dump file.
 Transportable tablespaces.

Data transformation in data mining refers to the process of converting raw data into a format that
is suitable for analysis and modeling. The goal of data transformation is to prepare the data for
data mining so that it can be used to extract useful insights and knowledge. Data transformation
typically involves several steps, including:
1. Data cleaning: Removing or correcting errors, inconsistencies, and missing values in the
data.
2. Data integration: Combining data from multiple sources, such as databases and
spreadsheets, into a single format.
3. Data normalization: Scaling the data to a common range of values, such as between 0
and 1, to facilitate comparison and analysis.
4. Data reduction: Reducing the dimensionality of the data by selecting a subset of relevant
features or attributes.
5. Data discretization: Converting continuous data into discrete categories or bins.
6. Data aggregation: Combining data at different levels of granularity, such as by summing
or averaging, to create new features or attributes.
The data is transformed in ways that make it well suited for mining. Data transformation involves the following steps:
1. Smoothing: Smoothing is a process used to remove noise from the dataset with the help of algorithms. It highlights the important features present in the dataset and helps in predicting patterns. When data is collected, it can be manipulated to eliminate or reduce variance and other forms of noise. The idea behind data smoothing is that simple changes become easier to identify, which helps in predicting trends and patterns. This is a great help to analysts or traders who need to look at a lot of data, which can often be difficult to digest, in order to find patterns they would not see otherwise.
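
A minimal sketch of smoothing with a simple moving average is shown below; the values and the window size are chosen arbitrarily for illustration.

values = [10, 12, 9, 15, 14, 30, 13, 12]
window = 3

# Average of each 3-value window smooths out the spike at 30
smoothed = [
    sum(values[i:i + window]) / window
    for i in range(len(values) - window + 1)
]
print(smoothed)
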
2. Aggregation: Data aggregation is the method of storing and presenting data in a summary format. The data may be obtained from multiple sources and integrated into a single description for analysis. This is a crucial step, since the accuracy of the insights drawn from data analysis depends heavily on the quantity and quality of the data used. Gathering accurate data of sufficient quality and volume is necessary to produce relevant results. Aggregated data is useful for everything from decisions concerning financing or product strategy to pricing, operations, and marketing. For example, daily sales data may be aggregated to compute monthly and annual totals.
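
A small sketch of aggregation using pandas (assuming it is installed); the daily sales figures are made up for the example.

import pandas as pd

# Aggregate daily sales into monthly totals
sales = pd.DataFrame({
    "date": pd.to_datetime(["2024-01-10", "2024-01-25", "2024-02-03"]),
    "amount": [200.0, 150.0, 300.0],
})

monthly = sales.groupby(sales["date"].dt.to_period("M"))["amount"].sum()
print(monthly)
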

3. Discretization: Discretization is the process of transforming continuous data into a set of small intervals. Many real-world data mining activities involve continuous attributes, yet many existing data mining frameworks cannot handle these attributes directly. Even when a data mining task can manage a continuous attribute, its efficiency can often be improved significantly by replacing the continuous values with discrete ones. For example, numeric values can be grouped into intervals such as 1-10 and 11-20, and a continuous age attribute can be mapped to the categories young, middle age, and senior.
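
A short sketch of discretization using pandas (assuming it is installed); the bin edges used to define young, middle age, and senior are arbitrary choices for the example.

import pandas as pd

# Discretize a continuous 'age' attribute into categories
ages = pd.Series([22, 25, 47, 63, 35])
categories = pd.cut(ages, bins=[0, 30, 55, 100], labels=["young", "middle age", "senior"])
print(categories.tolist())   # ['young', 'young', 'middle age', 'senior', 'middle age']
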

4. Attribute Construction: New attributes are constructed from the given set of attributes and added to assist the mining process. This simplifies the original data and makes the mining more efficient.

5. Generalization: Generalization converts low-level data attributes into high-level data attributes using a concept hierarchy. For example, an age attribute initially in numerical form (22, 25) can be converted into categorical values (young, old), and categorical attributes such as house addresses may be generalized to higher-level definitions such as town or country.

6. Normalization: Data normalization involves converting all data variables into a given range.
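
A minimal sketch of min-max normalization, which rescales each value into the range 0 to 1 using (x - min) / (max - min); the sample values are arbitrary.

values = [22, 25, 47, 63, 35]
lo, hi = min(values), max(values)

# Rescale so that the minimum maps to 0 and the maximum maps to 1
normalized = [(v - lo) / (hi - lo) for v in values]
print([round(v, 2) for v in normalized])
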
Advantages of Data Transformation in Data Mining
1. Improves Data Quality: Data transformation helps to improve the quality of data by
removing errors, inconsistencies, and missing values.
2. Facilitates Data Integration: Data transformation enables the integration of data from
multiple sources, which can improve the accuracy and completeness of the data.
3. Improves Data Analysis: Data transformation helps to prepare the data for analysis and
modeling by normalizing, reducing dimensionality, and discretizing the data.
4. Increases Data Security: Data transformation can be used to mask sensitive data, or to
remove sensitive information from the data, which can help to increase data security.
5. Enhances Data Mining Algorithm Performance: Data transformation can improve the
performance of data mining algorithms by reducing the dimensionality of the data and scaling
the data to a common range of values.
Disadvantages of Data Transformation in Data Mining
1. Time-consuming: Data transformation can be a time-consuming process, especially when
dealing with large datasets.
2. Complexity: Data transformation can be a complex process, requiring specialized skills
and knowledge to implement and interpret the results.
3. Data Loss: Data transformation can result in data loss, such as when discretizing
continuous data, or when removing attributes or features from the data.
4. Biased transformation: Data transformation can result in bias, if the data is not properly
understood or used.
5. High cost: Data transformation can be an expensive process, requiring significant
investments in hardware, software, and personnel.
6. Overfitting: Data transformation can lead to overfitting, which is a common problem in
machine learning where a model learns the detail and noise in the training data to the extent
that it negatively impacts the performance of the model on new unseen data.
