Module 2 BDA
Advances in technology have led to the production and storage of voluminous amounts of data.
The large volume of data, together with its variety of forms and formats, poses challenges to
conventional systems for storage, processing and analysis.
This increasing complexity calls for a means of quickly processing, analyzing and using data.
Big Data
Definition: Big Data is a high-volume, high-velocity and high-variety information asset that requires new
forms of processing for enhanced decision making, insight discovery and process optimization.
“A collection of data sets so large or complex that traditional data processing applications are inadequate”-
Wikipedia
“Data of very large size, typically to the extent that its manipulation and management present significant
logistical challenges” - Oxford
“Big data refers to data sets whose size is beyond the ability of typical database software tools to capture,
store, manage and analyze” – McKinsey Global Institute
Data: information, usually in the form of facts or statistics, that can be analyzed or used.
Web Data: data present on web servers in the form of text, images, video, audio and multimedia files for
web users.
Structured data: possible to
• Insert, delete, update, append
• Indexing
• Scalability
• Transaction processing
• Encryption and decryption
Unstructured data:
• Do not conform to or associate with any data model
• File types: .TXT, .CSV
• Data may have internal structure, but do not reveal any relationship
Multi-structured data: consists of multiple formats of data, e.g. structured, semi-structured and unstructured.
Example: streaming data on customer interactions, data from multiple sensors, data at web/enterprise servers, etc.
Big Data Characteristics
Big Data Types
Social networks and web data: Facebook, Twitter, e-mails, blogs, etc.
Transaction data and Business Processes (BPs): credit card transactions, flight bookings, etc.
Customer master data: data for facial recognition and for the name, date of birth, location
and income category, etc.
Machine-generated data: machine-to-machine, IoT, web logs, computer system logs, etc.
A probability is a number between 0 and 1 that measures the likelihood that some
event will occur. An event with probability 0 cannot occur, whereas an event with
probability 1 is certain to occur. An event with probability greater than 0 and less than 1 involves
uncertainty. The closer its probability is to 1, the more likely it is to occur.
Rule of Complements
The simplest probability rule involves the complement of an event. If A is any event, then the
complement of A, denoted by Ā (or in some books by A^c), is the event that A does not occur.
For example, if A is the event that the Dow finishes the year at or above the 14,000 mark, then the
complement of A is that the Dow finishes the year below 14,000.
If the probability of A is P(A), then the probability of its complement, P(Ā), is given by
P(Ā) = 1 − P(A)
Equivalently, the probability of an event and the probability of its complement sum to 1. For example,
if you believe that the probability of the Dow finishing at or above 14,000 is 0.25, then the
probability that it will finish the year below 14,000 is 1 − 0.25 = 0.75.
Addition Rule
Events are mutually exclusive if at most one of them can occur. That is, if one of them occurs, then
none of the others can occur.
Events are exhaustive if they exhaust all possibilities, that is, one of them must occur.
Let A1 through An be any n events. The addition rule of probability concerns the probability that
at least one of these events will occur. When the events are mutually exclusive, this probability is
the sum of their individual probabilities.
Addition Rule for Mutually Exclusive Events
P(at least one of A1 through An) = P(A1) + P(A2) + … + P(An) (4.2)
In addition, if the events A1 through An are exhaustive, then this probability is 1, because one of the
events is certain to occur.
For example, consider the following three events involving a company’s annual revenue for the coming year:
(1) A1: revenue is less than $1 million,
(2) A2: revenue is at least $1 million but less than $2 million, and
(3) A3: revenue is at least $2 million.
Exactly one of these events must occur, so they are mutually exclusive and exhaustive.
Therefore, their probabilities must sum to 1. Suppose these probabilities are P(A1) = 0.5,
P(A2) = 0.3, and P(A3) = 0.2.
The addition rule also gives the probabilities of combinations of these events. For example, the event
that revenue is at least $1 million is the event that either A2 or A3 occurs. From the addition rule,
its probability is
P(revenue is at least $1 million) = P(A2) + P(A3) = 0.5
Similarly,
P(revenue is less than $2 million) = P(A1) + P(A2) = 0.8
P(revenue is less than $1 million or at least $2 million) = P(A1) + P(A3) = 0.7
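The revenue calculations can be checked with a short Python sketch. The dictionary keys and the helper name prob_at_least_one are illustrative choices, not part of the original notes, and the helper is only valid when the events passed to it are mutually exclusive.

```python
import math

# Probabilities of the three revenue events from the example.
p = {"A1": 0.5, "A2": 0.3, "A3": 0.2}

def prob_at_least_one(events, probs):
    """Addition rule for mutually exclusive events: sum the individual probabilities."""
    return sum(probs[e] for e in events)

# Mutually exclusive and exhaustive events must have probabilities that sum to 1.
assert math.isclose(sum(p.values()), 1.0)

at_least_1m = prob_at_least_one(["A2", "A3"], p)   # revenue is at least $1 million
less_than_2m = prob_at_least_one(["A1", "A2"], p)  # revenue is less than $2 million
not_middle = prob_at_least_one(["A1", "A3"], p)    # below $1 million or at least $2 million

print(round(at_least_1m, 2), round(less_than_2m, 2), round(not_middle, 2))
```

The last value can also be obtained from the rule of complements as 1 − P(A2).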
Conditional Probability and the Multiplication Rule
Let A and B be any events with probabilities P(A) and P(B). Typically, the probability
P(A) is assessed without knowledge of whether B occurs. However, if you are told that
B has occurred, then the probability of A might change. The new probability of A is
called the conditional probability of A given B, and it is denoted by P(A∣B).
Conditional Probability
P(A∣B) = P(A and B) / P(B) (4.3)
The numerator in this formula is the probability that both A and B occur. This probability must be
known to find P(A∣B). However, in some applications P(A∣B) and P(B) are known. Then you can
multiply both sides of Equation (4.3) by P(B) to obtain the following multiplication rule for
P(A and B).
Multiplication Rule
P(A and B) = P(A∣B) P(B)
For example, suppose a company, Bender, must meet an end-of-July deadline and depends on materials
from its supplier. Let A be the event that Bender meets its end-of-July deadline, and let B be the
event that Bender receives the materials from its supplier by the middle of July. The probabilities
Bender is best able to assess on July 1 are probably P(B) and P(A∣B). At the beginning of July,
Bender might estimate that the chances of getting the materials on time from its supplier are 2 out of
3, so that P(B) = 2/3. Also, thinking ahead, Bender estimates that if it receives the required materials
on time, the chances of meeting the end-of-July deadline are 3 out of 4. This is a conditional
probability statement, namely, that P(A∣B) = 3/4. Then the multiplication rule implies that
P(A and B) = P(A∣B) P(B) = (3/4)(2/3) = 0.5
That is, there is a fifty-fifty chance that Bender will get its materials on time and meet its end-of-July
deadline.
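The Bender calculation can be reproduced with exact fractions. This is a minimal sketch; the variable names are illustrative and only the two probabilities come from the example.

```python
from fractions import Fraction

p_b = Fraction(2, 3)          # P(B): materials arrive by the middle of July
p_a_given_b = Fraction(3, 4)  # P(A|B): deadline met, given materials arrive on time

# Multiplication rule: P(A and B) = P(A|B) * P(B)
p_a_and_b = p_a_given_b * p_b

# Dividing the joint probability by P(B) recovers the conditional probability.
recovered = p_a_and_b / p_b

print(p_a_and_b, float(p_a_and_b))
```

Using Fraction avoids floating-point rounding, so the result is exactly 1/2.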
Equally Likely Events
Much of what you know about probability is probably based on situations where outcomes are
equally likely. These include flipping coins, throwing dice, drawing balls from urns, and other random
mechanisms.
For example, suppose an urn contains 20 red marbles and 10 blue marbles. You plan to randomly
select five marbles from the urn, and you are interested, say, in the probability of selecting at least
three red marbles.
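One way to work out the urn probability is to count equally likely outcomes. The sketch below assumes the five marbles are drawn without replacement and that every set of five marbles is equally likely; the notes do not state this explicitly, and the function name is illustrative.

```python
from math import comb

def prob_at_least_k_red(red, blue, draws, k_min):
    """P(at least k_min red) when drawing `draws` marbles without replacement."""
    total = comb(red + blue, draws)  # number of equally likely selections
    favorable = sum(
        comb(red, k) * comb(blue, draws - k)  # ways to pick k red and the rest blue
        for k in range(k_min, min(draws, red) + 1)
    )
    return favorable / total

p_at_least_three_red = prob_at_least_k_red(red=20, blue=10, draws=5, k_min=3)
print(round(p_at_least_three_red, 4))
```

Under these assumptions, 115,254 of the 142,506 possible selections contain at least three red marbles, a probability of about 0.81.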
PROBABILITY DISTRIBUTION OF A SINGLE RANDOM VARIABLE
Normal distribution, also known as the Gaussian distribution, is a probability distribution that is
symmetric about the mean, showing that data near the mean are more frequent in occurrence than
data far from the mean.
In graphical form, the normal distribution appears as a "bell curve".
What Is Binomial Distribution?
The binomial distribution is a statistical distribution that gives the probability of each possible
number of successes in a fixed number of independent trials, where each trial has only two possible outcomes.
The binomial distribution is a common discrete distribution used in statistics, as opposed to a continuous
distribution, such as the normal distribution. This is because the binomial distribution only counts two
states, typically represented as 1 (for a success) or 0 (for a failure), given a number of trials in the data.
Normal, Binomial, Poisson and Exponential distributions
Any particular normal distribution is specified by its mean and standard deviation. By changing the
mean, the normal curve shifts to the right or left. By changing the standard deviation, the curve
becomes more or less spread out.
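The effect of the two parameters can be checked numerically with Python's standard-library statistics.NormalDist. The parameter values below are hypothetical and chosen only for illustration.

```python
from statistics import NormalDist

base = NormalDist(mu=0, sigma=1)     # reference curve
shifted = NormalDist(mu=2, sigma=1)  # same spread, mean moved to the right
wider = NormalDist(mu=0, sigma=2)    # same mean, larger standard deviation

# Changing the mean moves the peak but leaves its height unchanged.
peak_base = base.pdf(0)
peak_shifted = shifted.pdf(2)

# A larger standard deviation lowers the peak and spreads the curve out.
peak_wider = wider.pdf(0)

# The curve is symmetric about the mean.
left, right = base.pdf(-1), base.pdf(1)

print(peak_base, peak_shifted, peak_wider)
```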
Continuous Distributions and Density Functions
For continuous distributions such as the normal distribution, instead of a list of
possible values, there is a continuum of possible values, such as all values between 0 and 100 or all
values greater than 0. Instead of assigning probabilities to each individual value in the continuum, the
total probability of 1 is spread over this continuum.
The key to this spreading is called a density function, which acts like a histogram. The higher the value
of the density function, the more likely this region of the continuum is.
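Because probability is spread over a continuum, the probability of a region is the area under the density function over that region. The sketch below approximates that area with a midpoint sum and compares it with the exact value from the cumulative distribution function. The normal distribution with mean 50 and standard deviation 10 is a hypothetical choice, and area_under_density is an illustrative helper.

```python
from statistics import NormalDist

dist = NormalDist(mu=50, sigma=10)  # hypothetical parameters

def area_under_density(dist, a, b, steps=10_000):
    """Approximate P(a < X < b) as the area under the density (midpoint rule)."""
    width = (b - a) / steps
    return sum(dist.pdf(a + (i + 0.5) * width) for i in range(steps)) * width

approx = area_under_density(dist, 40, 60)
exact = dist.cdf(60) - dist.cdf(40)  # same probability from the CDF

print(round(approx, 4), round(exact, 4))
```

Both values are about 0.68, the familiar probability of falling within one standard deviation of the mean.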
BINOMIAL DISTRIBUTION
The binomial distribution is a discrete distribution that can occur in two situations: (1) when
sampling from a population with only two types of members (males and females, for example), and
(2) when performing a sequence of identical experiments, each of which has only two possible outcomes.
Consider a situation where there are n independent, identical trials, where the
probability of a success on each trial is p and the probability of a failure is 1 − p.
Define X to be the random number of successes in the n trials. Then X has a binomial
distribution with parameters n and p.
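For a binomial random variable X with parameters n and p, the standard formula is P(X = k) = C(n, k) p^k (1 − p)^(n − k), where C(n, k) counts the ways to choose which k trials are successes. The formula is not stated in these notes. The sketch below evaluates it for hypothetical values n = 10 and p = 0.3.

```python
from math import comb

def binomial_pmf(k, n, p):
    """P(X = k): probability of exactly k successes in n independent trials."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

n, p = 10, 0.3  # hypothetical parameters

pmf = [binomial_pmf(k, n, p) for k in range(n + 1)]

total = sum(pmf)                                # probabilities over k = 0..n sum to 1
mean = sum(k * pk for k, pk in enumerate(pmf))  # equals n * p

print(round(pmf[3], 4), round(total, 4), round(mean, 4))
```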
Suppose that a bank manager is studying the pattern of customer arrivals at her branch location. As
indicated previously in this section, the number of arrivals in an hour at a facility such as a bank is
often well described by a Poisson distribution with parameter λ, where λ represents the expected
number of arrivals per hour. An alternative way to view the uncertainty in the arrival process is to
consider the times between customer arrivals. The most common probability distribution used to
model these times, often called interarrival times, is the exponential distribution.
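The two views of the arrival process can be compared numerically. The sketch below assumes a hypothetical rate of λ = 12 arrivals per hour and uses the standard Poisson and exponential formulas, which are not stated in these notes. It checks that "no arrivals in the next 5 minutes" has the same probability as "the next interarrival time exceeds 5 minutes".

```python
from math import exp, factorial

lam = 12.0  # hypothetical: expected number of arrivals per hour

def poisson_pmf(k, mean):
    """P(N = k) for a Poisson count N with the given mean."""
    return mean**k * exp(-mean) / factorial(k)

def exponential_cdf(t, rate):
    """P(T <= t) for an exponential interarrival time T (t in hours)."""
    return 1 - exp(-rate * t)

t = 5 / 60  # five minutes, expressed in hours

p_zero_arrivals = poisson_pmf(0, lam * t)     # Poisson view: no arrivals in t
p_wait_longer = 1 - exponential_cdf(t, lam)   # exponential view: next gap exceeds t
mean_gap_minutes = 60 / lam                   # mean interarrival time is 1 / lam hours

print(round(p_zero_arrivals, 4), round(p_wait_longer, 4), mean_gap_minutes)
```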