STATISTICAL CONCEPTS-module1

STATISTICAL CONCEPTS

Uploaded by

Smitha Rajesh
Copyright
© © All Rights Reserved

MODULE- I

INTRODUCTION TO BIG DATA

Statistical concepts

 Statistics is a branch of applied or business mathematics in which we
collect, organize, analyse and interpret numerical facts. Statistical
methods are the concepts, models, and formulas of mathematics used in
the statistical analysis of data.
 They can be subdivided into two main categories: Descriptive Statistics
and Inferential Statistics.
 Descriptive statistics consists of measures of central tendency and
measures of dispersion; inferential statistics consists of estimation and
hypothesis testing.

1. Descriptive statistics

Descriptive statistical methods summarize or describe a sample of data in
various forms to give an overall picture of the data.
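
As a small illustration of the two kinds of descriptive measure named above, the sketch below computes measures of central tendency (mean, median) and of dispersion (range, standard deviation) with Python's standard `statistics` module. The data values are made up for illustration only.

```python
import statistics

# Made-up sample of observations, used only for illustration.
data = [4, 8, 15, 16, 23, 42]

# Measures of central tendency.
mean = statistics.mean(data)
median = statistics.median(data)

# Measures of dispersion.
data_range = max(data) - min(data)
stdev = statistics.stdev(data)  # sample standard deviation

print("mean:", mean)
print("median:", median)
print("range:", data_range)
print("standard deviation:", stdev)
```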

2. Inferential Statistics

In contrast, inferential statistics draws conclusions about the population
from which a sample was taken, or predicts outcomes, based on the sample.
RANDOM EXPERIMENTS
 A random experiment is a process whose outcome is uncertain before it is
observed.
 When we toss a coin the outcome is uncertain, and hence tossing a coin
is a random experiment.
 The result of a random experiment is known as the outcome, and the set of
all the possible outcomes of an experiment is known as the sample space.
 If we repeat an experiment n times, each repetition of the experiment is
known as a trial.
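
The terms outcome, sample space and trial can be illustrated with a short simulation. This is a minimal sketch; the names `SAMPLE_SPACE` and `run_trials` and the fixed seed are assumptions made for the example, not part of the module.

```python
import random
from collections import Counter

# Sample space of the coin-toss experiment: the set of all possible outcomes.
SAMPLE_SPACE = ("H", "T")

def run_trials(n, seed=0):
    """Repeat the coin-toss experiment n times; each repetition is one trial."""
    rng = random.Random(seed)  # fixed seed only so the sketch is repeatable
    return [rng.choice(SAMPLE_SPACE) for _ in range(n)]

outcomes = run_trials(1000)
counts = Counter(outcomes)

# Relative frequency of each outcome across the 1000 trials.
frequencies = {side: counts[side] / len(outcomes) for side in SAMPLE_SPACE}
print(frequencies)
```

With many trials the relative frequencies of Head and Tail should both settle near one half, although any single trial remains uncertain.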
RANDOM VARIABLES

 A random variable is a variable whose value is not known in advance; more
precisely, it is a function that assigns a numerical value to each of an
experiment's outcomes.
 Random variables are often designated by letters.
 Random variables can be classified as discrete, which take specific
separate values, and continuous, which can take any value within a
continuous range.
 A random variable is different from an algebraic variable. The variable in
an algebraic equation is an unknown value that can be calculated.
 A random variable, in contrast, has a set of possible values, and any of
those values can be the resulting outcome.
 Example: the number of heads when tossing a coin, or the number shown when
rolling a die.
Types of random variables

1. Discrete Random Variable

 As the name suggests, a discrete random variable takes a countable number
of distinct values.

 Discrete Random Variable Example:

♦ Tossing a Coin

Consider an experiment where a coin is tossed five times. Each toss has two
possible outcomes, Head or Tail. The number of heads observed is a discrete
random variable: it can only take the distinct values 0, 1, 2, 3, 4 or 5.
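
To make the five-toss experiment concrete, the sketch below treats the number of heads in five tosses as a discrete random variable X and enumerates its probability distribution. A fair coin is assumed, and the function name `heads_distribution` is illustrative.

```python
from fractions import Fraction
from itertools import product

def heads_distribution(tosses=5):
    """Return P(X = k), where X is the number of heads in `tosses` fair tosses."""
    outcomes = list(product("HT", repeat=tosses))  # full sample space
    pmf = {}
    for outcome in outcomes:
        k = outcome.count("H")  # the random variable maps an outcome to a number
        pmf[k] = pmf.get(k, 0) + Fraction(1, len(outcomes))
    return pmf

pmf = heads_distribution(5)
for k in sorted(pmf):
    print(k, pmf[k])
```

The variable takes only the six values 0 to 5, and the probabilities of those values add up to 1.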
2. Continuous Random Variable

A continuous random variable can take any value within a range. Examples
include the amount of rainfall in a city over a year, or the average height
of a random group of 25 people.

 Continuous Random Variable Example

► Heights of people playing Basketball.

Here the height can be any value between, say, 4 feet and 7 feet.

POPULATION AND SAMPLING

Population
 A population is an entire collection of objects or observations from which
we may collect data. It is the entire group we are interested in, which we
wish to describe or draw conclusions about.
 For each population, there are many possible samples. It is important
that the investigator carefully and completely defines the population
before collecting the sample, including a description of the members to
be included.
Sample
 A sample is a group of units selected from a larger group (the
population).
 By studying the sample, we hope to draw valid conclusions about the
larger group.
 A sample is generally selected for study because the population is too
large to study in its entirety. The sample should be representative of the
general population. This is often best achieved by random sampling.
SAMPLING DISTRIBUTION

 A sampling distribution is the probability distribution of a statistic
obtained from a large number of samples drawn from a specific population.
 It describes how the value of the statistic (for example, the sample mean)
varies from one sample to another when samples of the same size are drawn
repeatedly from that population.
 A lot of data drawn and used by academicians, statisticians, researchers,
marketers, and analysts are actually samples, not populations.
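
The idea of a sampling distribution can be approximated by simulation: draw many samples, compute the statistic for each, and look at how those values are spread. The sketch below does this for the sample mean. The population (the integers 1 to 1000), the sample size and the function name are all assumptions made for illustration.

```python
import random
import statistics

def sampling_distribution_of_mean(population, sample_size, num_samples, seed=0):
    """Draw many random samples and return the mean of each one."""
    rng = random.Random(seed)
    return [
        statistics.mean(rng.sample(population, sample_size))
        for _ in range(num_samples)
    ]

# Hypothetical population, made up for illustration.
population = list(range(1, 1001))
sample_means = sampling_distribution_of_mean(
    population, sample_size=30, num_samples=2000
)

print("population mean:", statistics.mean(population))
print("mean of sample means:", statistics.mean(sample_means))
print("spread of sample means:", statistics.stdev(sample_means))
```

The sample means cluster around the population mean, and their spread is narrower than the spread of the individual population values.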

[Figure: Sampling is divided into probability sampling and non-probability
sampling.]

Probability vs. Non-Probability Samples

As a group, sampling methods fall into one of two categories.

 Probability samples. With probability sampling methods, each population
element has a known (non-zero) chance of being chosen for the sample.
 Non-probability samples. With non-probability sampling methods, we
do not know the probability that each population element will be chosen,
and/or we cannot be sure that each population element has a non-zero
chance of being chosen.
Non-probability sampling methods offer two potential advantages - convenience
and cost. The main disadvantage is that non-probability sampling methods do
not allow you to estimate the extent to which sample statistics are likely to
differ from population parameters. Only probability sampling methods permit
that kind of analysis.

Non-Probability Sampling Methods

Two of the main types of non-probability sampling methods are voluntary
samples and convenience samples.

 Voluntary sample. A voluntary sample is made up of people who self-select
into the survey. Often, these people have a strong interest in the main
topic of the survey.

Suppose, for example, that a news show asks viewers to participate in an
online poll. This would be a voluntary sample. The sample is chosen by
the viewers, not by the survey administrator.

 Convenience sample. A convenience sample is made up of people who are
easy to reach.

Consider the following example. A pollster interviews shoppers at a local
mall. If the mall was chosen because it was a convenient site from which
to solicit survey participants and/or because it was close to the pollster's
home or business, this would be a convenience sample.

Probability Sampling Methods

The main types of probability sampling methods are simple random sampling,
stratified sampling, cluster sampling, multistage sampling, and systematic
random sampling. The key benefit of probability sampling methods is that
every population element has a known chance of selection, so the sample is
more likely to be representative of the population and the sampling error
can be estimated. This supports valid statistical conclusions.

 Simple random sampling. Simple random sampling refers to any sampling
method that has the following properties.
 The population consists of N objects.
 The sample consists of n objects.
 All possible samples of n objects are equally likely to occur.

There are many ways to obtain a simple random sample. One way would
be the lottery method. Each of the N population members is assigned a
unique number. The numbers are placed in a bowl and thoroughly mixed.
Then, a blind-folded researcher selects n numbers. Population members
having the selected numbers are included in the sample.
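
The lottery method can be sketched in a few lines. Python's `random.sample` draws without replacement, which plays the role of pulling numbers from the bowl. The population of 20 numbered members and the function name are assumptions for the example.

```python
import random

def simple_random_sample(population, n, seed=None):
    """Select n distinct members so that every size-n subset is equally likely."""
    rng = random.Random(seed)
    return rng.sample(population, n)  # sampling without replacement

# Hypothetical population of N = 20 numbered members.
members = list(range(1, 21))
chosen = simple_random_sample(members, n=5, seed=42)
print(chosen)
```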

 Stratified sampling. With stratified sampling, the population is divided
into groups, based on some characteristic. Then, within each group, a
probability sample (often a simple random sample) is selected. In
stratified sampling, the groups are called strata.

As an example, suppose we conduct a national survey. We might divide
the population into groups or strata, based on geography - north, east,
south, and west. Then, within each stratum, we might randomly select
survey respondents.
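
The national-survey example can be sketched as follows. The respondents, their region labels and the function name `stratified_sample` are made up for illustration; the sketch draws a simple random sample of equal size within every stratum, which is only one of several possible allocation rules.

```python
import random
from collections import defaultdict

def stratified_sample(units, stratum_of, per_stratum, seed=None):
    """Draw a simple random sample of size per_stratum within every stratum."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for unit in units:
        strata[stratum_of(unit)].append(unit)
    # Every stratum contributes elements to the final sample.
    return {name: rng.sample(group, per_stratum) for name, group in strata.items()}

# Hypothetical respondents tagged with a region (made-up data).
regions = ["north", "east", "south", "west"]
respondents = [(f"person{i}", regions[i % 4]) for i in range(40)]

sample = stratified_sample(
    respondents, stratum_of=lambda r: r[1], per_stratum=3, seed=1
)
for region, picked in sample.items():
    print(region, [name for name, _ in picked])
```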

 Cluster sampling. With cluster sampling, every member of the population
is assigned to one, and only one, group. Each group is called a
cluster. A sample of clusters is chosen, using a probability method (often
simple random sampling). Only individuals within sampled clusters are
surveyed.
Note the difference between cluster sampling and stratified sampling.
With stratified sampling, the sample includes elements from each
stratum. With cluster sampling, in contrast, the sample includes elements
only from sampled clusters.
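
A minimal sketch of cluster sampling, assuming made-up clusters (students grouped by school) and an illustrative function name. Whole clusters are chosen at random, and everyone in a chosen cluster is surveyed.

```python
import random

def cluster_sample(clusters, num_clusters, seed=None):
    """Randomly choose whole clusters and survey every member inside them."""
    rng = random.Random(seed)
    chosen_names = rng.sample(sorted(clusters), num_clusters)
    return {name: list(clusters[name]) for name in chosen_names}

# Hypothetical clusters: each student belongs to exactly one school.
schools = {
    "school_a": ["a1", "a2", "a3"],
    "school_b": ["b1", "b2"],
    "school_c": ["c1", "c2", "c3", "c4"],
    "school_d": ["d1", "d2"],
}
surveyed = cluster_sample(schools, num_clusters=2, seed=7)
print(surveyed)
```

Unlike the stratified case, clusters that are not chosen contribute no elements at all to the sample.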

 Multistage sampling. With multistage sampling, we select a sample by
using combinations of different sampling methods.

For example, in Stage 1, we might use cluster sampling to choose clusters
from a population. Then, in Stage 2, we might use simple random
sampling to select a subset of elements from each chosen cluster for the
final sample.
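
The two-stage example can be sketched by chaining the two methods. The village and household data and the function name `two_stage_sample` are hypothetical.

```python
import random

def two_stage_sample(clusters, num_clusters, per_cluster, seed=None):
    """Stage 1: choose clusters at random. Stage 2: sample elements within each."""
    rng = random.Random(seed)
    stage1 = rng.sample(sorted(clusters), num_clusters)
    return {
        name: rng.sample(clusters[name], min(per_cluster, len(clusters[name])))
        for name in stage1
    }

# Hypothetical clusters (made-up data): households grouped by village.
villages = {f"village_{v}": [f"{v}{i}" for i in range(1, 9)] for v in "abcde"}

final_sample = two_stage_sample(villages, num_clusters=2, per_cluster=3, seed=5)
print(final_sample)
```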

 Systematic random sampling. With systematic random sampling, we create a
list of every member of the population. From the list, we randomly select
the first sample element from the first k elements on the population list.
Thereafter, we select every kth element on the list.

This method is different from simple random sampling, since not every
possible sample of n elements is equally likely.
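
A minimal sketch of systematic random sampling, assuming a made-up list of 100 members and k = 10. The function name is illustrative.

```python
import random

def systematic_sample(population_list, k, seed=None):
    """Pick a random start among the first k elements, then every kth element."""
    rng = random.Random(seed)
    start = rng.randrange(k)  # index 0..k-1, i.e. one of the first k elements
    return population_list[start::k]

# Hypothetical ordered list of 100 population members.
members = list(range(1, 101))
picked = systematic_sample(members, k=10, seed=3)
print(picked)
```

Only k different samples are possible here (one per starting position), which is why this method is not equivalent to simple random sampling.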

RE-SAMPLING

 Resampling is a method that consists of drawing repeated samples from the
original data sample. It is a nonparametric method of statistical
inference. In other words, resampling does not rely on generic distribution
tables (for example, normal distribution tables) to compute approximate
p-values.
 In its most common form, resampling selects cases at random with
replacement from the original data sample, so that each resample has the
same number of cases as the original data sample. Because of the
replacement, a resample can contain the same case more than once.
 Resampling with replacement is known as bootstrapping; resampling methods
in general are sometimes grouped under Monte Carlo estimation.
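
Resampling with replacement can be sketched as follows. The observations are made up, and the helper names `bootstrap_means` and `percentile_interval` are assumptions for the example; the percentile interval shown is one simple way to summarize the resampled means, not the only one.

```python
import random
import statistics

def bootstrap_means(data, num_resamples=1000, seed=0):
    """Resample the data with replacement and return the mean of each resample."""
    rng = random.Random(seed)
    n = len(data)
    # Each resample has the same number of cases as the original data.
    return [statistics.mean(rng.choices(data, k=n)) for _ in range(num_resamples)]

def percentile_interval(values, lower=0.025, upper=0.975):
    """Return approximate lower and upper percentiles of the values."""
    ordered = sorted(values)
    low_index = int(lower * (len(ordered) - 1))
    high_index = int(upper * (len(ordered) - 1))
    return ordered[low_index], ordered[high_index]

# Made-up observations, used only for illustration.
data = [12, 15, 9, 20, 14, 11, 17, 13, 16, 10]

means = bootstrap_means(data)
low, high = percentile_interval(means)
print("sample mean:", statistics.mean(data))
print("approximate 95% interval for the mean:", (low, high))
```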

STATISTICAL INFERENCE

 The general idea that underlies statistical inference is the comparison of
particular statistics from an observational data set (e.g. the mean, the
standard deviation, the differences among the means of subsets of the
data) with an appropriate reference distribution in order to judge the
significance of those statistics.
 When the various assumptions are met, and specific hypotheses about the
values those statistics should take in practice have been specified,
statistical inference can be a powerful approach for drawing scientific
conclusions that makes efficient use of existing data or of data
collected for the specific purpose of testing those hypotheses.
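
One way to see a statistic being compared with a reference distribution is a permutation test on the difference between two group means. This is only a sketch of one possible approach: the two groups are made-up data, the function name is illustrative, and the reference distribution is built by shuffling group labels rather than taken from a table.

```python
import random
import statistics

def permutation_p_value(group_a, group_b, num_permutations=5000, seed=0):
    """Estimate how often a random relabelling gives a mean difference at
    least as large as the observed one (two-sided)."""
    rng = random.Random(seed)
    observed = abs(statistics.mean(group_a) - statistics.mean(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    extreme = 0
    for _ in range(num_permutations):
        rng.shuffle(pooled)  # one draw from the reference distribution
        diff = abs(statistics.mean(pooled[:n_a]) - statistics.mean(pooled[n_a:]))
        if diff >= observed:
            extreme += 1
    return extreme / num_permutations

# Made-up measurements for two subsets of a data set.
group_a = [5.1, 4.9, 5.6, 5.8, 6.0, 5.4]
group_b = [4.2, 4.8, 4.5, 4.1, 4.9, 4.4]

p_value = permutation_p_value(group_a, group_b)
print("estimated p-value:", p_value)
```

A small estimated p-value indicates that the observed difference would rarely arise if the group labels were unrelated to the measurements.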
