
Introduction to basic statistics and analysis
(UNIT-1)
Fundamentals of Data Analysis

The fundamentals of data analysis include understanding the data types, using
descriptive statistics to summarize data, exploring relationships between
variables using correlation and regression analysis, generating and testing
statistical hypotheses, and more. These concepts are essential for any data
analysis project, and understanding them is crucial for extracting meaningful
insights from data.

Data Analysis
• Data analysis is the process of collecting, cleaning, transforming, and
interpreting data to extract insights and knowledge from it. It involves a range
of techniques and methods to extract useful information from data, including
statistical analysis, machine learning, data visualization, and more.
• Data analysis involves collection, preparation, exploratory data analysis, and
drawing conclusions.
• Data analysis is essential in many fields, including business, science, finance,
healthcare, and more. It can help organizations and individuals make informed
decisions, identify patterns and trends, discover hidden insights, and more.
• Data preparation is typically the most time-consuming part of the analysis process; it is often estimated to account for as much as 80% of the work.

a) Data Collection
Data collection is the first step in any data analysis.
• Data can come from various sources, including web scraping, APIs, databases,
internet resources, and log files.
• It's important to collect data that will help draw relevant conclusions.
A generalized workflow: data collection → data wrangling → exploratory data analysis → drawing conclusions (steps a-d below).
b) Data wrangling
• Data wrangling is the process of preparing data for analysis.
• Data can be dirty, requiring cleaning before analysis.
• Common issues include human errors, computer errors, unexpected values, incomplete
information, resolution, relevance of fields, and format of data.
• Some data quality issues can be remedied, but others cannot.
Following are some common issues (a short pandas cleaning sketch appears after the list):
Human errors: data is recorded (or even collected) incorrectly, such as entering 100 instead of 1,000, or typos. In addition, there may be multiple versions of the same entry, such as "New York City" and "NYC".
Computer errors: perhaps entries weren't being recorded for a while (missing data).
Unexpected values: whoever recorded the data may have decided to use "?" for a missing value in a numeric column.
Incomplete information: think of a survey with optional questions; not everyone answers them, so we have missing data, but not due to computer or human error.
Resolution: the data may have been collected per second, while we need hourly data for our analysis.
Relevance of the fields: often, data is collected or generated as a by-product of some process rather than explicitly for our analysis; to get it into a usable state, we will have to clean it up.
Format of the data: the data may be recorded in a format that isn't conducive to analysis, which will require that we reshape it.
Misconfigurations in the data-recording process: data coming from sources such as misconfigured trackers and/or webhooks may be missing fields or pass them in the wrong order.
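To make these issues concrete, here is a minimal pandas cleaning sketch. The DataFrame, column names, and replacement values are hypothetical and only illustrate the kinds of fixes described above, not a prescribed recipe.

import pandas as pd

# Hypothetical messy data illustrating the issues listed above
df = pd.DataFrame({
    'city': ['New York City', 'NYC', 'Boston', None],
    'temp': ['?', '21.5', '19.0', '18.2'],   # '?' was used for a missing numeric value
    'sales': [100, 1000, 950, 875],          # 100 may be a typo for 1000
})

df['city'] = df['city'].replace({'NYC': 'New York City'})   # reconcile duplicate spellings
df['temp'] = pd.to_numeric(df['temp'], errors='coerce')     # '?' becomes NaN
df = df.dropna(subset=['city'])                             # drop rows missing required fields
df['temp'] = df['temp'].fillna(df['temp'].median())         # impute remaining missing values

Each of these steps maps to one of the issues above (human error, unexpected values, incomplete information); real datasets usually need a combination of them.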
c) Exploratory data analysis:
• EDA involves using visualizations and summary statistics to better understand the
data.
• Data visualization is essential to any analysis.
• Plots can reveal patterns that can't be observed otherwise.
• Plots can be used to observe how variables evolve over time, compare categories, find
outliers, and examine variable distributions.
• When analyzing data, it's important to consider if it's quantitative (measurable) or
categorical (descriptive), and to understand its subdivisions to determine what
operations can be performed on it. Categorical data can be nominal (no meaningful
order) or ordinal (rankable). Quantitative data can be interval (measurable but not
meaningfully comparable with multiplication/division) or ratio (meaningfully
comparable with ratios). For example, temperature is interval while prices, sizes, and
counts are ratio.
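A minimal EDA sketch along these lines; the DataFrame is hypothetical, with one categorical (nominal) and one quantitative (ratio) column, and the plot is just one of many possible visualizations.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
df = pd.DataFrame({
    'category': rng.choice(['red', 'green', 'blue'], size=200),   # nominal data
    'price': rng.gamma(shape=2.0, scale=10.0, size=200),          # ratio data
})

print(df.dtypes)                        # which columns are categorical vs quantitative
print(df.describe())                    # summary statistics for the quantitative columns
print(df['category'].value_counts())    # compare categories

df['price'].plot(kind='hist', bins=20, title='Distribution of price')   # examine the distribution
plt.show()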
d) Drawing conclusions
After we have collected the data for our analysis, cleaned it up, and performed some
thorough EDA, it is time to draw conclusions. This is where we summarize our findings
from EDA and decide the next steps:
• Did we notice any patterns or relationships when visualizing the data?
• Does it look like we can make accurate predictions from our data?
• Does it make sense to move to modeling the data?
• Do we need to collect new data points?
• How is the data distributed?
• Does the data help us answer the questions we have or give insight into the problem we
are investigating?
Statistical foundations

•Statistics is often used to make observations about the data we are analyzing.
•The data we have is called the sample, which is a subset of the population.
•There are two broad categories of statistics: descriptive and inferential.
•Descriptive statistics are used to describe the sample.
•Inferential statistics are used to infer something about the population.
•Sample statistics are used as estimators of population parameters; using them this way requires quantifying the estimators' bias and variance.
•There are many methods for quantifying bias and variance, including parametric and non-parametric approaches.
Sampling
•Random sampling is crucial for any analysis to be valid.
• Samples must be representative of the population
• Sampling should not be biased in any way.
• All distinct groups from the population should ideally be represented in the sample.
• Machine learning often involves resampling, or selecting a sample from a sample.
• Simple random sampling is often the best method, where rows are chosen at random.
• When distinct groups exist in the data, stratified random sampling preserves their
proportion.
• Random sampling with replacement (bootstrapping) is sometimes used when there's
not enough data for the other sampling methods.
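The three sampling strategies above can be sketched with pandas (the grouped sampling call assumes pandas 1.1 or newer); the DataFrame and its group column are synthetic stand-ins, not a specific dataset.

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({
    'group': rng.choice(['A', 'B', 'C'], size=1000, p=[0.6, 0.3, 0.1]),
    'value': rng.normal(size=1000),
})

simple = df.sample(n=100, random_state=0)                          # simple random sampling
stratified = df.groupby('group').sample(frac=0.1, random_state=0)  # stratified: preserves group proportions
bootstrap = df.sample(n=len(df), replace=True, random_state=0)     # sampling with replacement (bootstrapping)

print(df['group'].value_counts(normalize=True))
print(stratified['group'].value_counts(normalize=True))            # proportions roughly match the population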
Descriptive statistics
• Descriptive statistics are used to describe and/or summarize the data we are working with. Univariate
statistics are calculated from one (uni) variable. The statistics will be calculated per variable we are
recording.

• Measures of central tendency describe where most of the data is centered. Examples of measures of central tendency are the mean, median, and mode.

• Measures of spread or dispersion indicate how far apart values are. Examples of measures of spread
or dispersion are range, variance, and standard deviation.

• Histograms and box plots can also be used to visualize the distribution of the data.
Measures of central tendency:
• Measures of central tendency describe the center of a distribution of data.
• Three common statistics used as measures of center are mean, median, and mode.
a) Mean:
• The mean is the most common statistic used to summarize data.
• The population mean is denoted by μ (mu), and the sample mean is denoted by x̄ (x-bar).
• The sample mean is calculated by summing all the values and dividing by the count of values:
x̄ = (x1 + x2 + ... + xn) / n
• xi represents the ith observation of the variable X, and Σ (the Greek capital letter sigma) denotes a summation running from i = 1 to n, where n is the number of observations.
Example: the mean of [0, 1, 1, 2, 9] is 2.6 ((0 + 1 + 1 + 2 + 9)/5 = 13/5).
b) Median:
• The median is another measure of central tendency and is robust to outliers. It represents the middle value in an ordered list of values. For example, if we take the numbers [0, 1, 1, 2, 9] again, the median is 1.
• When there is an even number of values, the median is the average of the two middle values.
• The median represents the 50th percentile of the data: 50% of values are greater than it and 50% are less than it.
c) Mode:
•Mode is the most frequently occurring value in a dataset.
•It can be useful in cases where we want to know the most common value or if the distribution is bimodal or
multimodal.
•A unimodal distribution has only one mode, a bimodal distribution has two modes, and a multimodal
distribution has many modes.
•It is not as useful as mean and median in statistical analysis but can provide valuable insights in some
cases.
•Mode is not affected by outliers and can be useful in skewed distributions where the mean and median may
not be representative of the data.
•The mode can be calculated for both discrete and continuous data.
Most of the time when we're describing our data, we will use either the mean or the median as our measure of
central tendency.
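A quick sketch of computing the three measures of center in Python, using the standard library's statistics module and the same example list as above:

import statistics

data = [0, 1, 1, 2, 9]

print(statistics.mean(data))    # 2.6 -> pulled upwards by the outlier 9
print(statistics.median(data))  # 1   -> robust to the outlier
print(statistics.mode(data))    # 1   -> the most frequent value

The gap between the mean (2.6) and the median (1) already hints that the data is skewed by the value 9.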
Measures of Spread:
•Knowing the center of the distribution only gets us partway to summarizing our data; we also need to know how values fall around the center and how far apart they are.
•Measures of spread tell us how the data is dispersed; this indicates how narrow (low dispersion) or wide (high dispersion) our distribution is.
•As with measures of central tendency, we have several ways to describe the spread of a
distribution, and which one we choose will depend on the situation and the data.

a) Range:
• The range is the distance between the smallest value (minimum) and the largest value (maximum): range = max(x) − min(x).
• It gives us upper and lower bounds on what we have in the data; however, if we have any outliers, the range is rendered useless.
• Another problem with the range is that it doesn't tell us how the data is dispersed around its center; it only tells us how dispersed the entire dataset is.
b) Variance:
• The variance describes how far apart observations are spread out from their average value (the mean). The population variance is denoted sigma squared (σ²), and the sample variance is written s².
• The variance is the average of the squared differences from the mean. It is useful because it tells us how much variation there is within the dataset, but it can be difficult to interpret since it is expressed in squared units.
• The formula for calculating the sample variance is:
s² = Σ (xi − x̄)² / (n − 1)
c) Standard Deviation
• The standard deviation measures how much the data deviates from the mean and is expressed in the same units as the data.
• A smaller standard deviation indicates that the data points are more tightly clustered around the mean, while a larger standard deviation indicates that the data points are more spread out.
• The formula for calculating the sample standard deviation is the square root of the sample variance:
s = √( Σ (xi − x̄)² / (n − 1) )
• The shape of the distribution curve changes with the standard deviation: a smaller standard deviation gives a skinnier peak, while a larger standard deviation gives a fatter (wider) peak. The accompanying plot compares standard deviations of 0.5 and 2.
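A minimal NumPy sketch of the range, sample variance, and sample standard deviation for the same example list; ddof=1 gives the sample versions with the n − 1 denominator used in the formulas above.

import numpy as np

data = np.array([0, 1, 1, 2, 9])

data_range = data.max() - data.min()   # range = 9 - 0 = 9
sample_var = data.var(ddof=1)          # sample variance, n - 1 in the denominator
sample_std = data.std(ddof=1)          # sample standard deviation = sqrt(sample variance)
print(data_range, sample_var, sample_std)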
d) Coefficient of variation:
• The coefficient of variation (CV) is a measure of relative variability.
• It is the ratio of the standard deviation to the mean.
• It is used to compare the level of dispersion of one dataset to another.
• The CV is expressed as a percentage (%).
• A smaller CV indicates less variability or dispersion in the data, while a larger CV indicates
greater variability or dispersion.
• The CV is particularly useful when comparing data with different units or scales.

•For example, if we want to compare the variability of salaries of two companies, one with a
mean salary of $50,000 and a standard deviation of $10,000 and the other with a mean salary
of $70,000 and a standard deviation of $20,000, we can use the CV to compare them. The first
company has a CV of 20% (10,000/50,000) and the second company has a CV of 28.6%
(20,000/70,000), indicating that the second company has a higher level of salary variability
than the first company.
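The salary comparison above can be reproduced in a couple of lines; the figures are exactly the ones quoted in the example, and the helper function is just for illustration.

def coefficient_of_variation(std_dev, mean):
    # CV = standard deviation / mean, expressed as a percentage
    return std_dev / mean * 100

print(coefficient_of_variation(10_000, 50_000))  # 20.0  -> company 1
print(coefficient_of_variation(20_000, 70_000))  # ~28.6 -> company 2 has more relative variability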
e) Interquartile Range (IQR)
• Measures of spread describe how data is dispersed and how values fall around the center of
the distribution.
•There are several ways to describe the spread of a distribution including range, variance,
standard deviation, and coefficient of variation.
•The range is the distance between the minimum and maximum values.

•Variance and standard deviation are based on the mean and describe how much the data
deviates from the mean.
•Coefficient of variation (CV) is the ratio of standard deviation to the mean and is used to
compare the level of dispersion between datasets with different units.
•Quantiles are values that divide data into equal groups, each containing the same percentage of the total data. Percentiles divide the data into 100 parts, while quartiles divide it into four (the 25th, 50th, and 75th percentiles mark the quartile boundaries Q1, Q2, and Q3).
•The interquartile range (IQR = Q3 − Q1) is the distance between the 3rd and 1st quartiles and gives us the spread of the data around the median. It quantifies how much dispersion we have in the middle 50% of our distribution.
f) Quartile coefficient of dispersion
The quartile coefficient of dispersion is a measure of dispersion around the median.
• It is a unitless measure, which means it can be used to compare datasets with different units.
• The midhinge is the midpoint between the first and third quartiles: (Q1 + Q3) / 2.
• The semi-quartile range is half of the interquartile range: IQR / 2.
• The formula for the quartile coefficient of dispersion is: (0.5 × IQR) / midhinge, which simplifies to (Q3 − Q1) / (Q3 + Q1).
• It is a robust measure of dispersion that is not affected by outliers.
• A higher quartile coefficient of dispersion indicates greater variability in the dataset.
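A short NumPy sketch of the IQR and the quartile coefficient of dispersion; the data array is made up for illustration.

import numpy as np

data = np.array([2, 4, 4, 5, 7, 8, 9, 12, 15, 20])   # hypothetical sample

q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1                                         # spread of the middle 50% of the data
midhinge = (q1 + q3) / 2
qcd = (iqr / 2) / midhinge                            # equivalently (q3 - q1) / (q3 + q1)
print(iqr, qcd)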
Summarizing Data
• Descriptive statistics summarize data by its center and dispersion
• The 5-number summary provides five descriptive statistics that summarize the data: the minimum, the first quartile (Q1, the 25th percentile), the median (Q2, the 50th percentile), the third quartile (Q3, the 75th percentile), and the maximum.
•The 5-number summary is helpful for a quick overview of the data.
•Visualizing the distribution can also be helpful for understanding the shape and spread of the
data
•Other metrics for summarizing data include mean, standard deviation, range, interquartile
range, and quartile coefficient of dispersion
•Box plots provide a visual representation of
the 5-number summary.
•The median is represented by a thick line in
the box.
•The top and bottom of the box represent Q3
and Q1, respectively.
•Lines (whiskers) extend from both sides of the
box boundaries towards the minimum and
maximum.
•Any values beyond the whiskers are marked as
outliers using points.
•The lower bound of the whiskers is Q1 - 1.5 *
IQR and the upper bound is Q3 + 1.5 * IQR,
which is called the Tukey box plot.
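A sketch of producing the 5-number summary and a Tukey box plot with pandas and matplotlib; the series is synthetic, and the default whiskers in pandas/matplotlib box plots extend 1.5 × IQR beyond the quartiles, matching the description above.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

rng = np.random.default_rng(42)
s = pd.Series(rng.normal(loc=50, scale=10, size=200))   # hypothetical data

# 5-number summary: min, Q1, median, Q3, max
print(s.describe()[['min', '25%', '50%', '75%', 'max']])

s.plot(kind='box', title='Tukey box plot')               # whiskers at 1.5 * IQR by default
plt.show()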
Scaling Data
•When comparing variables from different distributions, we need to scale the data.
•One way to scale the data is to use the range through min-max scaling:
• For each data point, subtract the minimum of the dataset and then divide by the range.
• This normalizes the data and scales it to the range [0, 1].

•Another way to scale the data is to use the mean and standard deviation through
standardization:
• For each observation, subtract the mean and then divide by the standard deviation.
• This gives us a Z-score, which is a normalized distribution with a mean of 0 and a
standard deviation of 1.
• The Z-score tells us how many standard deviations from the mean each observation is.
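Both scaling approaches can be written directly with pandas; the series below is synthetic (for feature matrices, scikit-learn's MinMaxScaler and StandardScaler perform the equivalent transformations).

import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
x = pd.Series(rng.normal(loc=100, scale=15, size=500))   # hypothetical data

min_max_scaled = (x - x.min()) / (x.max() - x.min())     # values squeezed into [0, 1]
z_scores = (x - x.mean()) / x.std()                      # standardized: mean ~0, standard deviation ~1

print(min_max_scaled.min(), min_max_scaled.max())
print(round(z_scores.mean(), 2), round(z_scores.std(), 2))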
Quantifying relationships between variables
• In the previous sections we dealt with univariate statistics, which only allow us to analyze one variable at a time.
• Multivariate statistics enable us to measure relationships between variables.
• This allows us to quantify correlations and make predictions about future behavior.
• The covariance is a statistic that measures the relationship between two variables.
• It shows their joint variance and indicates the direction of the relationship:
  • Positive covariance means the variables tend to increase or decrease together.
  • Negative covariance means the variables tend to move in opposite directions.
  • Covariance near zero means there is no linear relationship between the variables.
• However, the covariance is not standardized and depends on the units of the variables. To address this issue, we can use the correlation coefficient, which standardizes the covariance and ranges from -1 to 1.
• The covariance can be written as cov(X, Y) = E[(X − E[X])(Y − E[Y])]. E(X) is read as the expected value of X or the expectation of X, and it is calculated by summing all the possible values of X multiplied by their probability; it is the long-run average of X.

• Covariance indicates the relationship between two variables, but the magnitude is not easy to
interpret.
• The sign of the covariance tells us if the variables are positively or negatively correlated.
•To quantify the strength of the relationship, we use correlation.
•Correlation measures how variables change together, in both direction and magnitude.
•To find the correlation, we use the Pearson correlation coefficient (symbolized by ρ): calculate the covariance between the two variables, then divide it by the product of their standard deviations, ρ = cov(X, Y) / (σX σY).
•This normalizes the covariance to a value between -1 and 1, where -1 indicates a perfect negative correlation, 1 indicates a perfect positive correlation, and 0 indicates no correlation.
•Values near 1 in absolute value indicate strong correlation, while those near 0.5 indicate weak correlation. (A short pandas sketch of covariance and correlation appears at the end of this subsection.)
• Scatter plots are a visual representation of correlation.
•A plot with ρ = 0.11 indicates no correlation between the variables.
•A plot with ρ = -0.52 indicates weak negative correlation.
•A plot with ρ = 0.87 indicates strong positive correlation.
•A plot with ρ = -0.99 indicates near perfect negative correlation.
•The scatter plots show how the variables change together in direction and strength.
•Correlation does not imply causation, and other factors may influence the relationship between the
variables.
•Causation refers to a relationship between two events or variables in which one event (the cause) brings
about or produces the other event (the effect).
• Correlation does not necessarily imply causation, and there could be other variables
involved that cause both variables of interest.
• It is often difficult to establish causation in scientific research, and more evidence is
needed beyond a correlation.
• Visualizations, such as scatter plots, are useful for quickly identifying the presence of a
relationship between variables and their direction and strength.
• Nonlinear relationships between variables can be identified through visualizations as
well, which may not be apparent through numerical analysis.
The two plots show data with strong positive correlations.
• The left plot has a logarithmic relationship between the variables, while the right one has an
exponential relationship.
• It's important to visually examine the data using scatter plots, as the relationship between
variables may not always be linear.
• Scatter plots help in identifying non-linear relationships, such as logarithmic, exponential,
quadratic, or other non-linear functions.
• Knowing the type of relationship between variables is important for making accurate
predictions and understanding the data.
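Here is the covariance/correlation sketch referred to above. The two series are synthetic and constructed so that y depends on x; the Spearman call is included only because rank correlation can pick up monotonic non-linear relationships like those just discussed.

import numpy as np
import pandas as pd

rng = np.random.default_rng(7)
x = pd.Series(rng.normal(size=300))
y = 2 * x + pd.Series(rng.normal(scale=0.5, size=300))   # y is built to move with x

print(x.cov(y))                        # covariance: the sign shows the direction of the relationship
print(x.corr(y))                       # Pearson correlation: cov(x, y) / (std(x) * std(y)), between -1 and 1
print(x.corr(y, method='spearman'))    # rank correlation, useful for monotonic non-linear relationships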
Pitfalls of Summary Statistics
• Summary statistics, such as mean, standard deviation, and correlation coefficients, can be
misleading when used alone to describe data.
• Plotting data is important to visually examine the relationships and patterns within the data.
• Anscombe's quartet is a famous example of four different datasets with identical summary
statistics and correlation coefficients, but with vastly different relationships when plotted.
• Summary statistics can hide important information about the data, such as outliers or non-
linear relationships.
• Therefore, it is important to use both summary statistics and visualizations to get a full
understanding of the data.
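Anscombe's quartet ships with seaborn as a sample dataset (loading it fetches a small file, so it assumes an internet connection). A quick groupby shows the nearly identical summary statistics, while plotting reveals the very different shapes.

import seaborn as sns

anscombe = sns.load_dataset('anscombe')   # columns: dataset, x, y

# Nearly identical means, standard deviations, and correlations for all four datasets
print(anscombe.groupby('dataset').agg(['mean', 'std']))
print(anscombe.groupby('dataset').apply(lambda g: g['x'].corr(g['y'])))

# The plots, however, look completely different
sns.lmplot(data=anscombe, x='x', y='y', col='dataset', col_wrap=2)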
Prediction and forecasting
Say our favorite ice cream shop has asked us to help predict how many ice creams they can
expect to sell on a given day.
They are convinced that the temperature outside has a strong influence on their sales, so
they collected data on the number of ice creams sold at a given temperature.
We agree to help them, and the first thing we do is make a scatter plot of the data they gave
us:
We can observe an upward trend in the scatter plot:
• More ice creams are sold at higher temperatures.
• In order to help out the ice cream shop, though, we need to find a way to make predictions from
this data. We can use a technique called regression to model the relationship between
temperature and ice cream sales with an equation.
• Using this equation, we will be able to predict ice cream sales at a given temperature.
• Regression is a statistical technique used to model the relationship between a dependent variable
and one or more independent variables.
• In this case, we want to predict ice cream sales based on temperature, so we will use simple
linear regression. The dependent variable is ice cream sales, and the independent variable is
temperature.
• Simple linear regression models the relationship as a straight line.
• We will use the data we have collected to find the equation of the line that best fits the data.
• The equation of the line will allow us to make predictions about ice cream sales based on
temperature.
The regression line in the scatter plot yields an equation for the relationship between temperature and ice cream sales; working backwards from the two example predictions quoted below, the fitted line is approximately: ice cream sales ≈ 1.5 × temperature − 27.96.
Example:
Today the temperature is 35°C, so we plug that in for temperature in the equation. The
result predicts that the ice cream shop will sell 24.54 ice creams. This prediction is
along the red line in the previous plot.

Note: Remember that correlation does not imply causation. People may buy ice cream when it is warmer, but warmer temperatures don't necessarily cause people to buy ice cream.
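A minimal sketch of fitting a simple linear regression with NumPy's least-squares polynomial fit. The temperature/sales numbers here are invented stand-ins for the shop's data, so the fitted coefficients will not match the equation above exactly.

import numpy as np

# Hypothetical (temperature in °C, ice creams sold) observations
temps = np.array([20, 22, 25, 28, 30, 33, 35, 38])
sales = np.array([2, 5, 9, 14, 17, 22, 25, 29])

slope, intercept = np.polyfit(temps, sales, deg=1)   # least-squares fit of a straight line
predicted = slope * 35 + intercept                   # predict sales at 35°C (interpolation)
print(slope, intercept, predicted)

For a fit with standard errors and a p-value, scipy.stats.linregress or statsmodels can be used instead; the idea is the same.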
•When using the regression line to make predictions, it's important to understand the
difference between the solid and dotted portions of the line.
•The solid portion of the line is used for interpolation, meaning that we are predicting ice
cream sales for temperatures within the range of temperatures we used to create the
regression.
•The dotted portion of the line is used for extrapolation, meaning that we are predicting ice
cream sales for temperatures outside of the range of temperatures we used to create the
regression.
•Extrapolation can be dangerous because many trends don't continue indefinitely and we may
be making predictions for situations that we have no data on. For example, if we predict ice
cream sales for a temperature of 45°C, it may be so hot that people decide not to leave their
houses and we could be predicting zero sales instead of the predicted 39.54.

Note: We can also predict categories; this type of prediction is called classification. Imagine that the ice cream shop wants to know which flavor of ice cream will sell the most on a given day.
•Forecasting is a type of prediction specific to time series: it involves predicting future values based on past values.
•Before modeling a time series, it's common to use time series decomposition to split it into components.
•The trend component describes the long-term behavior of the time series, without
accounting for seasonal or cyclical effects.
•The seasonality component explains systematic and calendar-related movements of the time
series.
•The cyclical component accounts for unexplained or irregular changes in the time series,
and is difficult to anticipate with a forecast.
•We can use Python to decompose a time series into trend, seasonality, and residuals (which
capture the cyclical component).
•The residual is what's left after removing the trend and seasonality from the time series.
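A sketch of decomposing a time series with statsmodels. The series here is synthetic (an upward trend plus yearly seasonality plus noise), and period=12 assumes monthly data with yearly seasonality; real data would be loaded instead.

import numpy as np
import pandas as pd
from statsmodels.tsa.seasonal import seasonal_decompose

# Synthetic monthly series: trend + seasonality + noise
idx = pd.date_range('2018-01-01', periods=60, freq='MS')
rng = np.random.default_rng(0)
values = (np.linspace(10, 40, 60)
          + 5 * np.sin(2 * np.pi * np.arange(60) / 12)
          + rng.normal(scale=1, size=60))
series = pd.Series(values, index=idx)

result = seasonal_decompose(series, model='additive', period=12)
result.plot()                        # panels for observed, trend, seasonal, and residual components
print(result.resid.dropna().head())  # residuals: what is left after removing trend and seasonality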
Inferential statistics
• Inferential statistics is used to infer or deduce things about the population based on
sample data.
• Conclusions drawn from inferential statistics must take into account whether the
study was observational or an experiment.
• Observational studies are where the independent variable is not under the control
of researchers and we observe those taking part in the study. Examples of
observational studies include studies on smoking, where we cannot force people to
smoke.
•Because we cannot control the independent variable in observational studies, we
cannot conclude causation.
•Experiments, on the other hand, involve manipulating the independent variable to
determine its effect on the dependent variable.
•In experiments, researchers can control the independent variable, allowing for
conclusions about causation.
•In an experiment, researchers are able to directly manipulate the independent variable and randomly assign subjects to control and test groups.
•Examples of experiments include A/B tests, which are commonly used in website
redesigns and ad copy.
•The control group in experiments does not receive treatment but may be given a
placebo depending on the study.
•The ideal setup for experiments is double-blind, where researchers administering the
treatment do not know which is the placebo and which subject belongs to which group.
•By controlling the independent variable and randomly assigning subjects to groups,
experiments allow for conclusions about causation.
•The independent variable is the variable that is being manipulated by the researchers,
while the dependent variable is the variable being measured to see if there is an effect.
•Inferential statistics allows us to make statements about the population based on sample data.
•Sample statistics are estimators for population parameters.
•Estimators require confidence intervals, which provide a point estimate and a margin of error.
•Confidence intervals represent the range in which the true population parameter is expected
to be found at a certain confidence level.
•The most commonly used confidence level is 95%, but other levels such as 90% and 99% are
also used.
•A higher confidence level leads to a wider interval.
•At the 95% confidence level, 95% of the confidence intervals calculated from random
samples of the population contain the true population parameter.
•Confidence intervals are a way to measure the uncertainty associated with estimates from
sample data.
•Hypothesis tests allow us to test whether the true population parameter is less than, greater than, or
not equal to some value at a certain significance level (alpha).
•The initial assumption or null hypothesis is stated at the beginning of the hypothesis testing
process.
•The level of statistical significance, usually 5%, represents the probability of rejecting the null
hypothesis when it is true.
•The critical value for the test statistic is calculated based on the amount of data and the type of
statistic being tested.
•The critical value is compared to the test statistic from our data, and we decide to either reject or
fail to reject the null hypothesis.
•Hypothesis tests are closely related to confidence intervals, with the significance level being
equivalent to 1 minus the confidence level.
•A result is statistically significant if the null hypothesis value is not in the confidence interval.
•Hypothesis testing is a powerful tool for making conclusions about the population based on sample
data, but it is important to choose appropriate significance levels and interpret results carefully.
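A sketch of a 95% confidence interval and a one-sample t-test using scipy. The sample is synthetic and the null-hypothesis value of 100 is just an example; in practice these would come from the problem at hand.

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
sample = rng.normal(loc=103, scale=10, size=40)   # synthetic sample data

# 95% confidence interval for the population mean (t distribution)
mean = sample.mean()
sem = stats.sem(sample)                           # standard error of the mean
ci_low, ci_high = stats.t.interval(0.95, df=len(sample) - 1, loc=mean, scale=sem)

# H0: population mean = 100, tested at alpha = 0.05
t_stat, p_value = stats.ttest_1samp(sample, popmean=100)
print((ci_low, ci_high), t_stat, p_value, p_value < 0.05)

Note how the two tools agree: if 100 falls outside the 95% confidence interval, the test rejects the null hypothesis at the 5% significance level.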
Essential Python Libraries
1) NumPY
•NumPy is a numerical computing package for Python.
•It provides data structures, algorithms, and library glue for scientific applications involving
numerical data in Python.
•The main data structure in NumPy is the multidimensional array object ndarray, which is fast and
efficient.
•NumPy provides functions for performing element-wise computations with arrays, as well as
mathematical operations between arrays.
•NumPy includes tools for reading and writing array-based datasets to disk.
•NumPy supports linear algebra operations, Fourier transform, and random number generation.
•NumPy has a mature C API that allows Python extensions and native C or C++ code to access
NumPy's data structures and computational facilities.
•NumPy is a fundamental tool in scientific computing with Python and is widely used in data
science, machine learning, and other fields
• NumPy is commonly used in data analysis as a container for data to be passed
between algorithms and libraries.
• NumPy arrays are more efficient for storing and manipulating numerical data
than other built-in Python data structures.
• Libraries written in lower-level languages, such as C or Fortran, can operate
on the data stored in a NumPy array without copying data into another
memory representation, making it a popular choice for numerical computing
tools in Python.
• Many numerical computing tools for Python assume NumPy arrays as the
primary data structure or aim to have seamless interoperability with NumPy.
• NumPy is widely used in data analysis, machine learning, and scientific
computing in Python due to its efficient array processing capabilities and wide
range of functions and tools.
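A few lines showing the ndarray basics described above:

import numpy as np

arr = np.array([[1.5, 2.0, 3.5],
                [4.0, 5.5, 6.0]])        # a 2 x 3 ndarray

print(arr.shape, arr.dtype)              # (2, 3) float64
print(arr * 10)                          # element-wise (vectorized) arithmetic
print(arr.mean(axis=0))                  # column means
print(np.dot(arr, arr.T))                # linear algebra: 2 x 2 matrix product
print(np.random.default_rng(0).normal(size=3))   # random number generation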
2) pandas
• pandas is a high-level library for working with structured or tabular
data in Python.
• It was first introduced in 2010 and has since become a cornerstone of
data analysis in Python.
• pandas provides powerful data structures, including the DataFrame and
Series objects.
•The DataFrame is a tabular data structure with labeled rows and
columns, making it easy to work with structured data.
•The Series is a one-dimensional labeled array object, often used to
represent a single column of data in a DataFrame.
•With pandas, it is fast, easy, and expressive to manipulate, clean, and
transform data.
•pandas combines NumPy's array computing with the flexible data manipulation capabilities of spreadsheets and relational databases.
•It offers sophisticated indexing functionality for reshaping, slicing, and performing aggregations on data.
•Data manipulation, preparation, and cleaning are crucial skills in data analysis, and pandas is a primary focus of this unit.
•The creator of pandas, Wes McKinney, started building it in 2008 to
address specific requirements that were not met by other tools at the time
•These requirements included data structures with labeled axes supporting
automatic or explicit data alignment, integrated time series functionality,
arithmetic operations and reductions that preserve metadata, and flexible
handling of missing data
•Merge and other relational operations found in popular databases, such
as SQL, are also included in pandas.
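A short sketch of the Series/DataFrame basics and the kinds of operations mentioned above (missing-data handling, SQL-style merges, aggregations); the data is made up.

import numpy as np
import pandas as pd

# DataFrame: tabular data with labeled rows and columns
sales = pd.DataFrame({
    'store': ['A', 'A', 'B', 'B'],
    'month': ['Jan', 'Feb', 'Jan', 'Feb'],
    'revenue': [100.0, 120.0, np.nan, 90.0],
})
stores = pd.DataFrame({'store': ['A', 'B'], 'city': ['Pune', 'Mumbai']})

sales['revenue'] = sales['revenue'].fillna(0)      # flexible handling of missing data
merged = sales.merge(stores, on='store')           # SQL-style join
print(merged.groupby('city')['revenue'].sum())     # aggregation, as in relational databases

# Series: a one-dimensional labeled array (a single column of a DataFrame)
print(merged['revenue'].describe())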
