DA notes
1. What is Statistics?
Statistics is the science of collecting, organizing, analyzing, and interpreting
data to draw conclusions and support decisions.
Types of Statistics:
Descriptive Statistics: Summarize and describe the main features of a dataset.
Inferential Statistics: Draw conclusions about a population from a sample.
Key terms:
Data: The raw facts, figures, or measurements collected for analysis.
Sample: A subset of a population selected for study.
Probability: A measure of how likely an event is to occur.
Central Tendency: A single value describing the center of a dataset (mean, median, mode).
Variability: How spread out the values in a dataset are (range, variance, standard deviation).
Hypothesis Testing: A procedure for deciding whether sample data support a claim about a population.
Regression: Modeling the relationship between a dependent variable and one or more predictors.
Correlation: A measure of the strength and direction of the association between two variables.
Sampling: The process of selecting a subset of a population for study.
Problem Solving: Statistical methods can be used to identify and solve problems.
Data Science: Statistical concepts are fundamental to data science and machine
learning.
Statistical methods
are used to analyze data and draw conclusions. They include descriptive statistics,
inferential statistics, predictive analysis, and exploratory data analysis.
Descriptive statistics
Summarize data using measures such as the mean, median, and standard deviation
Present data in charts, graphs, and tables
Make complex data easier to read and understand
Inferential statistics
Draw conclusions from data that are subject to random variation
Study the relationship between different variables
Make predictions for the whole population
Use techniques like hypothesis testing and regression analysis
Predictive analysis
Analyze historical data to identify past trends and predict future events
Use machine learning algorithms, data mining, data modeling, and artificial
intelligence
Exploratory data analysis
Explore unknown data associations, analyze potential relationships within
the data, and identify patterns.
Statistical methods are used in research, data analysis, and to make informed
decisions. They can help to eliminate bias from data evaluation and improve
research designs.
Data preparation is the process of cleaning and transforming raw data so it's
ready for analysis. Data cleaning is a key step in data preparation that involves
fixing errors and improving data quality.
Data preparation steps:
Data collection: Gather raw data from various sources, such as databases, web
APIs, or manual entry
Data cleaning: Identify and correct errors, inconsistencies, and anomalies
Data integration: Combine and merge datasets to create a unified dataset
Data transformation: Convert the structure and format of the data
Data normalization: Scale numeric data to a standard range
Data profiling: Identify relationships, connections, and other attributes in data sets
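The normalization step above can be sketched as min-max scaling, which maps numeric values onto the [0, 1] range (the values below are illustrative):

```python
import numpy as np

# Min-max normalization: scale numeric data to a standard [0, 1] range.
# (Illustrative values; any numeric column works the same way.)
values = np.array([10.0, 20.0, 35.0, 50.0])
normalized = (values - values.min()) / (values.max() - values.min())
print(normalized)  # smallest value maps to 0.0, largest to 1.0
```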
Data cleaning techniques: filling in missing values, filtering out duplicates or
invalid entries, standardizing formats, cross-checking information, and enriching
records with additional attributes.
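A minimal sketch of these cleaning techniques using pandas — the dataset and column names here are hypothetical:

```python
import pandas as pd

# Hypothetical raw dataset with the problems listed above:
# missing values, duplicate rows, and inconsistent formats.
raw = pd.DataFrame({
    "name": ["Alice", "bob", "Alice", None],
    "age": [30, None, 30, 25],
    "city": ["NYC", "nyc", "NYC", "LA"],
})

cleaned = raw.copy()

# Filling in missing values
cleaned["age"] = cleaned["age"].fillna(cleaned["age"].median())
cleaned["name"] = cleaned["name"].fillna("unknown")

# Standardizing formats (consistent capitalization)
cleaned["name"] = cleaned["name"].str.title()
cleaned["city"] = cleaned["city"].str.upper()

# Filtering out duplicates
cleaned = cleaned.drop_duplicates().reset_index(drop=True)

print(cleaned)
```

After standardizing formats, the first and third rows become identical, so deduplication leaves three clean rows.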
Benefits of data preparation
Improves data quality
Reduces noise and irrelevant data
Streamlines data for faster and easier processing
Provides clear and well-organized data for better business decisions
You can automate data preparation tasks with tools like KNIME.
Missing data and errors, if not addressed, can significantly impact data analysis
and machine learning model performance, potentially leading to biased or
inaccurate results. Understanding the types of missing data and errors is crucial for
effective data cleaning and handling.
Statistical errors occur when the data collected from a study doesn't match the true
value of the population being studied. This can happen for several reasons,
including sampling error, measurement error, or bias.
Types of statistical errors
Sampling error: The difference between a statistic computed from a sample and the
true value for the whole population.
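Sampling error can be illustrated with a quick simulation: draw a random sample from a known population and compare the sample mean with the population mean (the numbers below are synthetic):

```python
import random

random.seed(0)

# Hypothetical population of 10,000 values with a known mean.
population = [random.gauss(50, 10) for _ in range(10_000)]
pop_mean = sum(population) / len(population)

# A small random sample usually misses the true mean slightly.
sample = random.sample(population, 100)
sample_mean = sum(sample) / len(sample)

sampling_error = sample_mean - pop_mean
print(f"population mean={pop_mean:.2f}, sample mean={sample_mean:.2f}, "
      f"sampling error={sampling_error:.2f}")
```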
Bias and Confounding:
In the context of research and data analysis, inference bias and confounding refer
to errors that distort the relationship between an exposure and an outcome, leading
to potentially misleading conclusions. Understanding and addressing these issues
is crucial for drawing valid causal inferences.
Bias:
Bias refers to systematic errors that distort the true relationship between an
exposure and an outcome.
● Examples: Selection bias (participants are selected in a way that skews
the results), information bias (errors in collecting data), or measurement
bias (using tools that are not accurate).
Confounding:
Confounding occurs when a third variable, a "confounder," influences both the
exposure and the outcome, creating a spurious association.
● Example: If a study shows an association between coffee drinking and
heart disease, but coffee drinkers also tend to smoke more, smoking
could be a confounder, obscuring the true relationship between coffee
and heart disease.
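The coffee/smoking scenario can be simulated to show how a confounder creates a spurious association — all probabilities below are made up purely for illustration; here "disease" depends only on smoking, never on coffee:

```python
import random

random.seed(1)
n = 5000

# Confounder: smoking influences BOTH coffee drinking and disease here.
smoker = [random.random() < 0.3 for _ in range(n)]
coffee = [random.random() < (0.8 if s else 0.3) for s in smoker]
disease = [random.random() < (0.2 if s else 0.05) for s in smoker]

def rate(outcome, group):
    rows = [o for o, g in zip(outcome, group) if g]
    return sum(rows) / len(rows)

# Crude comparison shows a spurious coffee-disease association...
crude_coffee = rate(disease, coffee)
crude_none = rate(disease, [not c for c in coffee])
print(f"crude: coffee={crude_coffee:.3f}, no coffee={crude_none:.3f}")

# ...that shrinks away once we stratify by the confounder (smokers only):
smoker_coffee = rate(disease, [c and s for c, s in zip(coffee, smoker)])
smoker_none = rate(disease, [(not c) and s for c, s in zip(coffee, smoker)])
print(f"smokers only: coffee={smoker_coffee:.3f}, no coffee={smoker_none:.3f}")
```

The crude rates differ even though coffee plays no causal role, which is exactly the spurious association described above; stratifying by smoking status removes it.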
Importance of Distinguishing Between Bias and Confounding:
● Valid Causality: Understanding the difference is critical to accurately
interpret research findings and establish causal relationships.
● Policy and Practice: Misinterpretations due to bias or confounding can
lead to ineffective or harmful policies and practices.
● Causal inference: In causal inference, understanding and adjusting for
bias and confounding is essential to accurately estimate the causal effect
of an intervention or exposure.
Key terms
● Null hypothesis: The default assumption that there is no effect or
difference between groups or conditions
● Alternative hypothesis: The theory that there is a relationship between
variables
● P-value: The probability of obtaining results at least as extreme as those
observed, assuming the null hypothesis is true
● Significance level: The probability of rejecting the null hypothesis when it
is true
Example
You might use hypothesis testing to determine if the average weight of a dumbbell
in a gym is higher than 90 lbs.
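That example can be sketched as a one-sample t-test — the weights below are simulated stand-ins for real measurements:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Hypothetical sample of dumbbell weights (lbs) measured in the gym.
weights = rng.normal(loc=95, scale=8, size=30)

# H0: mean weight = 90 lbs; H1: mean weight > 90 lbs (one-sided test).
t_stat, p_value = stats.ttest_1samp(weights, popmean=90, alternative="greater")
print(f"t={t_stat:.2f}, p={p_value:.4f}")

if p_value < 0.05:  # significance level
    print("Reject the null hypothesis: mean appears greater than 90 lbs.")
else:
    print("Fail to reject the null hypothesis.")
```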
Confidence Interval:
A confidence interval is a range of values within which a population parameter
(like the mean) is likely to fall, given a certain level of confidence (e.g., 95%).
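A 95% confidence interval for a mean can be computed with the t-distribution — the sample here is simulated for illustration:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
sample = rng.normal(loc=100, scale=15, size=50)  # hypothetical measurements

mean = sample.mean()
sem = stats.sem(sample)  # standard error of the mean

# 95% CI using the t-distribution with n-1 degrees of freedom.
low, high = stats.t.interval(0.95, df=len(sample) - 1, loc=mean, scale=sem)
print(f"95% CI for the mean: ({low:.1f}, {high:.1f})")
```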
Power and robustness: In the context of systems or models, "power" refers to the
ability to detect true effects or differences when they exist, while "robustness"
refers to the ability to maintain performance and stability under various conditions
and variations in data or the environment.
Power:
Definition:
The power of a test, model, or system is its capacity to identify a true effect or
signal when it is present. For statistical tests, it's the probability of rejecting a false
null hypothesis, or in other words, the likelihood of a test finding a difference that
truly exists.
Importance:
A powerful system or test is better at uncovering important insights or detecting
true anomalies.
Example:
In the context of statistical analysis, a test is considered powerful if it is good at
correctly identifying a statistically significant difference between two groups
when a real difference exists.
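Power can be estimated by simulation: generate many datasets in which a real difference exists and count how often the test detects it. The effect size, sample size, and significance level below are illustrative choices:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)

# Monte Carlo estimate of power: the fraction of repeated experiments in
# which a two-sample t-test rejects the (false) null hypothesis.
def estimated_power(effect=0.8, n=30, alpha=0.05, trials=2000):
    rejections = 0
    for _ in range(trials):
        a = rng.normal(0.0, 1.0, n)
        b = rng.normal(effect, 1.0, n)  # a true difference exists
        if stats.ttest_ind(a, b).pvalue < alpha:
            rejections += 1
    return rejections / trials

power = estimated_power()
print(f"estimated power: {power:.2f}")
```

For this effect size and sample size the test detects the real difference most of the time; shrinking the effect or the sample lowers the estimate.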
Robustness:
Definition:
Robustness means a system or model remains reliable and performs well
despite changes, uncertainties, or variations in its inputs or operating
environment.
Importance:
Robustness ensures a system's reliability and stability even when faced with
unexpected inputs, changes in conditions, or errors.
Examples:
● A model is considered robust if it continues to make accurate
predictions even with slight variations in the data it's trained on.
● A system is robust if it can maintain its functionality even when
facing network issues or partial component failures.
● In AI, a robust model can perform well on new, unseen data, even if it
contains noise or variations not present in the training data.
● In statistics, a robust statistical test is one that is not highly sensitive to
deviations from assumptions (e.g., normality) or outliers in the data.
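The last point can be shown with a classic comparison: the median is a robust estimator, while the mean is not (the data below are illustrative):

```python
import numpy as np

# One extreme outlier barely moves the median but drags the mean far away.
data = np.array([10.0, 11.0, 9.0, 10.5, 9.5])
with_outlier = np.append(data, 1000.0)

print("mean:  ", data.mean(), "->", with_outlier.mean())        # 10.0 -> 175.0
print("median:", np.median(data), "->", np.median(with_outlier))  # 10.0 -> 10.25
```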