
Stata: Basic Note

What is the difference between a histogram and a bar graph?
• Although histograms and bar charts use a column-based
display, they serve different purposes.
• A bar graph is used to compare discrete or categorical
variables in a graphical format whereas a histogram depicts the
frequency distribution of variables in a dataset.
• Histograms visualize quantitative data or numerical data,
whereas bar charts display categorical variables.
• In most instances, the numerical data in a histogram will be continuous (able to take on infinitely many possible values).
Bar charts
• A bar chart or bar graph is a type of data visualization used to compare discrete data categories or groups.
• It works best when the data are shown as separate, non-adjacent horizontal bars (a bar chart) or vertical columns (a column chart), because values displayed in separate bars are easy to compare.
• For this reason, bar charts are commonly used for nominal and categorical data, e.g. product categories, cities, months, countries, and similar discrete values.
• Bar charts usually represent categorical variables or discrete variables, or continuous variables grouped into class intervals (see the Stata sketch below).
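As a rough illustration, here is a minimal Stata sketch (assuming Stata's bundled auto dataset; the variables rep78 and the choice of counts are illustrative, not part of the note) that draws bar charts of category frequencies:

    * load Stata's bundled example dataset
    sysuse auto, clear
    * vertical bar chart: number of cars in each repair-record category
    graph bar (count), over(rep78)
    * the same counts as horizontal bars
    graph hbar (count), over(rep78)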
Histogram

• A histogram is a data visualization type designed to show the distribution of interval or continuous data. In histograms, data is shown in the form of contiguous bars, where each bar corresponds to a data range or a bin.
• You would use a histogram when you want to visualize the
frequency or count of data points within each of those data ranges
and understand how the data is distributed.
• There are two axes on a histogram.
• The horizontal axis (x-axis) shows the range of values, divided into bins. Each bar represents one bin, i.e. one range of data values.
• The vertical axis (y-axis) is the frequency or count of data points
that belong to each data range or bin on the x-axis.
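A minimal Stata sketch of a histogram, again assuming the bundled auto dataset (the variable mpg and the choice of 10 bins are illustrative assumptions):

    * load Stata's bundled example dataset
    sysuse auto, clear
    * histogram of a continuous variable, split into 10 bins,
    * with the y-axis showing frequency (counts) rather than density
    histogram mpg, bin(10) frequency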
Pie Chart
• A pie chart is a type of graph that represents data in a circular graph. The slices of the pie show the relative sizes of the data, and it is a type of pictorial representation of data.
• A pie chart requires a categorical variable and a numerical variable. Here, the term “pie” represents the whole, and the “slices” represent the parts of the whole.
• The pie chart is also known as a “circle chart”: it divides the circular statistical graphic into sectors or sections to illustrate numerical proportions. Each sector denotes a proportionate part of the whole.
• A pie chart works best when you want to show the composition of a whole. In such cases it can take the place of other graphs such as bar graphs, line plots, or histograms (see the Stata sketch below).
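A minimal Stata sketch of a pie chart, assuming the bundled auto dataset (the grouping variable foreign is an illustrative choice):

    * pie chart: share of domestic vs foreign cars
    sysuse auto, clear
    graph pie, over(foreign)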
Multiple Regression Explanation,
Assumptions, and Interpretation
• There are many types of regression models, but here we will deal with only three of them:
1. Simple regression model
2. Multiple regression model
3. Multivariate regression model
1. Simple regression model: a statistical equation that characterizes the relationship between a dependent variable and only one independent variable.
2. Multiple regression model: a mathematical model that characterizes the relationship between a dependent variable and two or more independent variables.
Cont’d…
• A multivariate regression model is an algebraic system of equations that characterizes the relationship between more than one dependent variable and one or more independent variables through a set of statistical regression models (see the Stata sketch below).
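As a hedged sketch of the three model types in Stata (the bundled auto dataset and the particular variables are assumptions made for illustration only):

    sysuse auto, clear
    * 1. simple regression: one dependent variable, one independent variable
    regress price mpg
    * 2. multiple regression: one dependent variable, several independent variables
    regress price mpg weight foreign
    * 3. multivariate regression: several dependent variables modeled jointly
    mvreg headroom trunk = weight length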
What's your approach to interpreting
regression analysis results?
• Regression analysis is a powerful tool for data analysis that
allows you to explore the relationship between a dependent
variable and one or more independent variables.
• However, interpreting the results of a regression analysis can
be challenging, especially if you are not familiar with the
assumptions, limitations, and pitfalls of the method.
• In this course, you will learn a practical approach to
interpreting regression analysis results, based on four key
steps: checking the model fit, examining the coefficients,
testing the hypotheses, and assessing the validity.
1. Check the model fit

• The first step in interpreting regression analysis results is to check how well the model fits the data.
• This means evaluating how closely the predicted values match
the observed values, and how much of the variation in the
dependent variable is explained by the independent variables.
• There are several statistics that can help you assess the model
fit, such as R-squared, adjusted R-squared, standard error, F-
test, and residuals.
• You should look for a high R-squared, a low standard error, a
significant F-test, and normally distributed residuals with no
outliers or patterns.
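A minimal Stata sketch of this step (the auto dataset and the chosen predictors are illustrative assumptions):

    sysuse auto, clear
    regress price mpg weight
    * R-squared and adjusted R-squared are stored by -regress-
    display "R-squared = " e(r2) "  adjusted R-squared = " e(r2_a)
    * save the residuals and look for non-normality, outliers, or patterns
    predict resid, residuals
    histogram resid, normal
    rvfplot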
2. Examine the coefficients

• The second step in interpreting regression analysis results is to examine the coefficients of the independent variables.
• The coefficients tell you the direction and magnitude of the
effect of each independent variable on the dependent variable,
holding all other variables constant.
• You should pay attention to the sign, size, and significance of
the coefficients, and compare them with your expectations and
prior knowledge.
• You should also look for any signs of multicollinearity, which
is a situation where two or more independent variables are
highly correlated and affect the reliability of the coefficients.
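A short Stata sketch of this step, under the same illustrative assumptions; estat vif reports variance inflation factors after regress:

    sysuse auto, clear
    * coefficients, their signs, and their significance appear in the regression table
    regress price mpg weight length
    * variance inflation factors; values far above 10 are a common
    * rule-of-thumb warning sign of multicollinearity
    estat vif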
3. Test the hypotheses
• The third step in interpreting regression analysis results is to test the
hypotheses that you have formulated before conducting the
analysis.
• The hypotheses are statements about the relationship between the
dependent variable and the independent variables, such as whether
there is a positive or negative effect, or whether there is a difference
between groups or levels.
• To test the hypotheses, you need to look at the p-values and
confidence intervals of the coefficients, and compare them with a
significance level that you have chosen.
• The p-value tells you the probability of observing a coefficient as
extreme or more extreme than the one obtained, assuming that there
is no effect. The confidence interval tells you the range of values
that contain the true coefficient with a certain level of confidence.
• You can reject the null hypothesis if the p-value is lower than the significance level (e.g. 5% or 10%), or if the confidence interval does not include zero.
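A minimal Stata sketch of this step (the dataset and the hypotheses tested are illustrative assumptions):

    sysuse auto, clear
    * the regression table reports p-values and 95% confidence intervals
    regress price mpg weight, level(95)
    * Wald test of the null hypothesis that the coefficient on mpg is zero
    test mpg = 0
    * joint test that both coefficients are zero
    test mpg weight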
4. Assess the validity
• The fourth and final step in interpreting regression analysis results
is to assess the validity of the model and the assumptions that
underlie it.
• The validity refers to how well the model represents the true
relationship between the variables, and how generalizable and
robust it is to different situations and data sets.
• To assess the validity, you need to check whether the assumptions
of the regression method are met, such as linearity, independence,
homoscedasticity, and normality. You can use various diagnostic
tests and plots to check these assumptions, and apply appropriate
transformations or corrections if they are violated.
• You should also consider any potential confounding factors,
omitted variables, or endogeneity issues that might bias the results,
and address them with suitable methods, such as adding control
variables, using instrumental variables, or applying fixed effects.
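A sketch of some common post-estimation checks in Stata, under the same illustrative assumptions:

    sysuse auto, clear
    regress price mpg weight
    * Breusch-Pagan test for heteroskedasticity
    estat hettest
    * Ramsey RESET test, often read as a check for omitted variables
    * or an incorrect functional form
    estat ovtest
    * one possible correction: heteroskedasticity-robust standard errors
    regress price mpg weight, vce(robust)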
Key Difference Between R-squared and
Adjusted R-squared for Regression Analysis
R-Squared
• R-squared measures the proportion of the variance in the
dependent variable explained by the independent variables in the
model.
• It ranges from 0 to 1, where 0 indicates that the model does not explain any of the variability and 1 indicates that it explains all of the variability.
• Higher R-squared values suggest a better fit, but it doesn’t
necessarily mean the model is a good predictor in an absolute
sense.
• R-squared is a goodness-of-fit measure that tends to reward you
for including too many independent variables in a regression
model, and it doesn’t provide any incentive to stop adding more.
Some Problems with R-squared
• Unfortunately, there are yet more problems with R-squared that
we need to address.
• Problem 1: R-squared increases (or at least never decreases) every time you add an independent variable to the model, even when the new variable is correlated with the outcome only by chance. A regression model that contains more independent variables than another can therefore look like it provides a better fit merely because it contains more variables.
• Problem 2: When a model contains an excessive number of
independent variables and polynomial terms, it becomes overly
customized to fit the peculiarities and random noise in your
sample rather than reflecting the entire population. Statisticians
call this overfitting the model, and it produces deceptively high R-
squared values and a decreased capability for precise predictions.
• Fortunately for us, adjusted R-squared and predicted R-squared
address both of these problems.
Cont’d…
Adjusted R-Squared
• Adjusted R-squared addresses a limitation of R-squared, especially in multiple regression (models with more than one independent variable).
• While R-squared tends to increase as more variables are added to the model (even if they don’t improve the model significantly), adjusted R-squared penalizes the addition of unnecessary variables.
• It considers the number of predictors in the model and adjusts R-
squared accordingly. This adjustment helps to avoid overfitting,
providing a more accurate measure of the model’s goodness of
fit.
• Use adjusted R-squared to compare the goodness-of-fit for
regression models that contain differing numbers of independent
variables.
Comparison

• R-squared will stay the same or increase when more predictors are added, even if they do not contribute meaningfully. It may therefore give a falsely optimistic view of the model.
• Adjusted R-squared is more conservative and will decrease if
additional variables do not contribute to the model’s
explanatory power.
• As a rule of thumb, a higher R-squared or adjusted R-squared is desirable, but it is crucial to consider the context of the specific analysis and the trade-off between model complexity and explanatory power.
Cont’d…
• Let’s say you are comparing a model with five independent variables to a model with one variable, and the five-variable model has a higher R-squared. Is the model with five variables actually a better model, or does it just have more variables? To determine this, just compare the adjusted R-squared values!
• The adjusted R-squared adjusts for the number of terms in the
model. Importantly, its value increases only when the new term
improves the model fit more than expected by chance alone. The
adjusted R-squared value actually decreases when the term
doesn’t improve the model fit by a sufficient amount.
• The example below shows how the adjusted R-squared
increases up to a point and then decreases. On the other hand, R-
squared blithely increases with each and every additional
independent variable.
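As a hedged sketch of such an example in Stata (the auto dataset and the order in which predictors are added are assumptions; the exact values, and the point at which adjusted R-squared stops rising, depend on the data):

    sysuse auto, clear
    * fit a sequence of nested models and print both statistics
    foreach rhs in "mpg" "mpg weight" "mpg weight length" "mpg weight length turn" {
        quietly regress price `rhs'
        display "predictors: `rhs'"
        display "  R-squared = " %6.4f e(r2) "  adjusted R-squared = " %6.4f e(r2_a)
    }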
