For time series data, use a line chart.

Common trendline models:
Linear: y = a + bx
Logarithmic: y = a + b ln(x)
Polynomial (2nd order): y = ax² + bx + c
Polynomial (3rd order): y = ax³ + bx² + cx + d
Power: y = ax^b
Exponential: y = ab^x (the base of natural logarithms, e = 2.71828…, is often used for the constant b)

To fit a trendline in Excel, right-click on the data series and choose Add Trendline from the pop-up menu, then check the boxes Display Equation on chart and Display R-squared value on chart.

R² (R-squared) is a measure of the "fit" of the line to the data.
◦ The value of R² will be between 0 and 1.
◦ A value of 1.0 indicates a perfect fit, in which all data points lie on the line; the larger the value of R², the better the fit.

Linear demand function: Sales = 20,512 − 9.5116 × price

Line chart of historical crude oil prices: Excel's Trendline tool is used to fit various functions to the data.
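The same kinds of fits can be reproduced outside Excel. Below is a minimal Python sketch, using a made-up placeholder series rather than the actual crude oil prices, that fits polynomial trendlines of increasing order and an exponential trendline and reports R² for each:

```python
import numpy as np

# Placeholder series standing in for a column of historical prices;
# substitute the real data before drawing any conclusions.
y = np.array([61.8, 63.0, 65.9, 64.1, 67.4, 70.2, 73.9, 74.0, 78.6, 81.3])
x = np.arange(1, len(y) + 1)              # time index 1, 2, 3, ...

def r_squared(y_actual, y_fitted):
    """R^2 = 1 - SS_residual / SS_total."""
    ss_res = np.sum((y_actual - y_fitted) ** 2)
    ss_tot = np.sum((y_actual - y_actual.mean()) ** 2)
    return 1 - ss_res / ss_tot

# Polynomial trendlines of increasing order (order 1 = linear).
for order in range(1, 5):
    coeffs = np.polyfit(x, y, order)       # least-squares polynomial fit
    print(f"order {order}: R^2 = {r_squared(y, np.polyval(coeffs, x)):.3f}")

# Exponential trendline y = a*e^(bx): fit a straight line to ln(y).
# (R^2 here is computed on the original scale, so it may differ slightly
# from what a particular tool reports for its exponential trendline.)
b, ln_a = np.polyfit(x, np.log(y), 1)
print(f"exponential: R^2 = {r_squared(y, np.exp(ln_a) * np.exp(b * x)):.3f}")
```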
Trendline fits for the crude oil price data:
Exponential: y = 50.49e^(0.021x), R² = 0.664
Logarithmic: y = 13.02 ln(x) + 39.60, R² = 0.382
Polynomial (2nd order): y = 0.13x² − 2.399x + 68.01, R² = 0.905
Polynomial (3rd order): y = 0.005x³ − 0.111x² + 0.648x + 59.497, R² = 0.928
Power: y = 45.96x^0.0169, R² = 0.397

The R² value will continue to increase as the order of the polynomial increases; that is, a 4th-order polynomial will provide a better fit than a 3rd-order one, and so on. However, higher-order polynomials will generally not be very smooth and will be difficult to interpret visually.
◦ Thus, we don't recommend going beyond a third-order polynomial when fitting data. Use your eye to make a good judgment!

Regression analysis is a tool for building mathematical and statistical models that characterize relationships between a dependent (ratio) variable and one or more independent, or explanatory, variables (ratio or categorical), all of which are numerical.
Simple linear regression involves a single independent variable; multiple regression involves two or more independent variables.

Simple linear regression finds a linear relationship between one independent variable X and one dependent variable Y. First prepare a scatter plot to verify that the data has a linear trend; use alternative approaches if the data is not linear.

Example: the size of a house is typically related to its market value.
X = square footage
Y = market value ($)
The scatter plot of the full data set (42 homes) indicates a linear trend.
Market value = a + b × square feet
Two possible lines are shown below.
Line A is clearly a better fit to the data.
We want to determine the best regression line.
Market value = $32,673 + $35.036 × square feet
◦ The estimated market value of a home with 2,200 square feet would be:
market value = $32,673 + $35.036 × 2,200 = $109,752
The regression model explains variation in market value due to the size of the home. It provides better estimates of market value than simply using the average.

Simple linear regression model: Y = β0 + β1X + ε

We estimate the parameters β0 and β1 from the sample data, giving the estimated regression line Ŷ = b0 + b1X.
Let Xi be the value of the independent variable for the ith observation. When the value of the independent variable is Xi, then Ŷi = b0 + b1Xi is the estimated value of Y for Xi.

Residuals are the observed errors associated with estimating the value of the dependent variable using the regression line: ei = Yi − Ŷi.

To run the regression in Excel: Data > Data Analysis > Regression
◦ Input Y Range (with header)
◦ Input X Range (with header)
◦ Check Labels
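The same estimates and residuals can be computed directly from the least-squares formulas; here is a minimal numpy sketch using placeholder numbers rather than the actual 42-home data set:

```python
import numpy as np

# Placeholder sample standing in for the Home Market Value data.
square_feet  = np.array([1500, 1600, 1750, 1800, 2000, 2200, 2400])
market_value = np.array([88000, 91000, 95000, 99000, 102000, 109000, 115000])

x, y = square_feet, market_value

# Least-squares estimates: b1 = S_xy / S_xx, b0 = ybar - b1 * xbar.
b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()

y_hat = b0 + b1 * x          # estimated values Yhat_i = b0 + b1 * X_i
residuals = y - y_hat        # observed errors e_i = Y_i - Yhat_i

print(f"b0 = {b0:,.2f}, b1 = {b1:.3f}")
print("residuals:", np.round(residuals, 1))
```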
Excel outputs a table with many useful regression statistics:
◦ Multiple R: |r|, where r is the sample correlation coefficient. The value of r varies from −1 to +1 (r is negative if the slope is negative).
◦ R Square: the coefficient of determination, R², which varies from 0 (no fit) to 1 (perfect fit).
◦ Adjusted R Square: adjusts R² for the sample size and the number of X variables.
◦ Standard Error: variability between the observed and predicted Y values. This is formally called the standard error of the estimate, SYX.

For the home market value data, 53% of the variation in market values can be explained by home size. The standard error of $7,287 is less than the standard deviation (not shown) of $10,553.

Residual = actual Y value − predicted Y value
Standard residual = residual / standard deviation of the residuals
Rule of thumb: standard residuals outside of ±2 or ±3 are potential outliers.
Excel provides a table and a plot of residuals.
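The outlier rule of thumb is easy to automate once actual and predicted values are available; a small sketch following the definition above (the arrays are placeholders, not values from the example):

```python
import numpy as np

def flag_outliers(actual, predicted, cutoff=2.0):
    """Standard residual = residual / standard deviation of the residuals;
    values beyond +/- cutoff (2 or 3) are flagged as potential outliers."""
    residuals = np.asarray(actual, dtype=float) - np.asarray(predicted, dtype=float)
    standard_residuals = residuals / residuals.std(ddof=1)
    return standard_residuals, np.abs(standard_residuals) > cutoff

# Placeholder actual and predicted Y values.
actual    = [95000, 102000, 88000, 131000, 99000, 104000]
predicted = [97000, 100500, 90000,  99000, 98500, 105000]

std_res, is_outlier = flag_outliers(actual, predicted)
for z, flag in zip(std_res, is_outlier):
    print(f"standard residual = {z:6.2f}{'  <- potential outlier' if flag else ''}")
```

Note that different tools may compute standard residuals with a slightly different denominator, so values can differ a little from this simple definition.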
This point has a standard residual of 4.53.

Assumptions of regression analysis:
◦ Linearity: examine the scatter diagram (should appear linear) and the residual plot (should appear random).
◦ Normality of errors: view a histogram of the standard residuals; regression is robust to departures from normality.
◦ Homoscedasticity: variation about the regression line is constant; examine the residual plot.
◦ Independence of errors: successive observations should not be related. This is important when the independent variable is time.

Checking the assumptions for the home market value regression:
◦ Linearity: linear trend in the scatterplot, no pattern in the residual plot.
◦ Normality of errors: the residual histogram appears slightly skewed, but this is not a serious departure.
◦ Homoscedasticity: the residual plot shows no serious difference in the spread of the data for different X values.
◦ Independence of errors: because the data is cross-sectional, we can assume this assumption holds.

A linear regression model with more than one independent variable is called a multiple linear regression model. We estimate the regression coefficients, called partial regression coefficients, b0, b1, b2, …, bk, and then use the model:
Ŷ = b0 + b1X1 + b2X2 + … + bkXk
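A multiple linear regression of this form can be fit with statsmodels; here is a minimal sketch using a small synthetic data set with generic column names (X1, X2, X3, Y are placeholders, not variables from the examples):

```python
import pandas as pd
import statsmodels.api as sm

# Small synthetic data set; substitute real data and real column names.
data = pd.DataFrame({
    "X1": [12, 15, 11, 18, 14, 16, 13, 17],
    "X2": [3.2, 2.8, 3.9, 2.1, 3.0, 2.5, 3.5, 2.2],
    "X3": [40, 55, 38, 62, 47, 58, 42, 60],
    "Y":  [71, 78, 66, 88, 75, 82, 70, 86],
})

X = sm.add_constant(data[["X1", "X2", "X3"]])   # adds the intercept term b0
model = sm.OLS(data["Y"], X).fit()

print(model.params)          # b0 and the partial regression coefficients b1, b2, b3
print("R^2 =", round(model.rsquared, 3),
      "  adjusted R^2 =", round(model.rsquared_adj, 3))
print(model.pvalues)         # p-values used to judge significance
```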
The partial regression coefficients represent the expected change in the dependent variable when the associated independent variable is increased by one unit while the values of all other independent variables are held constant.

Example: predict student graduation rates using several indicators. Regression model:
The value of R² indicates that 53% of the variation in the dependent variable is explained by these independent variables. All coefficients are statistically significant.

A good regression model should include only significant independent variables. However, it is not always clear exactly what will happen when we add or remove variables from a model; variables that are (or are not) significant in one model may (or may not) be significant in another.
◦ Therefore, you should not consider dropping all insignificant variables at one time, but rather take a more structured approach.

Adding an independent variable to a regression model will always result in R² equal to or greater than the R² of the original model. Adjusted R² reflects both the number of independent variables and the sample size and may either increase or decrease when an independent variable is added or dropped. An increase in adjusted R² indicates that the model has improved.

A systematic approach (sketched in code after the list):
1. Construct a model with all available independent variables. Check for significance of the independent variables by examining the p-values.
2. Identify the independent variable having the largest p-value that exceeds the chosen level of significance.
3. Remove the variable identified in step 2 from the model and evaluate adjusted R². (Don't remove all variables with p-values that exceed α at the same time; remove only one at a time.)
4. Continue until all variables are significant.
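A hedged sketch of this one-variable-at-a-time procedure using statsmodels, with a synthetic stand-in data set and hypothetical column names:

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

def backward_eliminate(X, y, alpha=0.05):
    """Drop the least significant variable one at a time until every
    remaining variable has p-value <= alpha, re-fitting at each step
    and reporting adjusted R^2 as each variable is removed."""
    X = sm.add_constant(X)
    while True:
        model = sm.OLS(y, X).fit()
        pvalues = model.pvalues.drop("const")     # ignore the intercept
        if pvalues.empty or pvalues.max() <= alpha:
            return model
        worst = pvalues.idxmax()
        print(f"dropping {worst} (p = {pvalues[worst]:.3f}); "
              f"adjusted R^2 before dropping = {model.rsquared_adj:.4f}")
        X = X.drop(columns=[worst])               # remove only one variable, then re-fit

# Synthetic stand-in for the Banking Data set; names are hypothetical.
rng = np.random.default_rng(1)
n = 100
data = pd.DataFrame({
    "Age":       rng.normal(45, 10, n),
    "Education": rng.normal(14, 2, n),
    "Income":    rng.normal(60, 15, n),
    "HomeValue": rng.normal(180, 40, n),
})
data["Balance"] = 2 + 0.5 * data["Age"] + 0.8 * data["Income"] + rng.normal(0, 5, n)

final = backward_eliminate(data[["Age", "Education", "Income", "HomeValue"]],
                           data["Balance"])
print(final.params)
```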
Banking Data: Home Value has the largest p-value; drop it and re-run the regression.

Bank regression after removing Home Value:
Adjusted R² improves slightly, and all X variables are significant.

Multicollinearity occurs when there are strong correlations among the independent variables, so that they can predict each other better than the dependent variable.
◦ When significant multicollinearity is present, it becomes difficult to isolate the effect of one independent variable on the dependent variable, the signs of coefficients may be the opposite of what they should be (making the regression coefficients difficult to interpret), and p-values can be inflated.

Correlations exceeding ±0.7 may indicate multicollinearity. The variance inflation factor (VIF) is a better indicator, but it is not computed in Excel.
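Since Excel does not compute VIFs, they can be obtained with statsmodels; a sketch on a synthetic stand-in data set with hypothetical column names, constructed so that two of the variables are strongly correlated:

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Synthetic independent variables; Wealth is built from Income on purpose
# so that the two are strongly correlated.
rng = np.random.default_rng(7)
n = 100
income = rng.normal(60, 15, n)
X = pd.DataFrame({
    "Age":    rng.normal(45, 10, n),
    "Income": income,
    "Wealth": 5 * income + rng.normal(0, 10, n),
})

# Pairwise correlations: values beyond +/-0.7 hint at multicollinearity.
print(X.corr().round(2))

# Variance inflation factors; a common rule of thumb treats VIFs above
# roughly 5 to 10 as a sign of problematic multicollinearity.
X_const = sm.add_constant(X)
for i, name in enumerate(X_const.columns):
    if name != "const":
        print(name, round(variance_inflation_factor(X_const.values, i), 2))
```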
Colleges and Universities correlation matrix: none of the correlations exceed the recommended threshold of ±0.7.

Banking Data correlation matrix: large correlations exist. If we remove Wealth from the model, the adjusted R² drops to 0.9201, but we discover that Education is no longer significant. Dropping Education and leaving only Age and Income in the model results in an adjusted R² of 0.9202. However, if we remove Income from the model instead of Wealth, the adjusted R² drops only to 0.9345, and all remaining variables (Age, Education, and Wealth) are significant.

Identifying the best regression model often requires experimentation and trial and error. The independent variables selected should make sense in attempting to explain the dependent variable.
◦ Logic should guide your model development. In many applications, behavioral, economic, or physical theory might suggest that certain variables should belong in a model.
Additional variables increase R² and, therefore, help to explain a larger proportion of the variation.
◦ Even though a variable with a large p-value is not statistically significant, its apparent insignificance could simply be the result of sampling error, and a modeler might wish to keep it.
Good models are as simple as possible (the principle of parsimony).

Regression analysis requires numerical data. Categorical data can be included as independent variables, but it must be coded numerically using dummy variables. For a variable with two categories, code one category as 0 and the other as 1.

Employee Salaries provides data for 35 employees.
Predict Salary using Age and MBA (coded as yes = 1, no = 0):
Salary = 893.59 + 1044.15 × Age + 14,767.23 × MBA
◦ If MBA = 0: Salary = 893.59 + 1044.15 × Age
◦ If MBA = 1: Salary = 15,660.82 + 1044.15 × Age
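Dummy coding and prediction with the fitted equation are straightforward to reproduce; a small sketch with placeholder rows (the coefficients are the ones reported above):

```python
import pandas as pd

# Placeholder rows standing in for the Employee Salaries data.
employees = pd.DataFrame({
    "Age": [29, 34, 41, 38],
    "MBA": ["yes", "no", "yes", "no"],
})

# Dummy-code the two-category variable: yes = 1, no = 0.
employees["MBA"] = employees["MBA"].map({"yes": 1, "no": 0})

# Predicted salary from the fitted model reported above.
employees["PredictedSalary"] = (893.59
                                + 1044.15 * employees["Age"]
                                + 14767.23 * employees["MBA"])
print(employees)
```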