FODS Unit-3: Lecture Notes
Unit 3: DESCRIBING RELATIONSHIPS

Correlation: scatter plots - correlation coefficient for quantitative data - computational formula for the correlation coefficient. Regression: regression line - least squares regression line - standard error of estimate - interpretation of r² - multiple regression equations - regression towards the mean.

3.1 CORRELATION

Correlation measures the relationship between two variables. The relationship between the computer skills and the GPA of a student is an example of correlation. Correlation is a statistical measure (expressed as a number) that describes the size and direction of a relationship between two or more variables. The original data must consist of actual pairs of observations. Two variables are related if pairs of scores show an orderliness that can be depicted graphically with a scatterplot and numerically with a correlation coefficient.

Types of Correlation

• Positive correlation occurs if the pairs of scores tend to occupy similar relative positions (high with high and low with low) in their respective distributions. In other words, it is a relationship between two variables in which both variables move in the same direction: when one variable increases the other increases, and when one variable decreases the other decreases.
Examples:
• (Height, Weight)
• (Temperature, Ice Cream Sales)

Fig. 3.1: Positive correlation

• Negative correlation occurs if the pairs of scores tend to occupy dissimilar relative positions (high with low and vice versa) in their respective distributions. In other words, it is a relationship between two variables in which an increase in one variable is associated with a decrease in the other.
Examples:
• (Exercise, Body Fat)
• (Watching movies, Exam scores)

Fig. 3.2: Negative correlation (weight versus kilometers run per week)

We can conclude from the above graph that as the number of kilometers run each week increases, a person's weight decreases.

• Little or no correlation occurs if no regularity is apparent among the pairs of scores, i.e., there is no relationship between the two variables.
Examples:
• (Shoe size, Movies watched)
• (Coke consumed, Intelligence)

Fig. 3.3: Scatter plot of diameter versus height, showing little or no correlation

The value of a correlation coefficient lies between -1 and +1, where:
• -1 indicates a perfectly negative linear correlation between two variables,
• 0 indicates no linear correlation between two variables,
• +1 indicates a perfectly positive linear correlation between two variables.

Difference between Positive Correlation and Negative Correlation

• When there is a positive correlation (r > 0) between two random variables, one variable moves proportionally to the other: if one variable increases the other increases, and if one variable decreases the other decreases too.
• When there is a negative correlation (r < 0) between two random variables, the variables move in opposition to each other: if one variable increases the other decreases, and vice versa.
• A line approximating a positive correlation has a positive gradient, and a line approximating a negative correlation has a negative gradient.

These two cases are checked computationally in the sketch that follows.
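The following is a minimal sketch (made-up NumPy data, not taken from the text) that builds one positively and one negatively related pair of variables and inspects the sign of r:

```python
import numpy as np

rng = np.random.default_rng(0)

# Positive correlation: weight tends to rise with height (synthetic data).
height = rng.normal(170, 10, 200)                   # cm
weight = 0.9 * height - 90 + rng.normal(0, 5, 200)  # kg, noisy linear trend

# Negative correlation: weight tends to fall as weekly running distance rises.
km_run = rng.uniform(0, 50, 200)
weight2 = 80 - 0.4 * km_run + rng.normal(0, 3, 200)

print(np.corrcoef(height, weight)[0, 1])   # near +1 -> positive correlation
print(np.corrcoef(km_run, weight2)[0, 1])  # near -1 -> negative correlation
```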
Correlation versus Causation

In a statistical context, correlation refers to the relationship of two or more variables: if the value of one variable increases or decreases, so does the value of the other variable (although it may be in the opposite direction).

Causation indicates that one event is the result of the occurrence of the other event; i.e., there is a causal relationship between the two events. This is also referred to as cause and effect. Theoretically, the difference between the two types of relationships is easy to identify.

Example: An action or occurrence can cause another.
• Smoking causes an increase in the risk of developing lung cancer.

In practice, however, it remains difficult to clearly establish cause and effect compared with establishing correlation.

Need for correlation

(a) Prediction: Correlation can be used to make predictions. If two variables correlated in the past, then they will continue to correlate in the future. The value of one variable that is known now can be used to predict the value that the other variable will take on in the future. For example, entrance test scores correlate with GPA: the higher the entrance test scores, the higher the GPA.

(b) Validity: Consider a new test of intelligence. Whether it really measures intelligence can be determined by correlating the new test's scores with the scores that the same people obtained on standardized IQ tests, or their scores on problem-solving ability tests, or their performance on learning tasks, etc. This is a process for validating the new test of intelligence, and it is based on correlation.

(c) Reliability: Correlations can be used to determine the reliability of some measurement process. For example, a new IQ test can be administered on two different occasions to the same group of people to check what the correlation is. If the correlation is high, the test is reliable; if it is low, it is not.

(d) Theory verification: Many psychological theories make specific predictions about the relationship between two variables. For example, it is predicted that parents' and children's intelligence are positively related. This can be tested by administering IQ tests to the parents and their children, and measuring the correlation between the two scores.

Types of correlation coefficients

In statistics, there are different types of correlation: Pearson correlation, Kendall correlation, Spearman correlation, Point-Biserial correlation, Cramér's V correlation, etc.

(a) Pearson correlation (r) is the most widely used correlation statistic to measure the degree of the relationship between linearly related variables. For example, in the stock market, if the relationship between how two stocks move relative to each other is to be measured, the Pearson correlation is used to measure the degree of relationship.

(b) Kendall rank correlation is a non-parametric test that measures the strength of dependence between two variables. Consider two samples, a and b, where each sample size is n; then the total number of pairings of a with b is n(n - 1)/2.

(c) Spearman rank correlation is a non-parametric test that is used to measure the degree of association between two variables. The Spearman rank correlation test does not carry any assumptions about the distribution of the data and is the appropriate correlation analysis when the variables are measured on a scale that is at least ordinal.

(d) Point-Biserial correlation is used to measure the strength and direction of the association that exists between one continuous variable and one dichotomous variable. Example: the association between cholesterol concentration and smoking status (the continuous variable is cholesterol concentration, and the dichotomous variable is smoking status, which has two categories: smoker and non-smoker).

(e) Cramér's V correlation is a measure of association between two nominal variables, giving a value between 0 and +1 (inclusive). It is based on Pearson's chi-squared statistic.
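Most of these coefficients are available in scipy.stats. The sketch below is a minimal illustration on synthetic data (the variable names and values are invented for the example, echoing the cholesterol/smoking case above):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
x = rng.normal(size=100)
y = 2 * x + rng.normal(size=100)                    # linearly related to x
smoker = rng.integers(0, 2, 100)                    # dichotomous variable
chol = 190 + 25 * smoker + rng.normal(0, 10, 100)   # continuous variable

print(stats.pearsonr(x, y))                # Pearson r: linear relationship
print(stats.spearmanr(x, y))               # Spearman rho: rank-based, at least ordinal data
print(stats.kendalltau(x, y))              # Kendall tau: rank-based dependence
print(stats.pointbiserialr(smoker, chol))  # dichotomous versus continuous variable
```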
3.2 SCATTER PLOTS

A scatterplot is a graph containing a cluster of dots that represents all pairs of scores. Scatter plots are used to observe relationships between variables. They are graphs that present the relationship between two variables in a dataset as points on a two-dimensional plane or Cartesian system.

• The parameter that is systematically incremented and/or decremented by the other is called the control parameter or independent variable and is customarily plotted along the horizontal axis (X axis).
• The measured or dependent variable is customarily plotted along the vertical axis (Y axis). If no dependent variable exists, either type of variable can be plotted on either axis; in such a case the scatter plot will illustrate only the degree of correlation (not causation) between two variables.

Scatter plots are also called a scatter graph, scatter chart or scatter diagram. The scatter plot is one of the seven basic tools of quality control. A scatter plot can suggest various kinds of correlations between variables, with a certain confidence interval.

In statistics, a confidence interval is a range of estimates for an unknown parameter. Intervals are computed at a chosen confidence level; the 95% confidence level is most common, but other levels, such as 90% or 99%, are sometimes used. The confidence level represents the proportion of corresponding confidence intervals that contain the true value of the parameter. For example, out of all intervals computed at the 95% level, 95% of them should contain the parameter's true value. Factors affecting the width of the confidence interval include:
• the confidence level,
• the sample size,
• the variability in the sample.

With all other parameters being the same:
• a larger sample produces a narrower confidence interval,
• greater variability in the sample produces a wider confidence interval,
• a higher confidence level demands a wider confidence interval.

A scatter plot reveals:
• Form: linear or nonlinear.
• Direction: positive, negative or no correlation.
• Strength of the association: strong, moderately strong, or weak.
• Presence of any outliers: data points that are unusually far away from the general pattern.

A minimal plotting sketch is given below.
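The following sketch draws such a scatter plot with matplotlib; the study-hours data are invented for illustration, with the independent variable on the X axis and the dependent variable on the Y axis:

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(2)
hours = rng.uniform(0, 10, 50)                  # independent variable (X axis)
score = 35 + 5 * hours + rng.normal(0, 8, 50)   # dependent variable (Y axis)

plt.scatter(hours, score)
plt.xlabel("Hours studied (independent variable)")
plt.ylabel("Exam score (dependent variable)")
plt.title("Scatter plot: inspect form, direction, strength, outliers")
plt.show()
```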
Types of Relationship/Correlation

• Positive relationship/correlation: A positive relationship exists when one variable increases while the other increases, and one variable decreases as the other decreases.
Example: the impact of height on the weight of a person.

• Negative relationship/correlation: A negative relationship exists when one variable increases while the other decreases.
Example: the impact of smoking on life expectancy.

Fig. 3.5: Negative relationship in a scatterplot (heavy smoking versus life expectancy in years)

• Little or no relationship/correlation means there is no relationship between the two variables.
Example: the impact of the height of a person on life expectancy.

Fig. 3.6: No relationship in a scatterplot (height in inches versus life expectancy in years)

Linear Relationship (or linear association) is a statistical term used to describe a straight-line relationship between two variables. Linear relationships can be expressed either in a graphical format, where the variable and the constant are connected via a straight line, or in a mathematical format, where the independent variable is multiplied by the slope coefficient and added to a constant, which determines the dependent variable.

The necessary criteria for a linear relationship:
• A linear relationship cannot consist of more than two variables.
• All of the variables in the equation must be to the first power.
• The equation must graph as a straight line.

Fig. 3.7: Perfect positive correlation
Fig. 3.8: Perfect negative correlation

Mathematically, a linear relationship is one that satisfies the equation

    y = mx + b    ... (3.1)

where
• m is the slope,
• b is the y-intercept.

In this equation, x and y are two variables which are related by the parameters m and b, where

    m = (y2 - y1) / (x2 - x1)    ... (3.2)

Nonlinear Relationship: A nonlinear relationship between variables is a relationship whose scatter plot does not resemble a straight line. It could resemble a curve or not really resemble anything. An increase in one variable does not result in a proportional increase or decrease in the other variable. Some of the types of nonlinear relationship are:
• Quadratic relationship
• Cubic relationship
• Exponential relationship
• Logarithmic relationship
• Cosine relationship
• Step relationship

Any relationship that cannot be summarised by a straight line is a non-linear relationship. Non-linear models, like random forests and neural networks, can model non-linear relationships.

Curvilinear Relationship is a relationship that can be described best with a curved line. It is a type of relationship between two variables where, as one variable increases, so does the other variable, but only up to a certain point, after which, as one variable continues to increase, the other decreases. This kind of curvilinear relationship shows up as an inverted U. The other type of curvilinear relationship is one where, as one variable increases, the other decreases up to a certain point, after which both variables increase together. This gives a U-shaped curve.

Examples

Staff cheerfulness and customer satisfaction: The more cheerful a service staff member is, the higher the customer satisfaction, but only up to a certain point. When a service staff member is too cheerful, it might be perceived by customers as fake or annoying, bringing down their satisfaction level.

Job satisfaction and voice behaviour: Voice behaviour is a mechanism through which employees can help the organization adjust to the current business environment and remain innovative. The relationship of low self-protective voice behaviour and job satisfaction presents a U-shaped curve.
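Because Pearson's r captures only straight-line association, it can be near zero even for a strong curvilinear relationship such as the inverted U just described. A small simulation (synthetic data; the anxiety/score naming anticipates the example in the next figure) makes this concrete:

```python
import numpy as np

rng = np.random.default_rng(3)
anxiety = rng.uniform(0, 10, 500)
# Inverted U: scores rise with anxiety up to a point, then fall.
score = -(anxiety - 5) ** 2 + 25 + rng.normal(0, 2, 500)

# Pearson r is near 0 although the relationship is strong but curvilinear.
print(np.corrcoef(anxiety, score)[0, 1])
```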
Fig. 3.10: Curvilinear relationship (inverted-U-shaped curve) in a scatterplot, with anxiety level (low, medium, high) on the X axis

An example of a curvilinear relationship between anxiety and exam score is shown in the above scatterplot: with an increase in anxiety there is an increase in the score obtained by students; however, when anxiety goes beyond a certain point, the score starts to fall (negative correlation).

Fig. 3.12: Strength of the relationship/correlation, ranging from strong negative through moderate negative and weak negative to no correlation

Clusters

Sometimes the data points in a scatter plot form distinct groups. These groups are called clusters.

Outlier

In statistics, an outlier is a data point that differs significantly from other observations. An outlier may be due to variability in the measurement or it may indicate an experimental error.

Scatter plots often have a pattern. A data point is called an outlier if it does not fit the pattern. The isolated data point in Fig. 3.14 is the outlier. Scatterplots can help to find multiple types of outliers. Some outliers have extreme values; these outliers are distanced from other data points. But there is no special rule to indicate whether or not a point is an outlier in a scatter plot.

Fig. 3.14: Outlier

The pattern of dots on a scatterplot allows you to determine whether a relationship or correlation exists between two continuous variables. If a relationship exists, the scatterplot indicates its direction and whether it is a linear or curved relationship. Fitted line plots are a special type of scatterplot that displays the data points along with a fitted line for a simple regression model. This allows evaluating how well the model fits the data.

Scatterplots are used to:
• examine the relationship between two variables,
• check for outliers and unusual observations,
• create a time series plot with irregular time-dependent data,
• evaluate the fit of a regression model.

Correlation and regression analyses are the primary methods for statistically assessing the relationships between continuous data.

3.3 CORRELATION COEFFICIENT FOR QUANTITATIVE DATA

The correlation coefficient is a number between -1 and +1 that describes the relationship between pairs of variables. The correlation coefficient r specifies the linear relationship between pairs of quantitative data.

It is named in honor of the British scientist Karl Pearson: the Pearson correlation coefficient, r, can equal any value between -1.00 and +1.00 (0 indicating no relationship). Values of r closer to -1 or +1 indicate a stronger relationship, and values closer to 0 indicate a weaker relationship. The coefficient is affected by a variety of factors, so it is always best to also plot the two variables as a scatterplot.

A mathematical property of the Pearson correlation coefficient is that it is invariant under separate changes in location and scale of the two variables. That is, X can be transformed to a + bX and Y can be transformed to c + dY, where a, b, c and d are constants with b, d > 0, without changing the correlation coefficient. (This holds for both the population and sample Pearson correlation coefficients.)

Key Properties of r

The two properties are:
• The sign of r indicates the type of linear relationship, whether positive or negative.
• The numerical value of r, without regard to sign, indicates the strength of the linear relationship.

A number with a plus sign (or no sign) indicates a positive relationship, and a number with a minus sign indicates a negative relationship.
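The invariance property can be checked directly. In the sketch below (made-up data), X and Y are rescaled as a + bX and c + dY with b, d > 0 (for example, a change of measurement units), and r is unchanged:

```python
import numpy as np

rng = np.random.default_rng(4)
x = rng.normal(size=50)
y = 3 * x + rng.normal(size=50)

r = np.corrcoef(x, y)[0, 1]
# Transform X -> a + bX and Y -> c + dY with b, d > 0.
r_t = np.corrcoef(10 + 2.54 * x, -5 + 0.45 * y)[0, 1]

print(r, r_t)  # identical up to floating-point rounding
```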
Strength of a Relationship

The correlation coefficient measures the strength of the linear relationship between two variables. Two specific strengths are:

• Perfect relationship: A perfect relationship is a dot cluster that equals (rather than merely approximates) a straight line. In practice, perfect relationships are most unlikely. When two variables are exactly linearly related, the correlation coefficient is either +1.00 or -1.00, and they are said to be perfectly linearly related, either positively or negatively. For instance, a perfect positive correlation arises if you compare hours worked and pay received for a salesperson who charges a fixed hourly rate for their work.

• No relationship: When two variables have no relationship at all, the correlation is 0.00.

There are different types of correlation coefficients, associated with measuring a linear or a non-linear relationship. The table presents the type of relationship, levels of measurement and data distribution for the various correlation coefficients.

Table 3.1: Comparison of correlation coefficients

Coefficient     | Type of relationship | Levels of measurement                                                        | Data distribution
Pearson's rho   | Linear               | Two quantitative (interval or ratio) variables                               | Normal distribution
Spearman's rho  | Non-linear           | Two ordinal, interval or ratio variables                                     | Any distribution
Point-biserial  | Linear               | One dichotomous (binary) variable and one quantitative (interval or ratio) variable | Normal distribution
Cramér's V      | Non-linear           | Two nominal variables                                                        | Any distribution
Kendall's tau   | Non-linear           | Two ordinal, interval or ratio variables                                     | Any distribution

When interpreting the value of r, utmost care must be taken. It is possible to find correlations between many variables; however, the relationship can be caused by other factors and have nothing to do with the two variables being considered. For example, sales of ice cream and sales of sunscreen can increase and decrease across a year in a systematic manner, but that would be a relationship due to the effects of the season (i.e., hotter weather sees an increase in people wearing sunscreen as well as eating ice cream) rather than due to any direct relationship between sales of sunscreen and ice cream.

The correlation coefficient should not be used to say anything about a cause and effect relationship. By examining the value of r, it may be concluded that two variables are related, but that r value does not tell us if one variable was the cause of the change in the other.

Restricted range refers to a range of values that has been condensed, or shortened. In the field of statistics, restricting the range means to limit the data in the population to some criterion, or to use a subset of data to determine whether two sets of information are correlated or connected.

Example: The entire range of GPA scores is 0 to 10.0. A restricted range could be 6.9 to 10.0, or 8.0 to 10.0.

Restricted range affects correlation: when the range is restricted, the correlation coefficient typically goes down. In the example scatterplot, restricting the range to the right of the dividing line brings the correlation coefficient down to 0.72, while restricting the range to the left of the line would reduce it to 0.59. Although there is a very strong correlation overall (0.91), this association is weakened to 0.72 and 0.59 when the range is split into two. Therefore, when dealing with a restricted range and the correlation coefficient is small (or zero), do not draw the conclusion that there is no correlation: the results may simply be a product of this peculiar occurrence.
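The effect of a restricted range can be simulated directly. In this sketch (synthetic data, loosely following the GPA example; the specific numbers are invented), the correlation computed over a truncated slice of the X range is visibly smaller than over the full range:

```python
import numpy as np

rng = np.random.default_rng(5)
x = rng.uniform(0, 10, 1000)        # e.g. GPA on a 0-10 scale
y = x + rng.normal(0, 2, 1000)      # outcome linearly related to x, with noise

full = np.corrcoef(x, y)[0, 1]
mask = x >= 8                        # restrict the range to 8.0-10.0
restricted = np.corrcoef(x[mask], y[mask])[0, 1]

print(full, restricted)  # the restricted-range coefficient is noticeably smaller
```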
Correlation Matrix

A correlation matrix is a table which displays the correlation coefficients between variables. The matrix depicts the correlation between all the possible pairs of values in a table. It is a powerful tool to visualize patterns in the given data, summarize a large dataset and identify trends.

• Each cell in the table contains a correlation coefficient.
• A correlation matrix determines the correlation coefficients between the variables.

Fig. 3.17: Correlation matrix

                      | Hours studying | Exam score | IQ score | Hours sleeping | School rating
Hours spent studying  |      1.00      |    0.82    |   0.08   |     -0.22      |     0.36
Exam score            |      0.82      |    1.00    |   0.33   |      0.04      |     0.23
IQ score              |      0.08      |    0.33    |   1.00   |      0.06      |     0.02
Hours spent sleeping  |     -0.22      |    0.04    |   0.06   |      1.00      |      ...
School rating         |      0.36      |    0.23    |   0.02   |       ...      |     1.00

For example, the correlation between "Hours spent studying" and "Exam score" is 0.82, which indicates that they are strongly positively correlated: more hours spent studying is strongly related to higher exam scores. The correlation between "Hours spent studying" and "Hours spent sleeping" is -0.22, which indicates that they are weakly negatively correlated: more hours spent studying is associated with fewer hours spent sleeping. The correlation coefficients along the diagonal of the table are all equal to 1, because each variable is perfectly correlated with itself.

Variations of the Correlation Matrix

A correlation matrix is symmetrical about its diagonal. Hence, half of the correlation coefficients shown in the matrix are redundant, and sometimes only half of the matrix (one triangle) is displayed.

Fig. 3.18: Variation of the correlation matrix (lower triangle only)

Applications of a correlation matrix:
• summarizes a dataset,
• serves as a diagnostic for regression,
• used as an input in other analyses.

A short computational sketch follows.
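In practice, a correlation matrix is computed with a library call rather than by hand. A minimal pandas sketch (invented data; the column names merely echo the example above):

```python
import pandas as pd

df = pd.DataFrame({
    "study_hours": [2, 4, 5, 7, 8, 10],
    "exam_score":  [55, 60, 68, 74, 81, 92],
    "sleep_hours": [9, 8, 8, 7, 6, 5],
})

corr = df.corr()   # Pearson by default; method="spearman" or "kendall" also work
print(corr)        # symmetric matrix, diagonal entries all 1.0
```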
3.4 COMPUTATIONAL FORMULA FOR CORRELATION COEFFICIENT

Correlation coefficients are used to measure how strong a relationship is between two variables. The correlation coefficient r tells us how closely data in a scatterplot fall along a straight line. The closer the absolute value of r is to one, the better the data are described by a linear equation. If r = 1 or r = -1, then the data set is perfectly aligned. Data sets with values of r close to zero show little to no straight-line relationship.

The computational formula for the correlation coefficient is

    r = SPxy / sqrt(SSx * SSy)    ... (3.3)

where the two sum of squares terms in the denominator are defined as

    SSx = sum of (X - X̄)² = ΣX² - (ΣX)²/n    ... (3.4)
    SSy = sum of (Y - Ȳ)² = ΣY² - (ΣY)²/n    ... (3.5)

and the sum of the products term in the numerator, SPxy, is defined as

    SPxy = sum of (X - X̄)(Y - Ȳ) = ΣXY - (ΣX)(ΣY)/n    ... (3.6)

For each pair of scores, the product of the two deviations is formed; the sign of SPxy determines whether r is positive or negative, and its size mirrors the strength of the linear relationship.

Calculation of r using the computational formula:
(i) Assign a value to n to express the number of pairs of scores.
(ii) Sum all scores for X and for Y.
(iii) Find the product of each pair of X and Y scores, one at a time, then add all of these products.
(iv) Square each X value, one at a time, then add all squared X values.
(v) Square each Y value, one at a time, then add all squared Y values.
(vi) Substitute the numbers into the formulas and solve for SPxy, SSx and SSy.
(vii) Substitute into the formula and solve for r.

Example: Consider the following values of x and y.

   X    |    Y    |   X·Y     |    X²     |    Y²
 35.21  |  43.47  |  1,530.46 |  1,239.74 |  1,889.36
 43.01  |  52.40  |  2,253.72 |  1,849.86 |  2,745.76
 72.45  |  89.44  |  6,479.89 |  5,249.00 |  7,999.43
111.88  | 132.00  | 14,768.16 | 12,517.13 | 17,424.00
ΣX = 262.55 | ΣY = 317.31 | ΣXY = 25,032.24 | ΣX² = 20,855.74 | ΣY² = 30,058.35

Now input the values from the above table, with n = 4, into the equivalent form of the formula:

    r = [nΣXY - (ΣX)(ΣY)] / sqrt{[nΣX² - (ΣX)²][nΣY² - (ΣY)²]}

The value of the coefficient works out to 0.99932640, indicating a very strong positive correlation between the x and y variables.
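The computational formula translates directly into code. The sketch below implements equations (3.3) to (3.6) and reproduces the worked example's result:

```python
import math

def pearson_r(xs, ys):
    """Computational formula: r = SPxy / sqrt(SSx * SSy)."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    sum_y2 = sum(y * y for y in ys)
    sp_xy = sum_xy - sum_x * sum_y / n   # equation (3.6)
    ss_x = sum_x2 - sum_x ** 2 / n       # equation (3.4)
    ss_y = sum_y2 - sum_y ** 2 / n       # equation (3.5)
    return sp_xy / math.sqrt(ss_x * ss_y)

# The four (x, y) pairs from the worked example above.
x = [35.21, 43.01, 72.45, 111.88]
y = [43.47, 52.40, 89.44, 132.00]
print(pearson_r(x, y))  # approximately 0.9993
```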
3.5 REGRESSION

Regression is a statistical method to determine the relationship between one dependent variable and a series of other variables known as independent (explanatory) variables. A regression model is able to show whether changes observed in the dependent variable are associated with changes in one or more of the explanatory variables. It does this by essentially fitting a best-fit line and seeing how the data are dispersed around this line.

Applications of regression include:
• forecasting sales,
• cash forecasting,
• analyzing survey data,
• stock prediction,
• predicting the behavior of customers.

Linear regression is the most common form of this technique. It establishes the relationship between two variables based on a line of best fit and is graphically depicted using a straight line, with the slope defining how a change in one variable impacts a change in the other. The y-intercept of a linear regression relationship represents the value of the dependent variable when the independent variable is zero. Linear regression models often use a least-squares approach to determine the line of best fit, generated by a mathematical function that minimizes the distance between each data point of the data set and the line.

Examples of linear regression:
• the impact of GPA on college admission,
• the impact of rainfall amount on crop yield.

Simple linear regression is used to estimate the relationship between two quantitative variables. Simple linear regression is a parametric test, meaning that it makes certain assumptions about the data. The assumptions are:
• Homogeneity of variance (homoscedasticity): the size of the error in our prediction does not change significantly across the values of the independent variable.
• Independence of observations: the observations in the dataset were collected using statistically valid sampling methods, and there are no hidden relationships among observations.
• Normality: the data follow a normal distribution.
• Linearity: linear regression makes the additional assumption that the relationship between the independent and dependent variable is linear: the line of best fit through the data points is a straight line (rather than a curve or some sort of grouping factor).

Fig. 3.19: Linear regression of conversion rate (%) on marketing spend

Linear regression is widely utilized in behavioral and social sciences, finance and business. In the figure, a company's marketing spend amount and the achieved conversion rate in percentage points are shown in the scatter plot with the linear model.

Example: Consider the dataset shown in the accompanying table, having years of experience and salary, and the regression line for the same shown in the scatter plot. Here the dependent variable is salary and the independent variable is years of experience. If there is an increase in the dependent variable when the independent variable increases, then there is a positive correlation between them; if it decreases, there is a negative correlation between them.

The best-fit line for the data is the one which produces the least error, or least-square approximation error, among all regression lines that can be drawn. This method of finding the best-fit line is called the Least Squares Approximation Method.

Fig. 3.21: Regression line plotted using the mean

From the above plot, it is observed that the regression line is far from the data points. The whole process is an iterable one and will be continued until the best-fit line with the least square approximation distance is obtained.

Multiple regression assumes that there is not a strong correlation among the independent variables themselves, but that there is a correlation between each independent variable and the dependent variable, and that these independent variables jointly drive the value of the dependent variable.

Fig. 3.22: Multiple regression

Multiple regression should be used when multiple independent variables determine the outcome of a single dependent variable. The multiple regression formula has multiple slopes (one for each variable) and one y-intercept; it is interpreted the same as a simple linear regression formula, except that there are multiple variables which impact the slope of the relationship.

Nonlinear regression refers to a regression model in which the relationship between the dependent variable and the independent variables is modeled by a nonlinear function. The sum of squares is used to determine the fitness of a nonlinear regression model; it is computed from the differences between the data points and the fitted mean function. The nonlinear model is complex and, at the same time, creates accurate results. The analysis develops a curve depicting the relationship between variables based on the dataset provided. The model, offering great flexibility, can create a curve that best fits the scenario. This relationship can be anything from connecting time and population to investor sentiment and its nonlinear effect on stock market returns.

Applications of nonlinear regression:
(a) Nonlinear regression models are used for prediction, financial modeling and forecasting purposes.
(b) They are used in many fields and sectors like insurance, agriculture, finance, investing, machine learning/AI, and understanding broader markets.
(c) Since most biological processes are nonlinear in nature, we can find nonlinear model applications in forestry research; a simple power function to relate tree volume or weight to its diameter or height is an example.
(d) An example from the field of chemistry is the use of a nonlinear model in developing the formula of the wide-range colorless gas HCFC-22, where the model is applied to the calibration problem.
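A least squares line can be fitted in one call. The sketch below uses hypothetical experience/salary numbers (invented for illustration, not the figure's data) with np.polyfit, and also evaluates the sum of squared errors that the least squares method minimizes:

```python
import numpy as np

# Hypothetical (years of experience, salary) pairs, echoing the example above.
years = np.array([1.1, 2.0, 3.2, 4.5, 5.9, 7.0, 8.2, 10.3])
salary = np.array([39, 43, 57, 61, 68, 77, 82, 101])  # in thousands

# np.polyfit with degree 1 returns the least squares slope and intercept.
slope, intercept = np.polyfit(years, salary, 1)
predicted = slope * years + intercept
sse = np.sum((salary - predicted) ** 2)  # the quantity least squares minimizes

print(f"salary ~ {slope:.2f} * years + {intercept:.2f}, SSE = {sse:.1f}")
```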
Table 3.2: Difference between Correlation and Regression

Correlation                        | Regression
Relationship between variables     | One variable affects the other
Variables move together            | Cause and effect
X and Y can be interchanged        | X and Y cannot be interchanged
Data represented by a single point | Data represented by a line

3.6 REGRESSION LINE

A regression line shows the connection between scattered data points in any set. It reflects the relation between the dependent y variable and the independent x variable when there is a linear pattern.

A regression line is a line which is used to describe the behavior of a set of data. It displays the connection between scattered data points in any set and gives the best trend of the given data. Regression lines are useful in forecasting procedures. Their purpose is to describe the interrelation of the dependent variable (y variable) with one or many independent variables (x variable).

The equation obtained from the regression line acts as an analyst to forecast future behaviors of the dependent variable by inputting different values for the independent ones. Regression lines are used in finance to forecast commodity prices or to value securities. Businesses can also use regression lines to foresee the relationships between different variables in accounting, such as sales, expenses and inventory.

The general form of the linear regression model is given below.

Simple linear regression:    Y = a + bX + u    ... (3.7)
Multiple linear regression:  Y = a + b1X1 + b2X2 + b3X3 + ... + btXt + u    ... (3.8)

where
• Y is the dependent variable,
• X is the explanatory (independent) variable(s),
• a is the y-intercept,
• b is the beta coefficient, the slope of the explanatory variable(s),
• u is the regression residual or error term.

Fig. 3.24: Simple linear regression    Fig. 3.25: Multiple linear regression

Properties of Simple Linear Regression

For the regression line where the regression parameters b0 and b1 are defined, the properties are given as:
• The line minimizes the sum of squared differences between the observed values and the predicted values.
• The regression line passes through the mean of the X and Y variable values.
• The regression constant b0 is equal to the y-intercept of the linear regression.
• The regression coefficient b1 is the slope of the regression line, which is equal to the average change in the dependent variable (Y) for a unit change in the independent variable (X).

Nonlinear regression is a curved function of an X variable (or variables) that is used to predict a Y variable; the prediction of population growth over time is an example.

3.7 LEAST SQUARES REGRESSION LINE

The least squares regression line is the line that minimizes the total of the squared predictive errors. Its equation is

    Y' = bX + a

where Y' represents the predicted value, X represents the known value, and b and a represent numbers calculated from the original data.

3.8 STANDARD ERROR OF ESTIMATE

The standard error of estimate gives a rough measure of the average amount of predictive error: the average amount by which known Y values deviate from their predicted Y' values. It is computed as

    s(y|x) = sqrt[ sum of (Y - Y')² / (n - 2) ]

Worked example (Table 3.4): for a small set of height measurements with a sample mean of 72 inches, the computation of the squared deviations yields a standard deviation of 2.48 inches and a standard error of estimate of 0.69 inches.

Assumptions: Use of the regression equation requires that the underlying relationship be linear. Use of the standard error of estimate additionally assumes that the dots in the original scatterplot are dispersed equally about all segments of the regression line. This property is known as homoscedasticity. The assumption of homoscedasticity is that the groups being compared have similar variances. It matters because statistical tests are sensitive to this assumption and can give different results when it is violated: uneven variances result in biased and skewed test results.

Fig. 3.26: Violation of the homoscedasticity assumption
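The standard error of estimate is straightforward to compute once a least squares line is fitted. A minimal sketch on made-up data (the x/y values are invented for illustration):

```python
import numpy as np

x = np.array([1, 2, 3, 4, 5, 6, 7, 8])
y = np.array([2, 5, 6, 9, 10, 13, 15, 16])

b, a = np.polyfit(x, y, 1)     # least squares slope and intercept
y_pred = b * x + a
residuals = y - y_pred         # known Y minus predicted Y'

n = len(x)
# Standard error of estimate: a rough average size of the predictive errors.
s_est = np.sqrt(np.sum(residuals ** 2) / (n - 2))
print(s_est)
```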
3.9 INTERPRETATION OF r²

R-squared is a goodness-of-fit measure for linear regression. The squared correlation coefficient, r², is a statistical measure of the proportion of the variation in the dependent variable that can be described by the independent variable. In statistics, the coefficient of determination, denoted r², is the proportion of the variation in the dependent variable that is predictable from the independent variable. It provides a key interpretation of the correlation coefficient as a measure of predictive accuracy that supplements the standard error of estimate.

Predictive Errors

Fig. 3.27 shows the predictive errors for all five friends when the mean of their five Y scores is used to predict each of the five Y scores.

Fig. 3.27: Errors using the mean

Fig. 3.28 shows the corresponding predictive errors for all five friends when a series of different Y' values, obtained from the least squares equation (shown as the least squares line), is used to predict each of their five Y scores.

Fig. 3.28: Predictive errors for the five friends

Positive and negative errors indicate that Y scores are either above or below their corresponding predicted scores. Errors are smaller when customized predictions of Y from the least squares equation are used (because X scores are known) than when only the repetitive prediction of the mean of Y can be used (because X scores are ignored).

Limitations of r²

• r² does not apply to individual scores: do not attempt to apply the variability interpretation of r² to individual scores.
• r² does not ensure cause and effect: a high value of r² does not establish that one variable causes changes in the other.
• A high or low value of r² does not, by itself, convey the reliability of the regression model or whether the chosen model is correct: one can obtain a high r² for a poorly fitted model and a low r² for a well-fitted model.
• A high value of r² does not guarantee that the predictions are unbiased.

Example: Consider a dataset of x and y values with n = 8 pairs of observations.
(i) Tabulate the pairs together with the x·y, x² and y² columns.
(ii) Compute the necessary totals for calculating r: from the computation table, Σx = 72, Σy = 223, Σxy = 2,169, Σx² = 818 and Σy² = 6,447.
(iii) Calculate R-squared. Substituting into the formula:

    r = [nΣxy - (Σx)(Σy)] / sqrt{[nΣx² - (Σx)²][nΣy² - (Σy)²]}
      = [8(2,169) - (72)(223)] / sqrt{[8(818) - (72)²][8(6,447) - (223)²]}

    r² = 0.6686

The n in the formula represents the number of observations in the dataset, which turns out to be n = 8 in this example. Assuming x is the predictor variable and y is the response variable in the regression model, the R-squared for the model is 0.6686. This shows that 66.86% of the variation in the variable y can be explained by the variable x.

R-squared gives an estimate of the relationship between the movements of a dependent variable and the movements of an independent variable. R-squared is always between 0 and 100%:
• 0% represents a model that does not explain any of the variation in the response variable around its mean.
• 100% represents a model that explains all the variation in the response variable around its mean.
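The r² computation in this example can be verified from the summary totals alone:

```python
import math

# Summary totals recovered from the worked example above (n = 8 pairs).
n, sum_x, sum_y = 8, 72, 223
sum_xy, sum_x2, sum_y2 = 2169, 818, 6447

num = n * sum_xy - sum_x * sum_y
den = math.sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
r = num / den
print(r ** 2)  # approximately 0.6686
```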
3.10 MULTIPLE REGRESSION EQUATIONS

Multiple regression is a method to predict the dependent variable with the help of two or more independent variables. While running this analysis, the main purpose of the researcher is to find out the relationship between the dependent variable and the independent variables. To predict the dependent variable, multiple independent variables are chosen which can help in predicting it. Multiple regression is used when simple linear regression is not able to serve the purpose, and the analysis helps in validating whether the predictor variables are good enough to help in predicting the dependent variable.

The multiple regression equation explains the relationship between the dependent variable and the predictor (X) variables, given by the following equation:

    Y = m1X1 + m2X2 + m3X3 + B

where Y is the dependent variable, X1, X2, X3 are the independent variables, m1, m2, m3 are the slopes of the regression, and B is the constant value (intercept).

Example: Find out the relationship between the GPA of a class of students, the number of hours of study, and the height of the students. The dataset contains the students' GPA (the dependent variable) together with their study hours and heights in inches (the independent variables).

Table 3.5: GPA, height (inches) and study hours of the students

The regression equation for the example is of the form GPA = m1·(study hours) + m2·(height) + B; substituting the slope values and the constant computed from the data yields the fitted model for predicting GPA, as sketched below.
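A multiple regression of this form can be fitted with ordinary least squares. The sketch below uses hypothetical GPA, study-hours and height values (illustrative only, not the textbook's exact table) and NumPy's lstsq:

```python
import numpy as np

# Hypothetical data: GPA predicted from study hours and height (illustrative only).
study_hours = np.array([7, 8, 6, 5, 9, 4, 7, 8])
height = np.array([64.5, 69.5, 66.0, 63.0, 70.0, 62.5, 68.0, 67.0])
gpa = np.array([2.90, 3.45, 2.80, 2.33, 3.86, 2.35, 3.16, 3.63])

# Design matrix with a column of ones for the constant term B.
X = np.column_stack([study_hours, height, np.ones_like(height)])
coeffs, *_ = np.linalg.lstsq(X, gpa, rcond=None)
m1, m2, b = coeffs
print(f"GPA = {m1:.3f}*study_hours + {m2:.3f}*height + {b:.3f}")
```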
where the example. Regression t0 the mean expli ‘ou thought so highly of last ieee wy ite live up t Regression to the Mean seven SF ir earesion (0 the mean usually : ee tappens be sampling technique is 10 randomly sample rom i onl €M0E. A good 1 population, If 1. IF you don't fie. if You asymmetrically samy ple). the high or low for the average and therefore weeld ss ay Oe 7 saression to the te abnormal Regression 10 the mean can also happen because re back 0 the mean OU take a very smal . tunrepresentative sample, fe of Regression to the Mean If you land your plane badly (or ro jn all a de If you land beautifully (or roll ly 8 double si ine aes! Hime: your perfomance vil be dee to the mean”. ™ soungavons of Data ayaveeres ok ae Fi Fig, 32%: Regresiaa to the Sear Regression Fallacy Fallacy occurs whenever regression towards the TESA in as araheflect. cater than a chance. The regression fall: can te avoided by sng the subset of extreme observations into 060 ETOUPS group of erainees WOUKE Sontinoe Wy 4 Thar very good landings and reprimanded after ety POOT TANGMES. A sey rp of uainoss would not receive fedhack afer very good and ver) Bad Frevtect the second group woukd serve as 2 control for regression wands thet Fry hin toward the mean on their second landings would Be de 19 chang Mee imporant, any observed diflerence tetween the 40 STOUPS (that survive, Miieal analysis) would be Viewed as a real difference not attnbutable tp Pty regression effect. For instance. in a group of students, selected to attend 2 special program fy underachieving readers. might show an improvement, Whether this improvement ca, be ambuted 10 the special program or to 2 regression ffect requires informatgy from 2 cotrol group of similarly underachieving students who did not attend ty hat research with underachievers alway special program. It is crucial, therefore. control group for regression towards the mean. includes a Reference 1. Robert S. Witte and John S. Witte. “Statistics”. Eleventh Edition. Wily Publications. 2017. Part A REN ® we Ton MNASUTES THE PELAONSD bey {The relationship between ice 90 variables APUET til and GPA ‘of the st are the OPES ef Correlation? ident 1 agsitive correlation £ Negative corelation no corelation pth need fo correlation? x esiction vality peliability reo Verification s infcates thal oe eee hem of Te a tn hen ng See te the 630 © events. This is also Ds swe i a Tincor relaionship? Linear Relationship (oF linear association) speaghine eationship NetweEn M0 Satan) 4 Satta tem use wo dente "what is Nonlinear Relaonshi? nonlinear relationship between variables i spo me resemble 3 SIRE Hine. could Treamnie cation whose scarer px Sea mime ae Seer one me Sh ee Broporionl increase or fama in the ofher variable {List the types of nonlinear relationship. «Quadratic relationship © Cubic relationship ¢ Exponential relationship ‘© Logarithmic Relationship Foundations of Data Se 344 ————_——_— «Cosine Relationship ear relationship? Tee ltonship that can be deseibed best yj sa petwect 10 vaables WHETE 25 ONE vari ile but only up t0 a cera point, alter which te other decteases é 9. What is a curvilin Curvitinear Relationshl curved line, It is a type of relat increases, so does the other vari ‘one variable continues 10 increase, 10, What is an outlier? int i ier ‘The data point is called an out entree values, These outliers ae distanced from © " waned 1. What are the Key Properties of correlation coefficient F? 
The two properties are: ign of r indicates the type of linear celal af it does not fit the pattern. Outliers 4, her data. points. Ne ionship. whether postive a © The sig negative. «The numerical value ofr, without regard to sign. indicates the strenggh the linear relationship. 12. What is regression? Regression is a statistical dependent variable and a series o! variables. 13, What are the types of regression models? © Linear model © Non Linear model 14, Compare Correlation and Regression method to determine the relationship between f other variables known as independent (explanatgy, Regression Correlation ‘One affects the other variable | [Relationship between variables ‘Cause and effect X and ¥ cannot be interchanged) [Data represented by a line Variables move together IX and ¥ can be interchanged Data represented by a single point 1S, What is restricted range? Restricted range refers to a range of values that fas been condensed, oF shortenei Relat - ait ee ; mag ple: The etic ringe of Gpa oo = 5 gxamityo, oF 8.0 t0 10.0, Scores is woe 100.4 © is a regression line? Restricted 16 segcession line is a tine which A faplays the connection betwee at Ha of the given data, range could what i the Interpretation of 2p se squared cOmelation cot ion of the correlation coeff Brovides at ‘ ici SUS With inerpements the standart ero of xin? # Meu of pe Sipe i em, 2 only ah of predictive accuracy is Standard Error of Estimate? ive error ies WS a TOUH measure of i ‘measur deviate from their predicted Y yale, ire Of the aver an Tage amount of "MBE amount by which known Y yn ive the Least Squares Regression Equation Y=bX+q Where 4 Yrepresents the predicted value fo X represents the known value # cand P represent numbers calculate from the original core y, State the desirable property of least square regression? “ane 1p, Sate the Multiple Regression Equation. Ym, +mXy+ mi 46 Where ¥ is dependent variable, Xj, X3. X3 are independent variable ope of regression, «ind is the constant value ana mee
