Data Analysis Using Stata

The document provides a comprehensive guide on data analysis using STATA, detailing steps to import data, perform basic operations, and generate statistical analyses. It covers commands for data manipulation, summary statistics, correlation, regression analysis, and graphical representations. Additionally, it explains the interpretation of regression results, including coefficients, R-squared values, and significance levels.

Uploaded by

lydia
TOPIC 2: DATA ANALYSIS USING STATA

Before starting to work with Stata, make sure you have the data that you want to work with, preferably in an Excel spreadsheet. For example, the Stata folder has an Excel file named: Data on GPA, TUCE, PSI and GRADE. This data set consists of 4 variables and 32 observations.

To start Stata, click on the Stata folder provided, then double-click the application. This opens the Stata interface, which has four windows:

Review window        Results window
Variables window     Command window

To start the process of data analysis, click File > Log > Begin. Stata will then ask you to provide a name for your log file, say ANALYSIS 1.

Now minimize the Stata application and open the Excel file containing the data on gpa, tuce, psi and grade. Copy the data from Excel (you can close the Excel file after copying), then maximize the Stata application. In the Command window, type edit and press Enter. This opens the Stata Data Editor, which looks like a spreadsheet. Paste your data into the cell highlighted in blue (the cell at the top left of the Data Editor), then close the Data Editor. The Results window now reports how many variables and observations have been pasted into the Data Editor; this count should match the 4 variables and 32 observations in the spreadsheet. The Review window keeps a history of all the commands you have run, which is useful for replication. Finally, the Variables window lists the variables that you are working with. Having pasted the data into the Data Editor, you are ready to begin the process of data analysis.
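
The menu and copy-paste steps above can also be typed as commands, which makes them easier to repeat. The following is a minimal sketch, assuming Stata 12 or later (for import excel) and assuming the spreadsheet is saved in the current working directory with the variable names in its first row. The file names analysis1.smcl and gpa_data.xlsx are hypothetical and should be replaced with your own.

```stata
* Start a log so that commands and results are saved
* (analysis1.smcl is a hypothetical file name)
log using "analysis1.smcl", replace

* Read the Excel sheet, treating the first row as variable names
* (gpa_data.xlsx is an assumed file name, not the handout's actual file)
import excel using "gpa_data.xlsx", firstrow clear

* Confirm how many variables and observations were read
describe

* Close the log at the end of the session
log close
```

Typing commands rather than pasting through the Data Editor leaves a complete record of how the data entered Stata.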

The makers of Stata have also installed some example data sets to aid in teaching and training. Instead of using our data on gpa, tuce, psi and grade, we will therefore use one of the data sets that comes with Stata. To remove the data we have just entered, type clear in the Command window, then press Enter. Whenever you type a command in the Command window, you must press ENTER to execute it.

One of the best-known example data sets installed with Stata is the 1978 Automobile Data, which records various automobiles and their characteristics as of 1978. To load it, type sysuse auto in the Command window, then press Enter (plain use auto works only if the file auto.dta is in the current working directory). Now check the Variables window. You will see that the variables are: make, price, mpg, rep78, headroom, trunk, weight, length, turn, displacement, gear_ratio and foreign. Thus, we have 12 variables.

To view the data, type browse in the Command window, then press Enter. You will be able to see 12 variables and 74 observations.

To describe the data, type describe in the Command window, then press Enter. A description of all your variables appears in the Results window (the dark screen). make is the make and model of the car, price is the price of the car, mpg is mileage in miles per gallon, rep78 is the repair record as of 1978, headroom is headroom in inches, trunk is trunk space in cubic feet, weight is weight in pounds, length is length in inches, turn is the turn circle in feet, displacement is displacement in cubic inches, gear_ratio is the gear ratio and, finally, foreign is a dummy or indicator variable for car type, defined as 1 if the car is foreign and 0 if the car is domestic.

The storage type column of the output shows that some variables are string variables (str), others are integer variables (int or byte), while others are float variables. A string variable holds text rather than numbers. Thus make is str18, which means that it can hold names of up to 18 characters. price, mpg, rep78, trunk, weight, length, turn and displacement are int, so they hold whole numbers. headroom and gear_ratio are float variables, which means that their values can have decimal places. foreign is a byte, the smallest integer storage type, which is sufficient for a dummy or indicator variable that takes only the values 0 and 1.
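
The inspection steps above can be sketched as a short command sequence. This is only an illustration; the choice of variables passed to describe, codebook and tabulate is an assumption made for the example.

```stata
* Load the 1978 Automobile Data shipped with Stata
sysuse auto, clear

* Storage type, display format and label of selected variables
describe make price headroom foreign

* Range, number of missing values and frequencies for one variable
codebook rep78

* The indicator variable shown with and then without its value labels
tabulate foreign
tabulate foreign, nolabel
```

The second tabulate shows the underlying 0 and 1 codes that the byte variable foreign actually stores.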

To get summary statistics for the data, type summarize in the Command window, then press Enter. The summary statistics show the number of observations, the mean, the standard deviation, and the minimum and maximum values. You can copy these statistics from Stata and paste them into your Word project for interpretation of the results (you may wish to review what you learnt in statistics or econometrics). The results are as follows:

Table 1: Summary Statistics for 1978 Automobile Data

    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
       price |        74    6165.257    2949.496       3291      15906
         mpg |        74     21.2973    5.785503         12         41
       rep78 |        69    3.405797    .9899323          1          5
    headroom |        74    2.993243    .8459948        1.5          5
-------------+--------------------------------------------------------
       trunk |        74    13.75676    4.277404          5         23
      weight |        74    3019.459    777.1936       1760       4840
      length |        74    187.9324    22.26634        142        233
        turn |        74    39.64865    4.399354         31         51
displacement |        74    197.2973    91.83722         79        425
-------------+--------------------------------------------------------
  gear_ratio |        74    3.014865    .4562871       2.19       3.89
     foreign |        74    .2972973    .4601885          0          1

Source: Author

If you want more detailed summary statistics, type summarize, detail in the Command window, then press Enter. If you want detailed summary statistics for only one variable, say price, type: summarize price, detail.
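
A few variations on summarize may help when preparing tables. The sketch below is illustrative; the variables chosen and the use of tabstat to split the statistics by car type are assumptions for the example, not part of the original exercise.

```stata
sysuse auto, clear

* Summary statistics for selected variables only
summarize price mpg weight

* Detailed statistics (percentiles, skewness, kurtosis) for price
summarize price, detail

* summarize leaves its results in r(); for example the mean and the median
display "mean price   = " r(mean)
display "median price = " r(p50)

* Mean, standard deviation, minimum and maximum by car type
tabstat price mpg, statistics(mean sd min max) by(foreign)
```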

Stata also allows the user to generate new variables from the data set provided. Thus, we can create products, squares, square roots, logarithms, reciprocals, and so on.

- To create the product of mpg and weight, the command is: generate productmpgweight = mpg * weight then Enter.

- To create the square of a variable (say mpg), the command is: generate squarempg = mpg * mpg then Enter.

- To create the square root of a variable (say price), the command is: generate sqrootprice = price^0.5 then Enter.

- To create the natural logarithm of a variable (say headroom), the command is: generate logheadroom = ln(headroom) then Enter.

- To create the reciprocal of a variable (say mpg), the command is: generate reciprocalmpg = 1/mpg then Enter.

- To see your new variables, type browse in the Command window, then press Enter. Notice that the spreadsheet now contains the new variables, and that they are also shown in the Variables window.
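
The transformations in the list above can be run together as follows. This is a sketch that uses the same variable names as the list; the ^2 operator and the sqrt() function are equivalent alternatives to mpg * mpg and price^0.5, and the variable label is purely illustrative.

```stata
sysuse auto, clear

* The same new variables as in the list above
generate productmpgweight = mpg * weight
generate squarempg = mpg^2
generate sqrootprice = sqrt(price)
generate logheadroom = ln(headroom)
generate reciprocalmpg = 1/mpg

* Attach a descriptive label to one new variable (label text is illustrative)
label variable logheadroom "Natural log of headroom"

* Show the first five observations of some original and new variables
list mpg squarempg reciprocalmpg price sqrootprice in 1/5
```

Using list with in 1/5 is a quick alternative to browse when only a few rows need checking.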

Stata can also produce graphics. These include scatter plots, line graphs, bar graphs, pie charts, and so on.

- To create a scatter plot of price against mpg, the command is: scatter price mpg then Enter.

Figure 1: Scatter plot of price against mpg

Source: Author

- To create a line graph of price against mpg, the command is: line price mpg then Enter. Adding the sort option (line price mpg, sort) connects the points in order of mpg.

- To create a bar graph of price and mpg, the command is: graph bar price mpg then Enter. By default, each bar shows the mean of the variable.

- To create a pie chart of price and mpg, the command is: graph pie price mpg then Enter. Each slice shows that variable's share of the combined total.

- Repeat the above procedure using several variables rather than only two.
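
The graph commands above accept options that make the output easier to read. The following sketch is illustrative: the titles, the exported file name scatter_price_mpg.png and the choice of foreign and rep78 as grouping variables are assumptions for the example.

```stata
sysuse auto, clear

* Scatter plot with a title and axis titles (the text is illustrative)
scatter price mpg, title("Price against mileage") ytitle("Price") xtitle("Mileage (mpg)")

* Save the current graph to a file (hypothetical file name)
graph export "scatter_price_mpg.png", replace

* Line graph with the points connected in order of mpg
line price mpg, sort

* Bar graph of mean price for domestic and foreign cars
graph bar (mean) price, over(foreign)

* Pie chart of the number of cars in each repair-record category
graph pie, over(rep78)
```

Bar and pie charts are usually more informative when a categorical variable is given in over() than when two continuous variables are listed.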

With Stata, you can also perform correlation and regression analysis. For example, to correlate price and mpg, type correlate price mpg in the Command window, then press Enter. The correlation coefficient between price and mpg is -0.4686, which is a moderate negative correlation: cars with higher mpg tend to have lower prices. Also try: correlate price mpg rep78 weight length foreign then Enter. What can you say about the correlation coefficients given?
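
When several variables are listed, correlate keeps only the cars that have values for every listed variable. Table 1 shows that rep78 has 69 observations rather than 74, so adding rep78 reduces the sample and the price-mpg coefficient can change slightly. The sketch below contrasts this with pwcorr, which uses all observations available for each pair.

```stata
sysuse auto, clear

* correlate uses only observations that are complete on ALL listed variables
correlate price mpg rep78 weight length foreign

* pwcorr uses every observation available for each pair of variables;
* obs prints the number of observations and sig prints the p-value
pwcorr price mpg rep78 weight length foreign, obs sig

* Count the cars with a missing repair record
count if missing(rep78)
```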

Stata also performs regression analysis, which estimates the effect of the independent variables on the dependent variable. The command is regress, followed by the dependent variable and then the list of independent variables. For example, type the command regress price mpg rep78 weight length foreign then Enter.

correlate price mpg

(obs=74)

             |    price      mpg
-------------+------------------
       price |   1.0000
         mpg |  -0.4686   1.0000

correlate price mpg rep78 weight length foreign

(obs=69)

             |    price      mpg    rep78   weight   length  foreign
-------------+------------------------------------------------------
       price |   1.0000
         mpg |  -0.4559   1.0000
       rep78 |   0.0066   0.4023   1.0000
      weight |   0.5478  -0.8055  -0.4003   1.0000
      length |   0.4425  -0.8037  -0.3606   0.9478   1.0000
     foreign |  -0.0174   0.4538   0.5922  -0.6460  -0.6110   1.0000

regress price mpg rep78 weight length foreign

      Source |       SS       df       MS              Number of obs =      69
-------------+------------------------------           F(  5,    63) =   15.90
       Model |  321789308     5  64357861.7           Prob > F      =  0.0000
    Residual |  255007650    63  4047740.48           R-squared     =  0.5579
-------------+------------------------------           Adj R-squared =  0.5228
       Total |  576796959    68  8482308.22           Root MSE      =  2011.9

------------------------------------------------------------------------------
       price |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         mpg |  -26.01325   75.48927    -0.34   0.732    -176.8665      124.84
       rep78 |   244.4242    318.787     0.77   0.446    -392.6208    881.4691
      weight |   6.006738    1.03725     5.79   0.000      3.93396    8.079516
      length |  -102.2199   34.74826    -2.94   0.005    -171.6587   -32.78102
     foreign |   3303.213   813.5921     4.06   0.000     1677.379    4929.047
       _cons |   5896.438   5390.534     1.09   0.278    -4875.684    16668.56

Covariance matrix of coefficients of regress model

        e(V) |        mpg       rep78      weight      length     foreign       _cons
-------------+------------------------------------------------------------------------
         mpg |  5698.6301
       rep78 | -6545.3892   101625.14
      weight |  19.667013   .94772928   1.0758867
      length |  630.02684  -1456.3491  -28.839384   1207.4416
     foreign |   16171.29  -133572.57   209.84955   2564.5577   661932.17
       _cons | -282211.05   105230.53   1682.2409  -149140.82  -1209971.1    29057853

Having obtained the regression results, we may also wish to obtain the variance-covariance matrix for the regression model. To get it, type vce in the Command window, then press Enter (in Stata 9 and later the same matrix is produced by estat vce). The variance-covariance matrix derives its name from the fact that the elements along the main diagonal are VARIANCES of the estimated coefficients, whereas the elements off the main diagonal are COVARIANCES between pairs of coefficients.
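
As a sketch of how the matrix can be obtained in current versions of Stata, the commands below re-run the regression quietly and then display the matrix in three ways. The correlation form is an extra shown for illustration only.

```stata
sysuse auto, clear
quietly regress price mpg rep78 weight length foreign

* Variance-covariance matrix of the estimated coefficients
estat vce

* The same matrix is stored in e(V) and can be listed directly
matrix list e(V)

* The matrix expressed as correlations between the coefficients
estat vce, correlation
```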

The first or top part of the regression output is called the ANOVA table. The ANOVA table shows the SOURCE (Model, Residual and Total); the SUM OF SQUARES, SS; the DEGREES OF FREEDOM, df; and the MEAN SQUARE, MS.

The lower table provides the regression coefficients, the standard errors, the t statistics, the probability values and the confidence intervals.

The sum of squares for the model is 321,789,308. This is also known as the explained sum of squares (ESS). The sum of squares for the residual is 255,007,650, otherwise known as the residual sum of squares (RSS). The total sum of squares (TSS) is 576,796,959. Notice that 321,789,308 + 255,007,650 = 576,796,958, which equals the reported total up to rounding. Hence, ESS + RSS = TSS.

The degrees of freedom for the model are 5. The formula for this is k - 1, where k is the number of parameters being estimated (the five slope coefficients plus the constant). Hence, k - 1 = 6 - 1 = 5. The degrees of freedom for the residual are 63. The formula for this is n - k, where n is the number of observations and k is defined as before. Hence, n - k = 69 - 6 = 63. The total degrees of freedom are 68. The formula for this is n - 1. Hence, n - 1 = 69 - 1 = 68. Alternatively, 5 + 63 = 68.

Mean square is defined as the ratio of the sum of squares to the degrees of freedom. That is, MS = SS/df. The mean square for the model is therefore 321,789,308/5 = 64,357,861.7; the mean square for the residual is 255,007,650/63 = 4,047,740.48.
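
The ANOVA quantities can be recomputed from the results that regress stores in e(). The sketch below assumes the same regression as above; e(mss), e(rss), e(df_m), e(df_r) and e(N) are stored results of regress, and the values displayed should agree with the ANOVA table.

```stata
sysuse auto, clear
quietly regress price mpg rep78 weight length foreign

* Sums of squares: e(mss) is the model (explained) SS, e(rss) the residual SS
display "ESS = " %12.0f e(mss)
display "RSS = " %12.0f e(rss)
display "TSS = " %12.0f e(mss) + e(rss)

* Degrees of freedom: k - 1, n - k and n - 1
display "df model    = " e(df_m)
display "df residual = " e(df_r)
display "df total    = " e(N) - 1

* Mean squares: MS = SS/df
display "MS model    = " %12.1f e(mss)/e(df_m)
display "MS residual = " %12.2f e(rss)/e(df_r)
```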


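The model uses a total of 69 observations, because the five cars with no value for rep78 are dropped. The probability value for the model is reported as Prob > F = 0.0000. This is the p-value of the F test that all the slope coefficients are jointly zero; because it is below 0.01, the model is statistically significant at the 1 percent level. The lower the Prob value, the stronger the evidence that the model is significant.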
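
The F statistic reported next to the ANOVA table is the ratio of the model mean square to the residual mean square, and Prob > F is the upper-tail probability of that value. A minimal sketch of the calculation, using stored results of regress and Stata's Ftail() function:

```stata
sysuse auto, clear
quietly regress price mpg rep78 weight length foreign

* F statistic as the ratio of the model and residual mean squares
display "F by hand  = " (e(mss)/e(df_m)) / (e(rss)/e(df_r))
display "F reported = " e(F)

* Upper-tail probability of the F distribution, that is, Prob > F
display "Prob > F   = " Ftail(e(df_m), e(df_r), e(F))
```

The displayed probability is a very small number, which Stata rounds to 0.0000 in the regression output.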

The goodness of fit (R squared) of the model is reported as 0.5579. R squared is the ratio of the explained sum of squares (ESS) to the total sum of squares (TSS). Thus, R squared = ESS/TSS = 321,789,308/576,796,959 = 0.5579. Thus mpg, rep78, weight, length and foreign together explain or account for 55.79 percent of the variation in price.

Adjusted R squared is reported as 0.5228, which means that mpg, rep78, weight, length and foreign explain or account for 52.28 percent of the variation in price when degrees of freedom are taken into account.

The formula for adjusted R squared is: Adj R squared = 1 - (1 - R²) × [(n - 1) / (n - k)]. Thus, Adj R squared = 1 - (1 - 0.5579) × [(69 - 1) / (69 - 6)] = 0.5228.

Root MSE is the root mean square error, that is, the square root of the mean square of the residual. Thus, Root MSE = √4,047,740.48 = 2011.9.
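
The three fit measures can be recomputed from the stored results to confirm the formulas. This is a sketch under the same regression as above; e(r2), e(r2_a) and e(rmse) are the values regress itself stores.

```stata
sysuse auto, clear
quietly regress price mpg rep78 weight length foreign

* R squared = ESS / TSS
display "R-squared     = " e(mss)/(e(mss) + e(rss))

* Adjusted R squared = 1 - (1 - R2)*(n - 1)/(n - k); e(df_r) equals n - k
display "Adj R-squared = " 1 - (1 - e(r2))*(e(N) - 1)/e(df_r)

* Root MSE = square root of the residual mean square
display "Root MSE      = " sqrt(e(rss)/e(df_r))

* The values stored by regress, for comparison
display e(r2) "   " e(r2_a) "   " e(rmse)
```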

A coefficient measures how a unit change in a given explanatory variable affects the dependent variable, holding all other factors constant. For example, the coefficient of mpg is -26.01325. This means that if the mpg of a car increases by one unit (one mile per gallon), the price of the car is estimated to decrease by 26.01 units on average, holding all other factors constant. The rest of the coefficients are interpreted in a similar way.
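
Coefficients can be retrieved after regress as _b[varname] and combined into other quantities. The sketch below is illustrative: the 5-unit change in mpg and the variable name pricehat are assumptions chosen for the example.

```stata
sysuse auto, clear
quietly regress price mpg rep78 weight length foreign

* The estimated coefficient on mpg
display "b(mpg) = " _b[mpg]

* Estimated change in price for a 5-unit increase in mpg,
* with its standard error and confidence interval
lincom 5*mpg

* Fitted (predicted) prices from the model; pricehat is a hypothetical name
predict pricehat
summarize price pricehat
```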


The second column provides the standard errors (Std. Err.) for each coefficient in the regression model. A standard error is the square root of the variance of the coefficient. The variances are obtained from the variance-covariance matrix. Check that the square roots of the values on the main diagonal of the variance-covariance matrix give the standard errors that have been reported for each variable. For example, for mpg, √5698.6301 = 75.489, which matches the reported standard error of 75.48927.
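
The check suggested above can be sketched as a short loop that compares the square root of each diagonal element of e(V) with the standard error that regress stores as _se[varname]. The matrix name V and the macro names v and i are arbitrary choices for the example, and the loop relies on the variables being listed in the same order as in the regress command.

```stata
sysuse auto, clear
quietly regress price mpg rep78 weight length foreign

* Copy the variance-covariance matrix into a matrix named V
matrix V = e(V)

* Compare sqrt(variance) on the diagonal with the stored standard error
local i = 1
foreach v in mpg rep78 weight length foreign _cons {
    display "`v': " sqrt(V[`i',`i']) "   " _se[`v']
    local ++i
}
```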

The third column provides the t statistics. The t value is the ratio of the coefficient to its standard error. That is, t = coefficient / std. err. For example, the t value for mpg = -26.01325 / 75.48927 = -0.34, and so on for the remaining t values.

The next column is the probability value (P > |t|). The probability values help in determining the significance of the coefficients. For example, if p < 0.01, the coefficient is significant at the 1 percent level of significance; if p < 0.05, the coefficient is significant at the 5 percent level; and if p < 0.10, the coefficient is significant at the 10 percent level. If p > 0.10, the coefficient is not significant.
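
The t statistic and its p-value can be recomputed from the stored coefficient and standard error. The sketch below uses Stata's ttail() function for the t distribution with n - k degrees of freedom; the scalar name t_mpg and the joint test of mpg and rep78 are illustrative additions, not part of the original exercise.

```stata
sysuse auto, clear
quietly regress price mpg rep78 weight length foreign

* t statistic = coefficient / standard error
scalar t_mpg = _b[mpg]/_se[mpg]
display "t(mpg) = " t_mpg

* Two-sided p-value from the t distribution with n - k degrees of freedom
display "p(mpg) = " 2*ttail(e(df_r), abs(t_mpg))

* Joint test that the coefficients on mpg and rep78 are both zero
test mpg rep78
```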

Stata Resources

The following resources are useful for performing data analysis using Stata:

(i) Getting Started with Stata for Windows (GSW)

(ii) Stata User's Guide (U)

(iii) Stata Base Reference Manual (R)

(iv) Stata Data Management Reference Manual (D)

(v) Stata Programming Reference Manual (P)

(vi) Stata Time Series Reference Manual (TS)

(vii) Stata Quick Reference and Index (I)

(viii) Stata website: www.stata.com

(ix) Stata demonstration videos on YouTube.
