Data analysis using Stata
Before starting to work with Stata, ensure you have the data that you want to work with,
For example, the Stata folder has an Excel file named: Data on GPA, TUCE, PSI and
To start Stata, open the Stata folder provided, then double-click the application.
This will open the Stata interface, and you will notice that Stata has four windows as
follows:
To start the process of data analysis, click File > Log > Begin. Then Stata will ask you
Now minimize the Stata application and open the Excel file containing the data on gpa, tuce,
psi and grade. Copy the data from Excel (you can close the Excel file after copying), then
maximize the Stata application. In the command window, type edit and press Enter. This opens
the Stata data editor, which looks like a spreadsheet. Paste your data into the cell highlighted in
blue (the cell at the top left of the spreadsheet), then close the data editor. The results window
reports: “6 variables and 32 observations have been pasted into the data editor.” The review
window keeps a history of all the commands you have run, which is useful for replication.
Finally, the variables window lists the variables that you are working with. Having pasted the
data into the data editor, you are now ready to begin the process of data analysis.
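The menu steps for starting a log can also be typed as commands. The following is only a minimal sketch: the file name analysis_session.log is a hypothetical choice for illustration and does not come from the original handout.

```stata
* Start recording commands and results in a plain-text log file.
* "analysis_session.log" is a hypothetical file name; use any name you like.
log using analysis_session.log, replace text

* List the variables currently in memory (for example, after pasting data)
describe

* Stop recording when the session is finished
log close
```

Typing the commands has the advantage that the same steps can be saved in a do-file and rerun later.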
However, the makers of Stata also ship some example data sets with the program, to aid in
teaching and training. Therefore, instead of using our data on gpa, tuce, psi and grade, it is
preferable to use the data that are already installed. To remove the data we have just entered,
type clear in the command window, then press Enter. Whenever
you type a command in the command window, you always have to press ENTER so as to
One of the best-known example data sets shipped with Stata is the 1978 Automobile
Data, which records various automobiles and their characteristics as of 1978. To load the
1978 automobile data, type sysuse auto in the command window, then press Enter (use auto also
works if the file auto.dta is in the current working directory). Now check the
variables window. You will see that the variables are: make, price, mpg, rep78, headroom, trunk,
weight, length, turn, displacement, gear_ratio and foreign. Thus, we have 12 variables.
To view the data, type browse in the command window then press enter. You will be able to see
To describe the data, type describe in the command window then press Enter. You will
see a description of all your variables in the results window (the dark screen). Make is the make
and model of the car, price is the price of the car, mpg is mileage in miles per gallon, rep78 is the
repair record as of 1978, headroom is headroom in inches, trunk is trunk space in cubic feet,
weight is weight in pounds, length is length in inches, turn is the turn circle in feet, displacement
is displacement in cubic inches, gear_ratio is the gear ratio and, finally, foreign is a dummy or
indicator variable for car type
From the output, we notice that the storage types differ: some variables are string variables (str),
others are integer variables (int or byte), while others are float variables. A string variable holds
text rather than numbers. Thus make is str18, which means that it can hold names of up to 18
characters. Price, mpg, rep78, trunk, weight, length, turn and displacement
are int, thus they are integers. Headroom and gear_ratio are float variables, which means that
their values can have decimal points. Foreign is a byte, a compact integer type that is well suited
to a dummy or indicator variable.
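The loading and describing steps above can be sketched as a short sequence of commands. This is an illustrative sketch, not part of the original handout; the clear option and the tabulate step are additions.

```stata
* Load the example automobile data shipped with Stata,
* replacing whatever data are currently in memory
sysuse auto, clear

* Show storage type, display format and label for every variable
describe

* Describe only a few variables of interest
describe make headroom foreign

* Tabulate the indicator variable to see its two categories
tabulate foreign
```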
To get summary statistics for the data, type summarize in the command window, then press
Enter. The summary statistics show the number of observations, the mean, the standard
deviation, and the minimum and maximum values. You can copy these statistics from Stata and
paste them into your Word project for interpretation of the results (you could review what you learnt in
[Table: summary statistics for the 1978 automobile data (output of summarize)]
Source: Author
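As a hedged illustration of the summarize command, the sketch below requests the statistics for all variables, then for a subset, and finally reads back two of the stored results. The use of r(mean) and r(sd) is an addition not covered in the original text.

```stata
* Load the example data
sysuse auto, clear

* Summary statistics for every variable
summarize

* Summary statistics for selected variables only
summarize price mpg

* After summarizing a single variable, Stata keeps the results in r()
summarize price
display "Mean price     = " r(mean)
display "Std. deviation = " r(sd)
```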
For more detailed summary statistics, type summarize, detail in the
command window, then press Enter. If you want summary statistics for only one variable with
Stata also allows the user to generate new variables from the data set provided. Thus, we can
- To create the square of a variable (say mpg), the command is: generate squarempg =
- To create the square root of a variable (say price), the command is: generate sqrootprice
- To create the natural logarithm of a variable (say headroom), the command is: generate
- To create the reciprocal of a variable (say mpg), the command is: generate
- In order to see your new variables, type browse in the command window, then enter.
Notice that the spreadsheet now contains the new variables and even in the variables
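The transformations in the list above can be sketched as follows. The names squarempg and sqrootprice come from the text; lnheadroom and recipmpg are hypothetical names chosen for illustration, and the right-hand sides are the standard Stata expressions for each transformation rather than a quotation of the original commands.

```stata
* Load the example data
sysuse auto, clear

* Square of mpg
generate squarempg = mpg^2

* Square root of price
generate sqrootprice = sqrt(price)

* Natural logarithm of headroom (lnheadroom is a hypothetical name)
generate lnheadroom = ln(headroom)

* Reciprocal of mpg (recipmpg is a hypothetical name)
generate recipmpg = 1/mpg

* Inspect the first five observations of the original and new variables
list mpg squarempg recipmpg price sqrootprice headroom lnheadroom in 1/5
```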
Stata can also produce graphics. These include: scatter plots, line graphs, bar graphs, pie
- To create a scatter plot of price against mpg, the command is: scatter price mpg then
press Enter
Figure 1: Scatter plot between price and mpg
Source: Author
- To create a line graph of price against mpg, the command is: line price mpg then press Enter
- To create a bar graph of price and mpg (Stata plots the mean of each variable by default), the
command is: graph bar price mpg then press Enter
- To create a pie chart of price and mpg (each slice is the sum of one variable), the command is:
graph pie price mpg then press Enter
- Repeat the above procedure but now using many variables rather than only two variables.
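The graph commands above can be collected into one short sketch. The sort option, the over(foreign) grouping and the graph matrix line are additions for illustration and are not part of the original instructions.

```stata
* Load the example data
sysuse auto, clear

* Scatter plot of price (vertical axis) against mpg (horizontal axis)
scatter price mpg

* Line graph; sort orders the points by mpg so the line does not zigzag
line price mpg, sort

* Bar graph: one bar per variable, showing its mean by default
graph bar price mpg

* Bar graph of mean price, split by the foreign indicator
graph bar (mean) price, over(foreign)

* Pie chart: one slice per variable, showing its sum
graph pie price mpg

* Several variables at once: a matrix of pairwise scatter plots
graph matrix price mpg weight length
```

Each graph command replaces the previous graph window, so run them one at a time when exploring interactively.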
With Stata, you can also perform correlation and regression analysis. For example, to correlate
price and mpg, type correlate price mpg in the command window then press Enter. We notice that
the correlation coefficient between price and mpg is -0.4686, a moderate negative correlation
between price and mpg. Also, try: correlate price mpg rep78 weight length foreign then press Enter.
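A minimal sketch of the two correlation commands follows. The count and pwcorr lines are additions: correlate drops any observation with a missing value in any listed variable, which is one way to see why the observation count can differ between the two-variable and six-variable runs.

```stata
* Load the example data
sysuse auto, clear

* Correlation between price and mpg
correlate price mpg

* Correlation matrix for several variables; observations with a
* missing value in any listed variable are dropped
correlate price mpg rep78 weight length foreign

* Count the observations where rep78 is missing
count if missing(rep78)

* Pairwise correlations, reporting the number of observations used
* for each pair
pwcorr price mpg rep78, obs
```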
Stata also performs regression analysis, which estimates the effect of independent variables on
the dependent variable. The command is regress, followed by the dependent
variable, then the list of independent variables. For example, type the command
(obs=74)

             |    price      mpg
-------------+------------------
       price |   1.0000
         mpg |  -0.4686   1.0000

(obs=69)

[Correlation matrix for price, mpg, rep78, weight, length and foreign; surviving entries: price with price 1.0000, mpg with price -0.4559, mpg with mpg 1.0000]

[Regression output for price on mpg, rep78, weight, length and foreign]

[Variance-covariance matrix; surviving entry: mpg with mpg 5698.6301]
Having obtained the regression results, we may also wish to obtain the variance-covariance
matrix of the estimated coefficients. To get it, type vce in the command window then press Enter
(in Stata 9 and later, the command is estat vce). The variance-covariance matrix
derives its name from the fact that the elements along the main diagonal are VARIANCES
whereas the elements off the main diagonal are COVARIANCES.
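The sketch below fits a regression and then displays the variance-covariance matrix. It assumes the model is price regressed on mpg, rep78, weight, length and foreign, which is inferred from the variables named in the interpretation of the results; the matrix name V is a hypothetical choice.

```stata
* Load the example data and fit the regression
sysuse auto, clear
regress price mpg rep78 weight length foreign

* Variance-covariance matrix of the estimated coefficients
estat vce

* Copy the matrix and compare the square root of the first diagonal
* element (the variance of the mpg coefficient, since mpg is the first
* regressor) with the reported standard error
matrix V = e(V)
display "Variance of the mpg coefficient = " V[1,1]
display "Square root of that variance    = " sqrt(V[1,1])
display "Reported standard error of mpg  = " _se[mpg]
```

The last two lines should agree, since a standard error is the square root of the corresponding diagonal element.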
The first or top part of the regression output is the ANOVA table. The ANOVA table
shows SOURCE (model, residual and total); SUM OF SQUARES, SS; DEGREES OF
The lower table provides the regression coefficients, the standard errors, the t statistics, the
The sum of squares for the model is 321,789,308. This is also known as the explained sum of
squares (ESS). The sum of squares for the residual is 255,007,650, otherwise known as the
residual sum of squares (RSS). The total sum of squares (TSS) is 576,796,959. Notice that: 321,789,308
The degrees of freedom for the model are 5. The formula for this is k – 1 where k is the number
of parameters being estimated (the five slope coefficients plus the constant). Hence, k – 1 = 6 – 1 = 5.
The degrees of freedom for the residual are 63. The formula for this is n – k where n is the
number of observations, and k is defined as
before. Hence, n – k = 69 – 6 = 63. The total degrees of freedom are 68. The formula for this is n
Mean square is defined as the ratio of sum of squares to degrees of freedom. That is: MS =
SS/df. The mean square for the model is therefore 321,789,308/5 = 64,357,861.7; the mean
> F = 0.0000. This means that the model as a whole is statistically significant at the 1 percent level. The lower
The goodness of fit (R squared) of the model is reported as 0.5579. R squared is the ratio of the
explained sum of squares (ESS) to the total sum of squares (TSS). Thus, R squared = ESS/TSS =
321,789,308/576,796,959 = 0.5579. Thus mpg, rep78, weight, length and foreign together explain
or account for 55.79 percent of the variation in price.
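The sums of squares, degrees of freedom and R squared discussed above can be checked from the results that regress stores in e(). This is a minimal sketch under the assumption that the model is price on mpg, rep78, weight, length and foreign; the scalar names tss and r2_hand are hypothetical.

```stata
* Load the example data and fit the regression without showing output
sysuse auto, clear
quietly regress price mpg rep78 weight length foreign

* Explained, residual and total sums of squares
scalar tss = e(mss) + e(rss)
display "ESS = " e(mss)
display "RSS = " e(rss)
display "TSS = " tss

* Degrees of freedom: k - 1 for the model, n - k for the residual
display "Observations = " e(N)
display "Model df     = " e(df_m)
display "Residual df  = " e(df_r)

* R squared computed by hand as ESS/TSS, next to the stored value
scalar r2_hand = e(mss)/tss
display "R squared by hand = " r2_hand
display "R squared stored  = " e(r2)
```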
Adjusted R squared is reported as 0.5228, which means that mpg, rep78, weight, length and
foreign explain or account for 52.28 percent of all the variations in price after adjusting for
degrees of freedom, holding other factors
The formula for adjusted R squared is: Adj R squared = 1 – (1 – R squared)*[(n – 1) / (n – k)]. Thus,
Root MSE is the root mean square error = 2011.9; it is the square root of the mean square of the
A coefficient measures how a unit change in an explanatory variable affects the
dependent variable, holding all other factors constant. For example, the coefficient of mpg is
–26.01325. This means that if the mpg of a car increases by one unit, then the price of the car will
decrease by 26.01 units, holding all other factors constant. The rest of the coefficients are
model. Standard errors are the square roots of the variances. The variances are obtained from the
variance-covariance matrix. Check whether the square roots of the values on the main
diagonal of the variance-covariance matrix give the standard errors that have been reported
The third column provides the t statistics. The t value is the ratio of coefficient to standard error.
That is, t = coefficient / std. err. For example, the t value for mpg = -26.01325 / 75.48927 = -
The next column is the probability value (P > |t|). The probability values help in determining the
significance of the coefficients. For example, if p < 0.01, the coefficient is
significant at the 1 percent level of significance; if p < 0.05, the coefficient is
significant at the 5 percent level of significance. If p > 0.10, the coefficient is not significant
at the conventional levels.
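The adjusted R squared, root MSE, t statistic and probability value described above can be reproduced by hand from the stored results. The sketch assumes the same regression of price on mpg, rep78, weight, length and foreign; the scalar names adjr2, rmse_hand, t_mpg and p_mpg are hypothetical.

```stata
* Load the example data and fit the regression without showing output
sysuse auto, clear
quietly regress price mpg rep78 weight length foreign

* Adjusted R squared: 1 - (1 - R2)*(n - 1)/(n - k), with n - k = e(df_r)
scalar adjr2 = 1 - (1 - e(r2))*(e(N) - 1)/e(df_r)
display "Adjusted R squared by hand = " adjr2
display "Adjusted R squared stored  = " e(r2_a)

* Root MSE: square root of the residual mean square, RSS/(n - k)
scalar rmse_hand = sqrt(e(rss)/e(df_r))
display "Root MSE by hand = " rmse_hand
display "Root MSE stored  = " e(rmse)

* t statistic for mpg: coefficient divided by its standard error
scalar t_mpg = _b[mpg]/_se[mpg]
display "t statistic for mpg = " t_mpg

* Two-sided probability value from the t distribution with n - k
* degrees of freedom
scalar p_mpg = 2*ttail(e(df_r), abs(t_mpg))
display "P>|t| for mpg = " p_mpg
```

Comparing p_mpg with 0.01, 0.05 and 0.10 applies the decision rule given in the text.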
Stata Resources
The following are the resources that are useful to perform data analysis using Stata: