Do-file - Data Management and Cleaning 2 (Quan Ly Va Lam Sach Du Lieu 2)

This document details the process of cleaning and preparing education data for analysis. It describes importing raw data from files, identifying and cleaning variables, adding new derived variables, handling missing data, merging additional files, and saving the cleaned dataset. The cleaning involves renaming, recoding, labeling and transforming various variables to make the data more usable and consistent.

Uploaded by

trannhi2806tg

data_cleaningmanagement.p2 - Printed on 25-Sep-23 1:59:50 AM


***************Cleaning Management.P2********************************
**********************************************************************
* clear, directory
**********************************************************************


*clear
clear all
set more off
macro drop _all

*directory
cd "/Users/..../data_management"

// globals
global raw "data/raw"
global clean "data/clean"
global analysis "data/analysis"
global logs "logs"


**********************************************************************
* open a log file
**********************************************************************

*open log file
*log using "$logs/load_clean_ipeds", replace


**********************************************************************
* input raw dataset
**********************************************************************

*load data from ipeds
import delimited using "$raw/us_data.csv"

*take a look at the data
browse


**********************************************************************
* identify and clean up the variables i want to work with
**********************************************************************

*rename variables
rename sectorofinstitutionhd2015 sector
rename carnegieclassification2015underg classification
rename fulltimeundergraduateenrollmentd ugrad_enrl_ft
rename undergraduateenrollmentdrvef2015 ugrad_enrl
rename percentadmittedtotaldrvadm2015 pct_admitted

*change the order
order unitid institutionname sector classification ugrad_enrl ugrad_enrl_ft pct_admitted


**********************************************************************
* take a quick look at sector
**********************************************************************

*cross-tab. levels of sector
tab sector
tab sector, m // it's a good habit to ALWAYS be thinking about missing values. here there are none.


* ipeds also included a reference that explains what the values of sector mean

/*
Sector of institution (HD2015) 1 Public, 4-year or above
Sector of institution (HD2015) 2 Private not-for-profit, 4-year or above
Sector of institution (HD2015) 3 Private for-profit, 4-year or above
Sector of institution (HD2015) 4 Public, 2-year
Sector of institution (HD2015) 5 Private not-for-profit, 2-year
Sector of institution (HD2015) 6 Private for-profit, 2-year
Sector of institution (HD2015) 7 Public, less-than 2-year
Sector of institution (HD2015) 8 Private not-for-profit, less-than 2-year
Sector of institution (HD2015) 9 Private for-profit, less-than 2-year
*/


**********************************************************************
* what we want to derive from the sector variable
* 1. a way to identify 4-year or not 4-year
* 2. a way to identify public
**********************************************************************


*identify four-year colleges
gen four_year = 0
replace four_year = 1 if sector == 1
replace four_year = 1 if sector == 2
replace four_year = 1 if sector == 3


*identify public colleges (more elegant)
gen public = 0
replace public = 1 if inlist(sector, 1, 4)

***** most elegant approach *****
drop four_year public

*identify four-year colleges
gen four_year = inrange(sector, 1,3) // all values outside range, including missing, become zero.
// in this dataset, there are no missing values for sector

*identify public colleges
gen public = inlist(sector, 1,4) // all values other than 1 and 4, including missing, become zero.
// in this dataset, there are no missing values for sector

*cross-tab
tab sector four_year
tab sector public

**cross-tab. before, we had 1284 Private not-for-profit, 4-year or above (category 2)
tab public four_year

*drop the for-profit colleges.
drop if sector == 3

**cross-tab. do i get 1284 for not public, four-year?
tab public four_year // yes
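
*optional sketch (not in the original do-file): let stata make the check for us.
*1284 is the private not-for-profit, 4-year count noted above; adjust it if your data differ.
count if public == 0 & four_year == 1
assert r(N) == 1284 // stops the do-file if the count is not what we expect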


****************************************************************************************
* label values of variables
* 1. four-year
* 2. public
****************************************************************************************



*label four-year
label define four_year_label 0 "2-Year" 1 "4-Year"
label values four_year four_year_label


*did it work?
tab four_year


*label public
label define public 0 "Private" 1 "Public"
label values public public


*did it work?
tab public


*************************************************************************************
* inspect pct_admitted variable
*************************************************************************************

*summarize
su pct_admitted, d

*look at "obs." what does this tell you?
di _N

*generate indicator variable to investigate missing patterns more carefully
gen miss_admit = mi(pct_admitted)

*cross-tab and compute the percentage of miss_admit by liberal arts
*note: liberal_arts is not created anywhere in this do-file; it is assumed to already be in the data
tab miss_admit liberal_arts, col // 91.5% non-missing for lib arts.
// 51% non-missing for non-liberal arts.

*how about institutions with "community college" in their name?
su pct_admitted if regexm(institutionname, "Community College") // only 5 report data!


* look at missing rate using the more granular classification system
tab classification miss_admit , m
tab classification miss_admit , row

*based on the description, it seems that most community colleges will be in 1-4.
tab classification if regexm(institutionname, "Community College")


*********************************************************************
* compute % of students who are full-time
*********************************************************************

*generate new variable (a proportion between 0 and 1)
gen ft_pct = ugrad_enrl_ft / ugrad_enrl

*summary stats
su ft_pct, d

*histogram (ft_pct is continuous, so use the default bins)
hist ft_pct


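*optional sketch (not in the original do-file): sanity-check the new variable.
*ft_pct is a proportion, so values outside 0-1 would point to a data problem.
*missing values are excluded because inrange() returns 0 for them.
count if !inrange(ft_pct, 0, 1) & !mi(ft_pct)
list unitid institutionname ugrad_enrl_ft ugrad_enrl if !inrange(ft_pct, 0, 1) & !mi(ft_pct)
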

*********************************************************************
* re-order variables
*********************************************************************

*maybe we prefer this order
order unitid institutionname ugrad_enrl_ft ft_pct four_year public pct_admitted liberal_arts ugrad_enrl

* move liberal arts to the spot after institutionname
order liberal_arts, after(institutionname)



save "$clean/us_college_data.dta", replace



*********************************************************************
* now, we decide we want to include more data about these colleges.
* perhaps we want to know more about variability by region.
* go back to ipeds, repeat the same steps for identifying the sample, and
* get a count of the number of students enrolling in each college by "home state."
* this time, for some reason, the data are in an excel spreadsheet, not a .csv file.
*********************************************************************

*clear the clean dataset from stata
clear

*load the new dataset
*without the firstrow option, stata names the variables A, B, C, ... and treats the header row as data
import excel using "$raw/us_data_stud_state.xlsx"



*clear it out and try again
clear

*tell stata that the first row has variable names
import excel using "$raw/us_data_stud_state.xlsx", firstrow



*get rid of the "EF2014C_RV" prefix. keep the rest of the variable name
rename EF2014C_RV* *

* prepare data for merge
* the key variable is the identifier--in this case, unitid
* 2 things to note
* 1. the VARIABLE NAME must match EXACTLY.
* 2. the VARIABLE CONTENTS must match EXACTLY.

*the other dataset, the "master" dataset, has a variable called "unitid"
*this dataset, the "using" dataset, has a variable called UnitID
*must rename!

*rename
rename UnitID unitid

*for a merge like this, we can't have duplicates on the merging variable (unitid)
*let's double-check
duplicates report unitid
*duplicates report State

*a manual way to ask stata to complain about uniqueness

*bys State: assert (_n==1) // expected to fail: several colleges can share a state
bys unitid: assert (_n==1)
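
*another option (a sketch, not in the original do-file): isid exits with an error
*if unitid does not uniquely identify the observations
isid unitid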

*save the "using" dataset
save "$clean/college_stud_state.dta", replace


********************************************************************************
* load the master dataset and merge in the new data
********************************************************************************

*master
use "$clean/us_college_data.dta", clear // one-step approach for clearing and loading

*merge
merge 1:1 unitid using "$clean/college_stud_state.dta"

*keep only the colleges that appear in both datasets (drops master-only and using-only rows)
drop if _merge != 3

*get rid of the auto-generated _merge variable
drop _merge

*drop InstitutionName. it was in both datasets with different variable names
drop InstitutionName

**********************************************************************
* Loop
* 1. change the crazy looking variable labels for the state variables to something more sensible
* using a loop
* 2. save the new file to "analysis" folder, replacing the old version of the analysis file
**********************************************************************

*change the variable labels for the state variables

label var Alabama "Alabama Enrollees"
label var Alaska "Alaska Enrollees"


*type "help foreach" into the command window
*use the loop structure below to relabel ALL the state variables following the pattern above.
*you'll only have to put one line of code into the loop


*loop. note that we can take advantage of the order of the variables using "-"
foreach x of varlist Arizona-Wyoming {
    label var `x' "`x' Enrollees"
}

*** make the variable names and contents lowercase
*rename variables
rename Alabama-Wyoming, lower


*replace values of state with lower-case values
replace State = lower(State)


***get rid of spaces in the state variable
*note that State still has spaces, but variable names do not (and cannot)
replace State = subinstr(State, " ", "", .)
tab State



***note that missing is equivalent to zero for alabama-wyoming
*this seems like a good thing to fix with a loop.
*before going for the loop, it's helpful to do a few cases without looping
replace alabama = 0 if alabama == .
replace alaska = 0 if alaska == .



foreach s of varlist arizona-wyoming {
    replace `s' = 0 if `s' == .
}

***break down hard problems into easier problems

*alabama
gen out_of_state = .
*replace out_of_state = alaska + arizona + ... // hmmm, seems hard
drop out_of_state


gen in_state = . // maybe this will be easier. then we could subtract in-state from "UStotal" to get out of state.
replace in_state = alabama if State == "alabama"
replace in_state = alaska if State == "alaska"
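
*a possible loop for the remaining states (a sketch, not in the original do-file).
*assumes every variable from arizona to wyoming is a state count whose name matches
*the cleaned values of State, and that the total is stored in a variable named UStotal.
foreach s of varlist arizona-wyoming {
    replace in_state = `s' if State == "`s'"
}

*out-of-state enrollees = total minus in-state (UStotal is an assumed variable name)
gen out_of_state = UStotal - in_state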

*log close
