Do-file: Data Management and Cleaning 2 (Quan Ly Va Lam Sach Du Lieu 2)

This document details the process of cleaning and preparing education data for analysis. It describes importing raw data from files, identifying and cleaning variables, adding new derived variables, handling missing data, merging additional files, and saving the cleaned dataset. The cleaning involves renaming, recoding, labeling and transforming various variables to make the data more usable and consistent.

data_cleaningmanagement.p2 - Printed on 25-Sep-23 1:59:50 AM

***************Cleaning Management.P2********************************
**********************************************************************
* clear, directory
**********************************************************************


*clear
clear all
set more off
macro drop _all

*directory
cd "/Users/..../data_management"

// globals
global raw "data/raw"
global clean "data/clean"
global analysis "data/analysis"
global logs "logs"


**********************************************************************
* open a log file
**********************************************************************

*open log file (forward slash, to match the other paths in this file)
*log using "$logs/load_clean_ipeds", replace


**********************************************************************
* input raw dataset
**********************************************************************

*load data from ipeds
import delimited using "$raw/us_data.csv"

*take a look at the data
browse


**********************************************************************
* identify and clean up the variables i want to work with
**********************************************************************

*rename variables
rename sectorofinstitutionhd2015 sector
rename carnegieclassification2015underg classification
rename fulltimeundergraduateenrollmentd ugrad_enrl_ft
rename undergraduateenrollmentdrvef2015 ugrad_enrl
rename percentadmittedtotaldrvadm2015 pct_admitted

*change the order
order unitid institutionname sector classification ugrad_enrl ugrad_enrl_ft pct_admitted


**********************************************************************
* take a quick look at sector
**********************************************************************

*tabulate the levels of sector
tab sector
tab sector, m // it's a good habit to ALWAYS be thinking about missing values. here there are none.

* ipeds also included a reference that explains what the values of sector mean

/*
Sector of institution (HD2015) 1 Public, 4-year or above
Sector of institution (HD2015) 2 Private not-for-profit, 4-year or above
Sector of institution (HD2015) 3 Private for-profit, 4-year or above
Sector of institution (HD2015) 4 Public, 2-year
Sector of institution (HD2015) 5 Private not-for-profit, 2-year
Sector of institution (HD2015) 6 Private for-profit, 2-year
Sector of institution (HD2015) 7 Public, less-than 2-year
Sector of institution (HD2015) 8 Private not-for-profit, less-than 2-year
Sector of institution (HD2015) 9 Private for-profit, less-than 2-year
*/


**********************************************************************
* what we want to derive from the sector variable
* 1. a way to identify 4-year or not 4-year
* 2. a way to identify public
**********************************************************************


*identify four-year colleges
gen four_year = 0
replace four_year = 1 if sector == 1
replace four_year = 1 if sector == 2
replace four_year = 1 if sector == 3


*identify public colleges (more elegant)
*sector 7 (public, less-than 2-year) is also public according to the reference above
gen public = 0
replace public = 1 if inlist(sector, 1, 4, 7)

***** most elegant approach *****
drop four_year public

*identify four-year colleges
gen four_year = inrange(sector, 1, 3) // all values outside range, including missing, become zero.
                                      // in this dataset, there are no missing values for sector

*identify public colleges
gen public = inlist(sector, 1, 4, 7) // all values other than 1, 4 and 7, including missing, become zero.
                                     // in this dataset, there are no missing values for sector

*cross-tab
tab sector four_year
tab sector public

**cross-tab. before, we had 1284 Private not-for-profit, 4-year or above (category 2)
tab public four_year

*drop the for-profit four-year colleges (sector 3).
drop if sector == 3

**cross-tab. do i get 1284 for not public, four-year?
tab public four_year // yes
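
*added check (a sketch, not part of the original do-file): let stata verify the count
*instead of reading it off the table. 1284 is the figure quoted in the comment above.
count if public == 0 & four_year == 1
assert r(N) == 1284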


****************************************************************************************
* label values of variables
* 1. four-year
* 2. public
****************************************************************************************


*label four-year
label define four_year_label 0 "2-Year" 1 "4-Year"
label values four_year four_year_label


*did it work?
tab four_year


*label public
label define public 0 "Private" 1 "Public"
label values public public


*did it work?
tab public


*************************************************************************************
* inspect pct_admitted variable
*************************************************************************************

*summarize
su pct_admitted, d

*look at "obs." what does this tell you?
di _N

*generate indicator variable to investigate missing patterns more carefully
gen miss_admit = mi(pct_admitted)

*cross-tab and compute the percentage of miss_admit by liberal arts
*note: liberal_arts is not created in this do-file; it is assumed to exist in the loaded data already
tab miss_admit liberal_arts, col // 91.5% non-missing for lib arts.
                                 // 51% non-missing for non-liberal arts.

*how about institutions with "community college" in their name?
su pct_admitted if regexm(institutionname, "Community College") // only 5 report data!


* look at missing rate using the more granular classification system
tab classification miss_admit , m
tab classification miss_admit , row

*based on the description, it seems that most community colleges will be in 1-4.
tab classification if regexm(institutionname, "Community College")


*********************************************************************
* compute the share of students who are full-time
*********************************************************************

*generate new variable (a proportion between 0 and 1, not a 0-100 percentage)
gen ft_pct = ugrad_enrl_ft / ugrad_enrl

*summary stats
su ft_pct, d

*histogram. the d option is dropped here: for histogram it requests discrete bins, not detail
hist ft_pct
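
*added sanity check (a sketch, not part of the original do-file):
*a share of full-time students should lie between 0 and 1, so count any values that do not
count if !inrange(ft_pct, 0, 1) & !mi(ft_pct)
*ft_pct is missing when either enrollment count is missing or total enrollment is zero
count if mi(ft_pct)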
*********************************************************************
* re-order variables
*********************************************************************

*maybe we prefer this order
order unitid institutionname ugrad_enrl_ft ft_pct four_year public pct_admitted liberal_arts ugrad_enrl

* move liberal arts to the spot after institutionname
order liberal_arts, after(institutionname)


*save the cleaned dataset
save "$clean/us_college_data.dta", replace



*********************************************************************
* now, we decide we want to include more data about these colleges.
* perhaps we want to know more about variability by region.
* go back to ipeds, repeat the same steps for identifying the sample, and
* get a count of the number of students enrolling in each college by "home state."
* this time, for some reason, the data are in an excel spreadsheet, not a .csv file.
*********************************************************************

*clear the clean dataset from stata
clear

*load the new dataset. without options, the header row is read in as data
import excel using "$raw/us_data_stud_state.xlsx"



*clear it out and try again
clear

*tell stata that the first row has variable names
import excel using "$raw/us_data_stud_state.xlsx", firstrow



*get rid of the "EF2014C_RV" prefix. keep the rest of the variable name
rename EF2014C_RV* *

* prepare data for merge
* the key variable is the identifier--in this case, unitid
* 2 things to note
* 1. the VARIABLE NAME must match EXACTLY.
* 2. the VARIABLE CONTENTS must match EXACTLY.

*the other dataset, the "master" dataset, has a variable called "unitid"
*this dataset, the "using" dataset, has a variable called UnitID
*must rename!

*rename
rename UnitID unitid

*for a merge like this, we can't have duplicates on the merging variable (unitid)
*let's double-check
duplicates report unitid
*duplicates report State

*a manual way to ask stata to complain about uniqueness
*the State version would stop the do-file if several colleges share a state, so it is left commented out
*bys State: assert (_n==1)
bys unitid: assert (_n==1)
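
*added alternative (a sketch, not part of the original do-file):
*isid exits with an error unless unitid uniquely identifies the observations and has no missing values
isid unitid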

*save the "using" dataset
save "$clean/college_stud_state.dta", replace


********************************************************************************
* load the master dataset and merge in the new data
********************************************************************************

*master
use "$clean/us_college_data.dta", clear // one-step approach for clearing and loading

*merge
merge 1:1 unitid using "$clean/college_stud_state.dta"

*keep only colleges that matched in both datasets (this drops unmatched rows from either side)
drop if _merge != 3

*get rid of the auto-generated _merge variable
drop _merge

*drop InstitutionName. it was in both datasets with different variable names
drop InstitutionName

**********************************************************************
* Loop
* 1. change the crazy looking variable labels for the state variables to something more sensible
*    using a loop
* 2. save the new file to "analysis" folder, replacing the old version of the analysis file
**********************************************************************

*change the variable labels for the state variables

label var Alabama "Alabama Enrollees"
label var Alaska "Alaska Enrollees"


*type "help foreach" into the command window
*use the loop structure below to label ALL the state variables following the pattern above.
*you'll only have to put one line of code into the loop


*loop. note that we can take advantage of the order of the variables using "-"
foreach x of varlist Arizona-Wyoming {
    label var `x' "`x' Enrollees"
}

*** make the variable names and contents lowercase
*rename variables
rename Alabama-Wyoming, lower


*replace values of state with lower-case values
replace State = lower(State)


***get rid of spaces in the state variable
*note that State still has spaces, but variable names do not (and cannot)
replace State = subinstr(State, " ", "", .)
tab State


***note that missing is equivalent to zero for alabama-wyoming
*this seems like a good thing to fix with a loop.
*before going for the loop, it's helpful to do a few cases without looping
replace alabama = 0 if alabama == .
replace alaska = 0 if alaska == .



foreach s of varlist arizona-wyoming {
    replace `s' = 0 if `s' == .
}
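
*added check (a sketch, not part of the original do-file): confirm that no state count is still missing.
*n_miss is a hypothetical scratch variable; egen's rowmiss() counts missing values across the listed variables
egen n_miss = rowmiss(alabama-wyoming)
assert n_miss == 0
drop n_miss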

***break down hard problems into easier problems

*alabama
gen out_of_state = .
*replace out_of_state = alaska + arizona + ... // hmmm, seems hard
drop out_of_state


gen in_state = . // maybe this will be easier. then we could subtract in-state from "UStotal" to get out of state.
replace in_state = alabama if State == "alabama"
replace in_state = alaska if State == "alaska"
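
*added sketch (not part of the original do-file): generalize the two cases above with a loop.
*assumes the cleaned values of State match the lower-case variable names exactly
foreach s of varlist alabama-wyoming {
    replace in_state = `s' if State == "`s'"
}
*any college still missing in_state has a State value with no matching variable
count if mi(in_state)

*out-of-state enrollees by subtraction. the variable name UStotal is an assumption taken from
*the comment above, so the step only runs if a variable with that exact name exists
capture confirm variable UStotal
if _rc == 0 {
    gen out_of_state = UStotal - in_state
}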

*log close