Chapter 1. Data Preparation (2)
DATA PREPARATION
INTERMEDIATE ECONOMETRICS & DATA ANALYSIS
CHAPTER 1. PLAN
• To create a notebook: click “New notebook” in the pop-up window. Alternatively, click “File” and then “New notebook”.
• To upload a notebook: click “Upload” in the pop-up window, then select a file. Alternatively, click “File” and then “Upload notebook”.
LIBRARIES
PART II
A. IMPORT LIBRARIES
(1/3)
• First, always ensure that you import the necessary libraries at the beginning of your notebook. Otherwise, the functions they provide will not be available, and you would have to write that code yourself.
• Importing a library under an abbreviation (an alias) speeds up our coding by letting us avoid typing the full library name each time we need it.
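As a sketch, the two libraries used throughout this chapter are conventionally imported under the aliases “pd” and “np”:

```python
import pandas as pd   # data manipulation (DataFrames)
import numpy as np    # numerical computations (mean, median, etc.)

# The alias lets us write np.mean(...) instead of numpy.mean(...)
print(np.mean([1, 2, 3]))  # 2.0
```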
A. IMPORT LIBRARIES
(2/3)
• Please note that you don’t need to import all libraries at once. You can start
with the ones you need and import additional libraries as you progress
through your code.
A. IMPORT LIBRARIES
(3/3)
• Sometimes, you may only need a specific tool from a library. In such cases,
you can import just that tool from the library using the following syntax:
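The slide’s code is not reproduced in the notes; a minimal sketch of the from-import syntax, using Python’s standard math library:

```python
# Import only the sqrt function, not the entire math library
from math import sqrt

print(sqrt(16))  # 4.0
```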
• Lastly, all the libraries we will use are already available in Google Colab, so
no installation is required. You only need to import them to use them in your
code sheet.
• However, if you are curious about how to install a library, here is an example:
B. INSTALL LIBRARIES
(1/2)
• If you attempt to import a library named “lasio”, which is used to open a specific file
type (LAS):
• This error message (a ModuleNotFoundError) indicates that the library is not installed in the environment and must be installed first.
B. INSTALL LIBRARIES
(2/2)
• In this case, use the following syntax to install the library, and then you can
import it:
• The library is now installed, and no error message appears when running the “import lasio” line of code.
DATA IMPORT
PART III
A. IMPORT DATA
(1/3)
• Now that we have the coding sheet and the necessary libraries, we can import our
data. This can be done in several ways, but one of the simplest methods is to:
1. Run the line: “from google.colab import files”;
• Please be aware that if third-party cookies are disabled in your web browser,
you will encounter the following error message:
A. IMPORT DATA
(3/3)
• To fix this error, you should:
1. Go to your browser settings;
• Importing the data refers to the process of loading data from external
sources, such as CSV or Excel files, into your Python environment.
• Reading the data, on the other hand, specifically means accessing the data
and loading it into memory so it can be used within your Python program.
• We also give the dataset a simple name like “df” (short for DataFrame) to
make referencing it easier, so we don’t have to write the full name each time.
• Next, we need to identify the type of file that contains our data, as the code
will vary accordingly. Here is the syntax for the two main file types we will be
using: CSV and Excel.
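A sketch of the reading syntax, assuming a hypothetical file name (replace it with the file you uploaded):

```python
import pandas as pd

# Write a tiny CSV so this sketch is self-contained;
# in practice you would read your uploaded file, e.g. "weight-height.csv"
with open("sample.csv", "w") as f:
    f.write("Gender,Height,Weight\nMale,73.8,241.9\nFemale,58.9,102.1\n")

df = pd.read_csv("sample.csv")        # for CSV files
# df = pd.read_excel("sample.xlsx")   # for Excel files (requires openpyxl)
print(df.shape)  # (2, 3)
```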
B. READ DATA
(3/3)
• Finally, we display the dataset that has been imported and read by Pandas.
You have three options for displaying the data:
1. Show all rows of the dataset.
• Let’s apply this to the “weight-height” dataset after importing the necessary
libraries and reading the data.
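A sketch of the three display options, using a small hypothetical stand-in for the “weight-height” dataset:

```python
import pandas as pd

# Hypothetical stand-in for the "weight-height" dataset
df = pd.DataFrame({
    "Gender": ["Male", "Male", "Female", "Female"],
    "Height": [73.8, 68.8, 58.9, 65.3],
    "Weight": [241.9, 162.3, 102.1, 141.3],
})

print(df)          # option 1: all rows
print(df.head(2))  # option 2: the first rows only
print(df.tail(2))  # option 3: the last rows only
```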
DATA PREPARATION
PART IV
DATA PREPARATION PROCESS
• Data collected from the real world is often imperfect and needs to be prepared
before any analysis. Typically, you will encounter issues such as errors, missing
values, outliers, or imbalanced datasets. These problems must be addressed before
you start analyzing the data.
• In summary, the following steps need to be followed: First, collect the data. Then,
identify impurities using methods such as descriptive statistics. Then, prepare the data
by addressing these impurities. Finally, analyze the cleaned data. For this process, you
will need two libraries: Pandas and NumPy (for computations such as the mean).
A. MISSING DATA
A. MISSING DATA
(1/6)
• For example, if a researcher focuses solely on the years when tax fraud
occurred and intentionally omits the years when it did not, this reflects
missingness that is directly related to unobserved values (the years when
fraud did not occur).
A. MISSING DATA
(3/6)
• In other words, the missing data is purely random or caused by external factors
unrelated to the dataset. For example:
◦ A survey respondent accidentally skips a question (e.g., “What is your favorite
movie?”);
• In some cases, missing data can be informative, as its absence may convey valuable
information, leading to context-aware insights and a deeper understanding. For
example, if a teacher records the number of students absent for each session and
nothing is recorded for a particular session, it may indicate that no students were
absent that day.
• Therefore, understanding these types (i.e., MNAR, MAR or MCAR) is essential for
choosing the right methods to handle missing data in your analysis. Common
approaches to handling missing data include removing the missing values, imputing
them, or using advanced modeling techniques designed to account for the missing data.
A. MISSING DATA
(6/6)
◦ Remove the observation: a row is deleted when it contains many missing values.
• The dataset we will use is as follows – open the “2- Missing Data” Python
script:
a. DETECT ALL MISSING VALUES
• To find the total number of missing values, use the following syntax:
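A sketch on a hypothetical dataset with three missing values:

```python
import pandas as pd
import numpy as np

# Hypothetical dataset: Height has 1 missing value, Weight has 2
df = pd.DataFrame({"Height": [65.0, np.nan, 70.2, 68.1],
                   "Weight": [150.0, 120.5, np.nan, np.nan]})

# The first sum() counts per column; the second totals across columns
total_missing = df.isnull().sum().sum()
print(total_missing)  # 3
```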
b. DETECT MISSING VALUES PER VARIABLE
• Since we are using only the first “sum()”, we directly obtain the number of
missing values per variable (per column). Alternatively, using “sum(axis=0)”
will yield the same results.
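A sketch of the per-variable count, on the same kind of hypothetical data:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({"Height": [65.0, np.nan, 70.2, 68.1],
                   "Weight": [150.0, 120.5, np.nan, np.nan]})

# One sum() gives a count per column; df.isnull().sum(axis=0) is equivalent
per_variable = df.isnull().sum()
print(per_variable)  # Height: 1, Weight: 2
```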
c. DETECT MISSING VALUES PER
OBSERVATION
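A sketch of the per-observation count: passing axis=1 sums across columns, yielding one count per row.

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({"Height": [65.0, np.nan, 70.2, 68.1],
                   "Weight": [150.0, 120.5, np.nan, np.nan]})

# axis=1 sums across columns: a missing-value count for each row
per_row = df.isnull().sum(axis=1)
print(per_row.tolist())  # [0, 1, 1, 1]
```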
◦ Outliers;
◦ Influential points.
B. ABNORMAL DATA POINTS
(2/4)
EXTREME VALUES
◦ X value: abnormal distance;
◦ Y value: follows the trend.
OUTLIERS
◦ X value: normal distance;
◦ Y value: does not follow the trend.
INFLUENTIAL POINTS
◦ X value: abnormal distance;
◦ Y value: does not follow the trend.
3. If the data point does not belong to our population → remove it. In a study on the
impact of a car’s engine power on its price, a Ferrari included in the sample may be
irrelevant if we are not interested in luxury cars.
B. ABNORMAL DATA POINTS
(4/4)
4. If the data point is part of our population but still differs significantly from other
points → monitor it closely and include it in the model.
5. Otherwise, try different models and report the results for each scenario:
◦ Summary statistics: Such as the first quartile (Q1), third quartile (Q3), and
interquartile range (IQR).
• We typically begin with graphs to gain a general overview of the dataset and
identify any potential abnormal point. After that, we can apply methods such
as the “interquartile range (IQR) rule” to establish thresholds beyond which
points may be considered problematic.
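A sketch of the IQR rule on hypothetical Height values:

```python
import pandas as pd

# Hypothetical Height values (inches) with one abnormal point
heights = pd.Series([63.5, 65.1, 66.2, 67.0, 68.4, 69.1, 70.3, 95.0])

q1, q3 = heights.quantile(0.25), heights.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Points outside [lower, upper] are flagged as potentially abnormal
flagged = heights[(heights < lower) | (heights > upper)]
print(flagged.tolist())  # [95.0]
```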
1. DETECT ABNORMAL POINTS
(2/5)
• Let’s now focus on the Height variable from the “weight-height” dataset.
1. DETECT ABNORMAL POINTS
(3/5)
• If there is no reason to delete a data point, you can mitigate its impact by
modifying its value. To do this, follow these steps:
1. Remove the abnormal points using the same steps as before.
◦ NumPy for computations such as the mean, median, Q1, Q3, IQR, etc.
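One common way to modify (rather than delete) abnormal points is to cap them at the IQR thresholds; a sketch with hypothetical Weight values:

```python
import pandas as pd

# Hypothetical Weight values with one abnormal point
weights = pd.Series([120.0, 135.5, 150.2, 160.0, 172.3, 410.0])

q1, q3 = weights.quantile(0.25), weights.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Replace values beyond the thresholds by the nearest threshold
weights_capped = weights.clip(lower=lower, upper=upper)
print(bool(weights_capped.max() <= upper))  # True
```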
• The dataset used for building the model and addressing the classification problem
includes:
◦ Several explanatory variables (qualitative and/or quantitative), referred to as “features”;
• In this context, one challenge we may encounter is dealing with imbalanced datasets.
C. IMBALANCED DATASETS
(2/3)
[Figure: an imbalanced dataset vs. a balanced dataset]
1. DETECT IMBALANCED CLASSES
(1/2)
• To determine if a dataset is balanced or not, follow these steps:
1. DETECT IMBALANCED CLASSES
(2/2)
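A sketch of checking the class distribution with pandas value_counts, on a hypothetical binary variable:

```python
import pandas as pd

# Hypothetical class variable: 7 "Male" vs 3 "Female" cases
df = pd.DataFrame({"Gender": ["Male"] * 7 + ["Female"] * 3})

print(df["Gender"].value_counts())                # absolute counts
print(df["Gender"].value_counts(normalize=True))  # shares: 0.7 vs 0.3
```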
2. RANDOM UNDER-SAMPLING (RUS)
(1/8)
• The aim is to create a more balanced class distribution, which can enhance
the performance of machine learning models sensitive to class imbalance.
2. RANDOM UNDER-SAMPLING (RUS)
(2/8)
1. Determine the class distribution in your dataset to identify the majority and
minority classes.
2. Randomly select instances from the majority class. Then, remove the
randomly selected instances from the majority class until its size is
comparable to the minority class or reaches a desired proportion.
3. Combine the remaining instances of the majority class with all instances of
the minority class to form a new, balanced dataset.
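The three steps above can be sketched in pandas, assuming a DataFrame with a binary target column “y”:

```python
import pandas as pd

# Imbalanced toy dataset: 6 "no" vs 2 "yes"
df = pd.DataFrame({"x": range(8),
                   "y": ["no"] * 6 + ["yes"] * 2})

# Step 1: identify majority and minority classes
counts = df["y"].value_counts()
majority = df[df["y"] == counts.idxmax()]
minority = df[df["y"] == counts.idxmin()]

# Step 2: randomly keep only as many majority rows as minority rows
majority_down = majority.sample(n=len(minority), random_state=42)

# Step 3: combine into a new, balanced dataset
balanced = pd.concat([majority_down, minority])
print(balanced["y"].value_counts())  # no: 2, yes: 2
```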
2. RANDOM UNDER-SAMPLING (RUS)
(4/8)
2. RANDOM UNDER-SAMPLING (RUS)
(5/8)
2. RANDOM UNDER-SAMPLING (RUS)
(6/8)
2. RANDOM UNDER-SAMPLING (RUS)
(7/8)
• The dataset is highly imbalanced, and the loss of some cases from the
majority class won’t significantly impact the model’s ability to learn;
• The dataset is very large, and reducing its size can enhance the efficiency of
the training process.
2. RANDOM UNDER-SAMPLING (RUS)
(8/8)
• Loss of information: Removing cases from the majority class can lead to the
loss of potentially valuable information, which might negatively affect the
model’s performance.
• In the following example, the dataset is imbalanced with 1 “yes” and 5 “no”
cases. Over-sampling involves randomly adding 4 more “yes” instances. This
results in a balanced dataset with 5 “yes” and 5 “no” cases.
3. RANDOM OVER-SAMPLING (ROS)
(2/7)
3. RANDOM OVER-SAMPLING (ROS)
(3/7)
1. Determine the class distribution in your dataset to identify the majority and
minority classes.
2. Randomly select instances from the minority class and duplicate them until
the number of instances in the minority class matches that of the majority
class or achieves the desired proportion.
3. Combine the original data with the newly duplicated instances to form a
new, balanced dataset.
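The three steps above can be sketched in pandas, mirroring the 1 “yes” / 5 “no” example:

```python
import pandas as pd

# Imbalanced toy dataset: 5 "no" vs 1 "yes"
df = pd.DataFrame({"x": range(6),
                   "y": ["no"] * 5 + ["yes"]})

# Step 1: identify majority and minority classes
counts = df["y"].value_counts()
majority = df[df["y"] == counts.idxmax()]
minority = df[df["y"] == counts.idxmin()]

# Step 2: duplicate minority rows (sampling WITH replacement) up to the majority size
minority_up = minority.sample(n=len(majority), replace=True, random_state=42)

# Step 3: combine into a new, balanced dataset
balanced = pd.concat([majority, minority_up])
print(balanced["y"].value_counts())  # no: 5, yes: 5
```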
3. RANDOM OVER-SAMPLING (ROS)
(4/7)
3. RANDOM OVER-SAMPLING (ROS)
(5/7)
3. RANDOM OVER-SAMPLING (ROS)
(6/7)
3. RANDOM OVER-SAMPLING (ROS)
(7/7)
◦ Overfitting: Since the same instances from the minority class are used
multiple times, the model might overfit to these instances, reducing its
generalization ability.
◦ Artificial data: The technique doesn’t create new information but simply
duplicates existing data, which might not always be beneficial for model
performance.
The data is clean and ready for analysis!
https://www.wooclap.com/
AT-HOME PRACTICE
PART V
CHAPTER 1 HOMEWORK
PRACTICE I
1. Import the “weight-height” dataset;
2. Check the Gender class distribution;
5. Remove abnormal points from the Weight variable;
6. Permanently modify the abnormal points of the Weight variable.

PRACTICE II
1. Import the “melbourne_housing” dataset;
… selected variable;
4. Permanently impute missing values for the selected variable.