
CHAPTER 1.

DATA PREPARATION
INTERMEDIATE ECONOMETRICS & DATA ANALYSIS
CHAPTER 1. PLAN

PART I. CODE SHEET
A. Create a new code sheet
B. Open an existing code sheet

PART II. LIBRARIES
A. Import libraries
B. Install libraries

PART III. DATA IMPORT
A. Import data
B. Read data

PART IV. DATA PREPARATION
A. Missing data
1. Detection
2. Removal
3. Imputation
B. Abnormal data points
1. Detection
2. Removal
3. Modification
C. Imbalanced datasets
1. Detection
2. Random Under-Sampling
3. Random Over-Sampling

PART V. AT-HOME PRACTICE
• “weight-height.csv”
• “melbourne_housing.csv”
CODE SHEET
PART I
A. CREATE A NEW CODE SHEET

To start a new code sheet (use Chrome to avoid problems when running the code):

1. Log in to your Google account.

2. Go to Google Colab: https://colab.research.google.com/

3. Click on “New notebook” in the pop-up window. Alternatively, you can click “File” and then “New notebook”.

4. If your Google Colab is in French, click on “Aide” then on “Afficher en anglais”.


B. OPEN AN EXISTING CODE SHEET

To open an already existing code sheet:


1. Log in to your Google account.

2. Go to Google Colab: https://colab.research.google.com/

3. Click on “Upload” in the pop-up window. Then select a file to upload it. Alternatively, you
can click “File” and then “Upload notebook”.
LIBRARIES
PART II
A. IMPORT LIBRARIES
(1/3)

• First, always ensure that you import the necessary libraries at the beginning. Otherwise, you will have to write all of the underlying code yourself.

• Then, use the following syntax to import a library:
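
A typical example, importing the Pandas library under its conventional abbreviation (“pd” is the standard alias, not a requirement):

    import pandas as pd  # “pd” is the abbreviation we will use for Pandas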

• The abbreviation speeds up our coding by allowing us to avoid typing the full
name each time we need it.
A. IMPORT LIBRARIES
(2/3)

• Here is how we will import our main libraries:
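
Assuming the main libraries are the ones used throughout this chapter (Pandas, NumPy, and Matplotlib), the imports would look like this:

    import pandas as pd              # data handling (DataFrames, missing values)
    import numpy as np               # computations (mean, quartiles, etc.)
    import matplotlib.pyplot as plt  # visualizations (histograms, boxplots, scatter plots)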

• Please note that you don’t need to import all libraries at once. You can start
with the ones you need and import additional libraries as you progress
through your code.
A. IMPORT LIBRARIES
(3/3)

• Sometimes, you may only need a specific tool from a library. In such cases,
you can import just that tool from the library using the following syntax:
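
For example, the upload tool used later in this chapter can be imported on its own:

    from google.colab import files  # import only the “files” tool, not the whole library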

• Lastly, all the libraries we will use are already available in Google Colab, so
no installation is required. You only need to import them to use them in your
code sheet.

• However, if you are curious about how to install a library, here is an example:
B. INSTALL LIBRARIES
(1/2)
• If you attempt to import a library named “lasio”, which is used to open a specific file
type (LAS):
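
In a fresh Colab session, the attempt fails with an error along these lines:

    import lasio
    # ModuleNotFoundError: No module named 'lasio'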

• This error message indicates that the library is not pre-installed and must be installed before it can be imported.
B. INSTALL LIBRARIES
(2/2)
• In this case, use the following syntax to install the library, and then you can
import it:
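
A minimal sketch of the two lines:

    !pip install lasio  # the “!” runs a shell command from the notebook
    import lasio        # the import now runs without an error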

• The library is now installed, and no error message appeared when running
the “import lasio” line of code.
DATA IMPORT
PART III
A. IMPORT DATA
(1/3)

• Now that we have the coding sheet and the necessary libraries, we can import our
data. This can be done in several ways, but one of the simplest methods is to:
1. Run the line: “from google.colab import files”;

2. Run the line: “files.upload()”;

3. Select and upload the file directly from your laptop.

• Let’s import the “weight-height” dataset.
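
Putting the steps together in a Colab cell:

    from google.colab import files
    files.upload()  # opens a file picker; select “weight-height.csv” from your laptop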


A. IMPORT DATA
(2/3)

• Please be aware that if third-party cookies are disabled in your web browser,
you will encounter the following error message:
A. IMPORT DATA
(3/3)
• To fix this error, you should:
1. Go to your browser settings;

2. Add the following link to the whitelist: https://[*.]googleusercontent.com:443


B. READ DATA
(1/3)

• Importing the data refers to the process of loading data from external
sources, such as CSV or Excel files, into your Python environment.

• Reading the data, on the other hand, specifically means accessing the data
and loading it into memory so it can be used within your Python program.

• Reading the data is typically part of the importing process. Pandas is commonly used for this purpose because it offers powerful and flexible tools for data manipulation and analysis.
B. READ DATA
(2/3)

• We also give the dataset a simple name like “df” (short for DataFrame) to
make referencing it easier, so we don’t have to write the full name each time.

• Next, we need to identify the type of file that contains our data, as the code
will vary accordingly. Here is the syntax for the two main file types we will be
using: CSV and Excel.
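
A minimal sketch for both file types (the Excel file name here is hypothetical):

    df = pd.read_csv("weight-height.csv")  # CSV file
    df = pd.read_excel("my_data.xlsx")     # Excel file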
B. READ DATA
(3/3)
• Finally, we display the dataset that has been imported and read by Pandas.
You have three options for displaying the data:
1. Show all rows of the dataset.

2. Specify a limited number of rows to display.

3. Display the default number of rows (5 rows).
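
In a Colab cell, the three options look like this (note that Pandas truncates the on-screen display of very large datasets):

    df           # 1. show all rows of the dataset
    df.head(20)  # 2. show a chosen number of rows, here 20
    df.head()    # 3. show the default number of rows (5)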

• Let’s apply this to the “weight-height” dataset after importing the necessary
libraries and reading the data.
DATA PREPARATION
PART IV
DATA PREPARATION PROCESS

• Data collected from the real world is often imperfect and needs to be prepared
before any analysis. Typically, you will encounter issues such as errors, missing
values, outliers, or imbalanced datasets. These problems must be addressed before
you start analyzing the data.

• In summary, the following steps need to be followed: First, collect the data. Next, identify impurities using methods such as descriptive statistics. Then, prepare the data by addressing these impurities. Finally, analyze the cleaned data. For this process, you will need two libraries: Pandas and NumPy (for computations such as the mean).
A. MISSING DATA
A. MISSING DATA
(1/6)

• Missing data refers to one or more missing values in an observation (a row) or a variable (a column).

• A missing value can be:

◦ Missing Not At Random (MNAR);

◦ Missing At Random (MAR);

◦ Missing Completely At Random (MCAR).


A. MISSING DATA
(2/6)

• A value is considered Missing Not At Random (MNAR) if the missingness is related to unobserved data (i.e., the missing value itself) and not to observed data (i.e., other variables in the dataset).

• For example, if a researcher focuses solely on the years when tax fraud
occurred and intentionally omits the years when it did not, this reflects
missingness that is directly related to unobserved values (the years when
fraud did not occur).
A. MISSING DATA
(3/6)

• A value is considered Missing At Random (MAR) if the missingness is related to observed data (i.e., other variables in the dataset) and not to unobserved data (i.e., the missing value itself).

• For instance, a survey respondent may deliberately skip the question “How many books do you read per month?” because of their level of education (an observed variable) and not because of the actual number of books they read (an unobserved value).
A. MISSING DATA
(4/6)

• A value is considered Missing Completely At Random (MCAR) if the missingness is completely unrelated to either the observed data (i.e., other variables in the dataset) or the unobserved data (i.e., the missing value itself).

• In other words, the missing data is purely random or caused by external factors unrelated to the dataset. For example:

◦ A survey respondent accidentally skips a question (e.g., “What is your favorite movie?”);

◦ A researcher is unable to find a specific piece of information (e.g., a company’s net income for a particular year).
A. MISSING DATA
(5/6)

• In some cases, missing data can be informative, as its absence may convey valuable
information, leading to context-aware insights and a deeper understanding. For
example, if a teacher records the number of students absent for each session and
nothing is recorded for a particular session, it may indicate that no students were
absent that day.

• Therefore, understanding these types (i.e., MNAR, MAR or MCAR) is essential for
choosing the right methods to handle missing data in your analysis. Common
approaches to handling missing data include removing the missing values, imputing
them, or using advanced modeling techniques designed to account for the missing data.
A. MISSING DATA
(6/6)

• To remove missing data, you have two options:


◦ Remove the variable: a column is deleted when it contains many missing values.

◦ Remove the observation: a row is deleted when it contains many missing values.

• To impute missing data, choose a value based on the type of variable:


◦ Quantitative: use the mean or median to fill in missing values.

◦ Qualitative: use the mode to fill in missing values.


1. DETECT MISSING VALUES

• In this section, we will cover how to detect:

a. The total number of missing values;

b. The number of missing values per variable (per column);

c. The number of missing values per observation (per row).

• The dataset we will use is as follows – open the “2- Missing Data” Python
script:
a. DETECT ALL MISSING VALUES

• To find the total number of missing values, use the following syntax:
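
With Pandas, chaining “sum()” twice gives the grand total, assuming the dataset has been read into “df”:

    df.isnull().sum().sum()  # total number of missing values in the dataset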
b. DETECT MISSING VALUES PER VARIABLE

• Since we are using only the first “sum()”, we directly obtain the number of
missing values per variable (per column). Alternatively, using “sum(axis=0)”
will yield the same results.
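
For example:

    df.isnull().sum()        # missing values per variable (per column)
    df.isnull().sum(axis=0)  # equivalent syntax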
c. DETECT MISSING VALUES PER
OBSERVATION

• To identify observations with missing values, you need to use “sum(axis=1)”. Otherwise, you will get the number of missing values per variable (per column).
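
For example:

    df.isnull().sum(axis=1)  # missing values per observation (per row)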
2. REMOVE MISSING VALUES

• To permanently apply changes (including those for outliers), use the “inplace=True” parameter. For example:

DELETE OBSERVATIONS / DELETE VARIABLES
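
A minimal sketch with Pandas’ dropna():

    df.dropna(inplace=True)          # delete observations (rows) with missing values
    df.dropna(axis=1, inplace=True)  # delete variables (columns) with missing values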


3. IMPUTE MISSING VALUES

THE MEAN/MEDIAN / THE MODE
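
A sketch using the “weight-height” columns (Height is quantitative, Gender is qualitative):

    df["Height"] = df["Height"].fillna(df["Height"].mean())     # quantitative: the mean
    df["Height"] = df["Height"].fillna(df["Height"].median())   # ...or the median
    df["Gender"] = df["Gender"].fillna(df["Gender"].mode()[0])  # qualitative: the mode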


B. ABNORMAL DATA POINTS
B. ABNORMAL DATA POINTS
(1/4)
• Not all data points are equal. Some are valuable and contribute positively to
the analysis, while others may introduce bias.

• To avoid bias, it is important to carefully examine problematic data points. These are points that are abnormally far from other data and often represent extreme maximums or minimums.

• There are three types of abnormal data points:

◦ Extreme values;

◦ Outliers;

◦ Influential points.
B. ABNORMAL DATA POINTS
(2/4)
EXTREME VALUES
• X value: abnormal distance.
• Y value: follows the trend.
 The observation significantly differs from the others and may not belong to the population.

OUTLIERS
• X value: normal distance.
• Y value: does not follow the trend.
 A significant independent variable that could explain the difference in Y is missing in this model.

INFLUENTIAL POINTS
• X value: abnormal distance.
• Y value: does not follow the trend.
 This point biases the model. It should be addressed before analysis to achieve the most accurate and well-fitting model.
B. ABNORMAL DATA POINTS
(3/4)
Understanding the reason behind an abnormal data point is crucial for
addressing it effectively:

1. If it is an obvious error  Correct it: A grade of 150/20.

2. If it represents an omitted independent variable  Include it in the model: A house that is significantly more expensive than others of the same size might be due to factors like location, which were not initially considered if only the surface was taken into account.

3. If the data point does not belong to our population  Remove it: In a study on the
impact of a car’s engine power on its price, a Ferrari included in the sample may be
irrelevant if we are not interested in luxury cars.
B. ABNORMAL DATA POINTS
(4/4)

Understanding the reason behind an abnormal data point is crucial for addressing it effectively:

4. If the data point is part of our population but still differs significantly from other
points  Monitor it closely and include it in the model.

5. Otherwise, try different models and report the results for each scenario:

× With the problematic point included;

× Without the problematic point;

× With the problematic point, but after modification.


1. DETECT ABNORMAL POINTS
(1/5)
• Descriptive statistics offer several tools for detecting potentially problematic
points, including:
◦ Graphs: Such as histograms, boxplots, and scatter plots.

◦ Summary statistics: Such as the first quartile (Q1), third quartile (Q3), and
interquartile range (IQR).

• We typically begin with graphs to gain a general overview of the dataset and
identify any potential abnormal point. After that, we can apply methods such
as the “interquartile range (IQR) rule” to establish thresholds beyond which
points may be considered problematic.
1. DETECT ABNORMAL POINTS
(2/5)

• Note that the libraries required for this part are:


◦ NumPy for computations;

◦ Matplotlib.pyplot for visualizations.

• Let’s now focus on the Height variable from the “weight-height” dataset.
1. DETECT ABNORMAL POINTS
(3/5)

HISTOGRAM BOXPLOT SCATTER PLOT
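
A sketch of the three graphs with Matplotlib, applied to the Height variable (styling options omitted):

    plt.hist(df["Height"])               # histogram
    plt.show()
    plt.boxplot(df["Height"])            # boxplot
    plt.show()
    plt.scatter(df.index, df["Height"])  # scatter plot of Height against the row index
    plt.show()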


1. DETECT ABNORMAL POINTS
(4/5)
1. DETECT ABNORMAL POINTS
(5/5)
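
A sketch of the IQR rule with NumPy, applied to the Height variable:

    q1 = np.percentile(df["Height"], 25)  # first quartile (Q1)
    q3 = np.percentile(df["Height"], 75)  # third quartile (Q3)
    iqr = q3 - q1                         # interquartile range (IQR)
    lower = q1 - 1.5 * iqr                # lower threshold
    upper = q3 + 1.5 * iqr                # upper threshold
    df[(df["Height"] < lower) | (df["Height"] > upper)]  # potentially abnormal points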
2. REMOVE ABNORMAL POINTS
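
A sketch that keeps only the observations inside the IQR thresholds computed above:

    df = df[(df["Height"] >= lower) & (df["Height"] <= upper)]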
3. MODIFY ABNORMAL VALUES
(1/5)
• Note that a data point, even if it differs from the rest of the dataset, should
not be removed unless there is a strong objective reason to do so (e.g., bias,
the point not belonging to the population, etc.).

• If there is no reason to delete a data point, you can mitigate its impact by
modifying its value. To do this, follow these steps:
1. Remove the abnormal points using the same steps as before.

2. Report the removed points as missing data.

3. Impute the missing values using the same methods as before.


3. MODIFY ABNORMAL VALUES
(2/5)

• The required libraries in this case are:


◦ Pandas for handling missing values.

◦ NumPy for computations such as the mean, median, Q1, Q3, IQR, etc.

• Remember that handling data requires caution. Altering or deleting values is not always appropriate and should be done with careful consideration.
3. MODIFY ABNORMAL VALUES
(3/5)
3. MODIFY ABNORMAL VALUES
(4/5)
3. MODIFY ABNORMAL VALUES
(5/5)
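
A sketch of the three steps for the Height variable, reusing the IQR thresholds from before:

    # 1-2. report abnormal values as missing data instead of deleting the rows
    df.loc[(df["Height"] < lower) | (df["Height"] > upper), "Height"] = np.nan
    # 3. impute the resulting missing values, here with the median
    df["Height"] = df["Height"].fillna(df["Height"].median())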
C. IMBALANCED DATASETS
C. IMBALANCED DATASETS
(1/3)

• Since this course focuses on classification problems, we will develop models to classify new data points and determine their categories. Essentially, we will use a dataset to build a model and then evaluate its performance on new data points.

• The dataset used for building the model and addressing the classification problem
includes:
◦ Several explanatory variables (qualitative and/or quantitative), referred to as “features”;

◦ One explained variable (qualitative), known as the “outcome”, “target” or “class”.

• In this context, one challenge we may encounter is dealing with imbalanced datasets.
C. IMBALANCED DATASETS
(2/3)

• An imbalanced dataset refers to a situation where one class is significantly over-represented compared to the other classes. This can be a serious issue, and ignoring it may lead to biased results.

• A dataset with a slight imbalance is generally not problematic. However, there is no specific threshold to define what constitutes a slight versus severe imbalance. For this course, we will consider an imbalance ratio of up to 40:60 as acceptable.
C. IMBALANCED DATASETS
(3/3)

(Illustration: an IMBALANCED class distribution compared with a BALANCED one.)
1. DETECT IMBALANCED CLASSES
(1/2)
• To determine if a dataset is balanced or not, follow these steps:
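
One common way with Pandas (the column name “target” is illustrative):

    df["target"].value_counts()                # number of cases in each class
    df["target"].value_counts(normalize=True)  # the same, as proportions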
1. DETECT IMBALANCED CLASSES
(2/2)
2. RANDOM UNDER-SAMPLING (RUS)
(1/8)

• Random under-sampling is a data analysis technique used to handle imbalanced datasets by decreasing the number of cases in the majority class.

• The aim is to create a more balanced class distribution, which can enhance the performance of machine learning models sensitive to class imbalance.
2. RANDOM UNDER-SAMPLING (RUS)
(2/8)

• For example, if we have an imbalanced dataset with 1 “yes” and 5 “no” responses, under-sampling would involve randomly selecting 1 “no” out of the 5 available. This results in a balanced dataset with 1 “yes” and 1 “no”.
2. RANDOM UNDER-SAMPLING (RUS)
(3/8)

Here are the steps to perform random under-sampling:

1. Determine the class distribution in your dataset to identify the majority and
minority classes.

2. Randomly select instances from the majority class. Then, remove the
randomly selected instances from the majority class until its size is
comparable to the minority class or reaches a desired proportion.

3. Combine the remaining instances of the majority class with all instances of
the minority class to form a new, balanced dataset.
2. RANDOM UNDER-SAMPLING (RUS)
(4/8)
2. RANDOM UNDER-SAMPLING (RUS)
(5/8)
2. RANDOM UNDER-SAMPLING (RUS)
(6/8)
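
A sketch of these steps with Pandas, assuming a two-class DataFrame “df” whose class column is named “target” (an illustrative name):

    counts = df["target"].value_counts()
    minority = df[df["target"] == counts.idxmin()]  # all minority-class cases
    majority = df[df["target"] == counts.idxmax()]  # all majority-class cases

    # randomly keep as many majority cases as there are minority cases
    majority_rus = majority.sample(n=len(minority), random_state=42)

    # combine into a new, balanced dataset
    df_rus = pd.concat([minority, majority_rus])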
2. RANDOM UNDER-SAMPLING (RUS)
(7/8)

Random under-sampling is suitable when:

• The dataset is highly imbalanced, and the loss of some cases from the
majority class won’t significantly impact the model’s ability to learn;

• The dataset is very large, and reducing its size can enhance the efficiency of
the training process.
2. RANDOM UNDER-SAMPLING (RUS)
(8/8)

However, using this technique can result in a:

• Loss of information: Removing cases from the majority class can lead to the
loss of potentially valuable information, which might negatively affect the
model’s performance.

• Risk of under-representation: Important patterns and variations in the majority class might be under-represented after under-sampling.
3. RANDOM OVER-SAMPLING (ROS)
(1/7)

• Random over-sampling is another technique used to address imbalanced datasets. It involves increasing the number of instances in the minority class by randomly duplicating existing instances until the classes are balanced or the minority class reaches a desired proportion relative to the majority class.

• In the following example, the dataset is imbalanced with 1 “yes” and 5 “no”
cases. Over-sampling involves randomly adding 4 more “yes” instances. This
results in a balanced dataset with 5 “yes” and 5 “no” cases.
3. RANDOM OVER-SAMPLING (ROS)
(2/7)
3. RANDOM OVER-SAMPLING (ROS)
(3/7)

Here are the steps to perform random over-sampling:

1. Determine the class distribution in your dataset to identify the majority and
minority classes.

2. Randomly select instances from the minority class and duplicate them until
the number of instances in the minority class matches that of the majority
class or achieves the desired proportion.

3. Combine the original data with the newly duplicated instances to form a
new, balanced dataset.
3. RANDOM OVER-SAMPLING (ROS)
(4/7)
3. RANDOM OVER-SAMPLING (ROS)
(5/7)
3. RANDOM OVER-SAMPLING (ROS)
(6/7)
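
A sketch with Pandas, under the same assumptions as the under-sampling example (“target” is an illustrative column name):

    counts = df["target"].value_counts()
    minority = df[df["target"] == counts.idxmin()]  # all minority-class cases
    majority = df[df["target"] == counts.idxmax()]  # all majority-class cases

    # randomly duplicate minority cases (with replacement) until the class sizes match
    extra = minority.sample(n=len(majority) - len(minority), replace=True, random_state=42)

    # combine the original data with the newly duplicated instances
    df_ros = pd.concat([df, extra])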
3. RANDOM OVER-SAMPLING (ROS)
(7/7)

• Random over-sampling is suitable when the dataset is highly imbalanced and under-sampling the majority class would lead to a significant loss of valuable information.

• However, using this technique can result in:

◦ Overfitting: Since the same instances from the minority class are used
multiple times, the model might overfit to these instances, reducing its
generalization ability.
◦ Artificial data: The technique doesn’t create new information but simply duplicates existing data, which might not always be beneficial for model performance.

The data is clean and ready for analysis!
https://www.wooclap.com/
AT-HOME PRACTICE
PART V
CHAPTER 1 HOMEWORK

PRACTICE I
1. Import the “weight-height” dataset;
2. Check the Gender class distribution;
3. If there is an imbalance, address it. Otherwise, proceed;
4. Detect abnormal points in the Weight variable;
5. Remove abnormal points from the Weight variable;
6. Permanently modify the abnormal points of the Weight variable.

PRACTICE II
1. Import the “melbourne_housing” dataset;
2. Detect missing values for each variable;
3. Remove missing values from a selected variable;
4. Permanently impute missing values for the selected variable.
