FDS Unit 2

The document discusses different methods for handling missing data in machine learning datasets. It describes three common types of missing data: missing completely at random, missing at random, and missing not at random. It then outlines three strategies for handling missing data: deleting rows with missing values, imputing missing values, and using models to predict missing values. For numerical columns with missing values, methods like replacing with the mean, median, or values from other columns are proposed. For categorical columns, replacing with a constant or most popular value is recommended.

Uploaded by Amit Adhikari

How to create a dataset for a classification problem with python?

Creating a dataset for regression, classification, and clustering problems using Python:
To create a dataset for a classification problem with Python, we use
the make_classification method available in the scikit-learn library.
By default, make_classification returns ndarrays corresponding to the
variables/features and the target/output. To generate a classification dataset, the method
takes the following parameters:
 n_samples: the number of samples/rows.
 n_features: the number of features/columns.
 n_informative: the number of features that have a role in predicting the output.
 n_redundant: the number of redundant features, generated as random linear combinations of the informative features.
 n_classes: the number of classes/labels for the classification problem.
 weights: the proportion of samples assigned to each class. Passing None produces balanced classes.
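A minimal sketch of such a call; the parameter values below are illustrative, not prescribed by the text:

```python
# Generate a small synthetic classification dataset with scikit-learn.
from sklearn.datasets import make_classification

X, y = make_classification(
    n_samples=100,    # 100 rows
    n_features=5,     # 5 columns
    n_informative=3,  # 3 features actually drive the label
    n_redundant=1,    # 1 feature is a combination of informative ones
    n_classes=2,      # binary classification
    weights=None,     # None -> balanced classes
    random_state=42,  # reproducible output
)

print(X.shape)  # (100, 5)
print(y.shape)  # (100,)
```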
How to create a dataset for regression problems with python?
Similarly to make_classification, the make_regression method returns, by default, ndarrays
corresponding to the variables/features and the target/output. To generate a regression
dataset, the method takes the following parameters:
 n_samples: the number of samples/rows.
 n_features: the number of features/columns.
 n_informative: the number of informative variables.
 n_targets: the number of regression targets/outputs; a value of 2 means each sample has 2 outputs.
 noise: the standard deviation of the Gaussian noise applied to the output.
 shuffle: whether to mix the samples and the features.
 coef: whether to also return the coefficients of the underlying linear model.
 random_state: the seed for the random number generator, so the same dataset can be reproduced on reuse.
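A minimal sketch of the call, again with illustrative parameter values:

```python
# Generate a small synthetic regression dataset with scikit-learn.
from sklearn.datasets import make_regression

X, y, coef = make_regression(
    n_samples=100,
    n_features=4,
    n_informative=2,  # only 2 of the 4 features influence the target
    n_targets=1,      # one output per sample
    noise=0.5,        # Gaussian noise on the output
    shuffle=True,
    coef=True,        # also return the underlying linear coefficients
    random_state=42,
)

print(X.shape, y.shape, coef.shape)  # (100, 4) (100,) (4,)
```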

How to create a dataset for a clustering problem with python?


The make_blobs method returns, by default, ndarrays corresponding to the
variables/features/columns containing the data, and the target/output containing the cluster
labels. To generate a clustering dataset, the method takes the following parameters:
 n_samples: the number of samples/rows. Passed as an integer, it divides the points equally among the clusters; passed as an array, each element gives the number of samples per cluster.
 n_features: the number of features/columns.
 centers: the number of centers to generate, or fixed center locations, for your clusters.
 cluster_std: the standard deviation of each cluster.
 shuffle: mixes the various rows/samples.
 random_state: the seed for the random number generator used to produce the dataset.
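A minimal sketch of the call with illustrative values:

```python
# Generate a small synthetic clustering dataset with scikit-learn.
from sklearn.datasets import make_blobs

X, y = make_blobs(
    n_samples=150,    # an integer, so points are split equally: 50 per cluster
    n_features=2,
    centers=3,        # three cluster centers
    cluster_std=1.0,  # standard deviation of each cluster
    shuffle=True,
    random_state=42,
)

print(X.shape)   # (150, 2)
print(set(y))    # {0, 1, 2} -- one label per cluster
```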
Random user information with Faker
Faker is one of the most popular Python libraries for generating all kinds of random
information. Some commonly used attributes Faker generates are:
 Personal info: name, birthday, email, password, address
 All kinds of date and timezone information
 Financial details: credit cards, SSNs, banking
 Misc: URLs, sentences, language codes

DATA IMPORTING

Python has various modules that help us import external data in various file
formats into a Python program. In the examples below we will see how to import data of
various formats.

Import csv file

The csv module enables us to read each row in the file using a comma as a delimiter.
We first open the file in read-only mode and then assign the delimiter. Finally, we use a for
loop to read each row from the CSV file.
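A minimal sketch of that loop; the file name data.csv is illustrative, and the sketch writes a small file first so it is self-contained:

```python
import csv

# Illustrative setup: create a small CSV file to read back.
with open("data.csv", "w", newline="") as f:
    f.write("name,age\nAlice,30\nBob,25\n")

rows = []
with open("data.csv", "r") as f:           # open in read mode
    reader = csv.reader(f, delimiter=",")  # comma as the delimiter
    for row in reader:                     # each row is a list of strings
        rows.append(row)

print(rows)  # [['name', 'age'], ['Alice', '30'], ['Bob', '25']]
```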

With pandas

The pandas library can handle most file types, including CSV files. Let's see how the
pandas library handles an Excel file using the read_excel function.
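Reading a real Excel file needs an extra engine such as openpyxl, so the runnable sketch below uses read_csv on an in-memory string instead; pd.read_excel("file.xlsx") follows the same calling pattern and also returns a DataFrame:

```python
import io
import pandas as pd

# Illustrative in-memory data; for Excel you would call
# pd.read_excel("file.xlsx") in exactly the same way.
csv_text = "name,salary\nAlice,50000\nBob,60000\n"
df = pd.read_csv(io.StringIO(csv_text))

print(df.shape)             # (2, 2)
print(df["salary"].mean())  # 55000.0
```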

With pyodbc

We can also connect to database servers using a module called pyodbc. This helps us
import data from relational sources using an SQL query. Of course, we also have to define
the connection details for the database before passing on the query.
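pyodbc itself needs an ODBC driver and a live database server, so the runnable sketch below uses Python's built-in sqlite3 module as a stand-in; the connect, query, fetch pattern is the same, and a hypothetical pyodbc connection string is shown in a comment:

```python
import sqlite3

# With pyodbc, only the connect call differs, e.g. (hypothetical details):
#   conn = pyodbc.connect("DRIVER={SQL Server};SERVER=host;DATABASE=db;UID=user;PWD=pass")
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (name TEXT, salary REAL)")
conn.execute("INSERT INTO employees VALUES ('Alice', 50000), ('Bob', 60000)")

# Pass an SQL query and fetch the results.
rows = conn.execute("SELECT name, salary FROM employees").fetchall()
print(rows)  # [('Alice', 50000.0), ('Bob', 60000.0)]
conn.close()
```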

EXPORTING DATA

1. Install the Pandas package, if you haven't already done so.
2. Capture the path where your text file is stored.
3. Specify the path where the new CSV file will be saved.
4. Convert the text file to CSV using Python.
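The steps above can be sketched as follows; the file names input.txt and output.csv are illustrative, and the sketch creates a tab-separated input file first so it is self-contained:

```python
import pandas as pd

# Illustrative setup: a small tab-separated text file (Step 2's path).
with open("input.txt", "w") as f:
    f.write("name\tage\nAlice\t30\nBob\t25\n")

# Read the text file and write it back out as CSV (Steps 3 and 4).
df = pd.read_csv("input.txt", sep="\t")
df.to_csv("output.csv", index=False)

print(open("output.csv").read())
```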

Sources of Missing Values

Before we dive into code, it's important to understand the sources of missing data. Here
are some typical reasons why data is missing:
 User forgot to fill in a field.
 Data was lost while transferring manually from a legacy database.
 There was a programming error.
 Users chose not to fill out a field tied to their beliefs about how the results would
be used or interpreted.

Identify the Missing Data Values


Most analytics projects will encounter three possible types of missing data values, depending
on whether there’s a relationship between the missing data and the other data in the dataset:

 Missing completely at random (MCAR): In this case, there may be no pattern as to


why a column’s data is missing. For example, survey data is missing because
someone could not make it to an appointment, or an administrator misplaces the test
results he is supposed to enter into the computer. The reason for the missing values is
unrelated to the data in the dataset.
 Missing at random (MAR): In this scenario, the reason the data is missing in a
column can be explained by the data in other columns. For example, a school student
who scores above the cutoff is typically given a grade. So a missing grade for a
student can be explained by a score below the cutoff in the score column. The reason
for these missing values can be described by data in another column.
 Missing not at random (MNAR): Sometimes, the missing value is related to the
value itself. For example, higher income people may not disclose their incomes. Here,
there is a correlation between the missing values and the actual income. The missing
values are not dependent on other variables in the dataset.

How to Handle Missing Data Values


Data teams can use a number of strategies to handle missing data. On one hand, algorithms
such as random forest and KNN are robust in dealing with missing values.

On the other hand, you may have to deal with missing data on your own. The first common
strategy for dealing with missing data is to delete the rows with missing values. Typically,
any row which has a missing value in any cell gets deleted. However, this often means many
rows will get removed, leading to loss of information and data. Therefore, this method is
typically not used when there are few data samples.
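The deletion strategy can be sketched with pandas dropna; the small DataFrame below is illustrative:

```python
import numpy as np
import pandas as pd

# Illustrative data with missing values in two of the four rows.
df = pd.DataFrame({
    "state": ["TX", "TX", "CA", None],
    "salary": [50000, np.nan, 60000, 55000],
})

# Drop any row that has a missing value in any cell.
cleaned = df.dropna()
print(len(df), "->", len(cleaned))  # 4 -> 2
```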

You can also impute the missing data. This can be based solely on information in the
column that has missing values, or it can be based on other columns present in the dataset.

Finally, you can use classification or regression models to predict missing values.

Let’s look at these three strategies in depth:


1. Missing Values in Numerical Columns
The first approach is to replace the missing value with one of the following strategies:

 Replace it with a constant value. This can be a good approach when used in
discussion with the domain expert for the data we are dealing with.
 Replace it with the mean or median. This is a decent approach when the data size is
small, but it does add bias.
 Replace it with values by using information from other columns.
In the employee dataset subset below, we have salary data missing in three rows. We also
have State and Years of Experience columns in the dataset:

The first approach is to fill the missing values with the mean of the column. Here, we are
solely using the information from the column which has missing values:
With the help of a domain expert, we can do a little better by using information from other
columns in the dataset. The average salary differs from state to state, so we can use that
to fill in the values. For example, calculate the average salary of people working in Texas and
replace the missing data with that average:

What else can we do better? How about making use of the Years of Experience column as
well? Calculate the average entry-level salary of people working in Texas and replace the row
where the salary is missing for an entry-level person in Texas. Do the same for the mid-level
and high-level salaries:
Note that there are some boundary conditions. For example, there might be a row that has
missing values in both the Salary and Years of Experience columns. There are multiple ways
to handle this, but the most straightforward is to replace the missing value with the average
salary in Texas.
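The column-mean and other-column approaches above can be sketched with pandas; the state and salary values below are illustrative:

```python
import numpy as np
import pandas as pd

# Illustrative employee data with two missing salaries.
df = pd.DataFrame({
    "state": ["TX", "TX", "CA", "CA", "TX"],
    "salary": [50000, np.nan, 70000, np.nan, 60000],
})

# Approach 1: fill with the overall column mean.
overall = df["salary"].fillna(df["salary"].mean())

# Approach 2: fill with the per-state mean, using the State column's information.
by_state = df["salary"].fillna(
    df.groupby("state")["salary"].transform("mean")
)

print(by_state.tolist())  # TX gap -> 55000.0, CA gap -> 70000.0
```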

2. Predicting Missing Values Using an Algorithm


Another way to predict missing values is to create a simple regression model. The column to
predict here is the Salary, using other columns in the dataset. If there are missing values in
the input columns, we must handle those conditions when creating the predictive model. A
simple way to manage this is to choose only the features that do not have missing values, or
take the rows that do not have missing values in any of the cells.
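A minimal sketch of this idea, assuming an illustrative dataset in which Years of Experience has no missing values:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Illustrative data: one row is missing its salary.
df = pd.DataFrame({
    "years_experience": [1, 3, 5, 7, 4],
    "salary": [40000, 50000, 60000, 70000, np.nan],
})

# Train only on rows where the salary is known, using a feature
# column that has no missing values.
known = df[df["salary"].notna()]
model = LinearRegression()
model.fit(known[["years_experience"]], known["salary"])

# Predict the missing salary and fill it in.
missing = df["salary"].isna()
df.loc[missing, "salary"] = model.predict(df.loc[missing, ["years_experience"]])
print(df["salary"].tolist())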

3. Missing Values in Categorical Columns


Dealing with missing data values in categorical columns is a lot easier than in numerical
columns. Simply replace the missing value with a constant value or the most popular
category. This is a good approach when the data size is small, though it does add bias.

For example, say we have a column for Education with two possible values: High School and
College. If there are more people with a college degree in the dataset, we can replace the
missing value with College:
We can tweak this more by making use of information in the other columns. For example, if
there are more people from Texas with High School in the dataset, replace the missing values
in rows for people from Texas with High School.

One can also create a classification model. The column to predict here is Education, using
other columns in the dataset. But the most common and popular approach is to model the
missing value in a categorical column as a new category called Unknown:
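Both the most-popular-category and Unknown-category approaches can be sketched with pandas fillna; the Education values below are illustrative:

```python
import pandas as pd

# Illustrative categorical data with one missing value.
df = pd.DataFrame({
    "education": ["College", "College", "High School", None, "College"],
})

# Approach 1: fill with the most popular category (the mode).
most_common = df["education"].fillna(df["education"].mode()[0])
print(most_common.tolist())  # the gap becomes "College"

# Approach 2: model the missing value as a new "Unknown" category.
unknown = df["education"].fillna("Unknown")
print(unknown.tolist())
```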

In summary, you’ll use different approaches to handle missing data values while data
cleaning depending on the type of data and the problem at hand. If you have access to a
domain expert, always incorporate their expert advice when filling in the missing values.
