FDS Unit 2
Creating a dataset for regression, classification, and clustering problems using Python:
To create a dataset for a classification problem in Python, we use
the make_classification function available in the scikit-learn library.
By default, make_classification returns ndarrays which correspond to the
variables/features and the target/output. To generate a classification dataset, the function
takes the following parameters (a minimal sketch follows the list below):
n_samples: the number of samples/rows.
n_features: the number of features/columns.
n_informative: the number of features that have a role in the prediction of the
output.
n_redundant: the number of redundant features, generated as random linear combinations of the informative features.
n_classes: the number of classes/labels for the classification problem.
weights: the proportion of samples assigned to each output/class; passing None gives balanced classes.
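A minimal sketch of this, assuming arbitrary illustrative values for the parameters and the variable names X and y:

# Generate a toy classification dataset with scikit-learn.
from sklearn.datasets import make_classification

X, y = make_classification(
    n_samples=1000,     # number of samples/rows
    n_features=10,      # number of features/columns
    n_informative=5,    # features that actually drive the class label
    n_redundant=2,      # linear combinations of the informative features
    n_classes=2,        # binary classification problem
    weights=None,       # None gives balanced classes
    random_state=42,    # fixed seed for reproducibility
)
print(X.shape, y.shape)  # (1000, 10) (1000,)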
How to create a dataset for regression problems with Python?
Similarly to make_classification, the make_regression function returns by default ndarrays
which correspond to the variables/features and the target/output. To generate a regression
dataset, the function takes the following parameters (a minimal sketch follows the list below):
n_samples: the number of samples/rows
n_features: the number of features/columns
n_informative: the number of informative variables
n_targets: the number of regression targets/outputs, so a value of 2 means each
sample will have 2 outputs.
noise: the standard deviation of the Gaussian noise on the output.
shuffle: mix the samples and the features.
coef: Return or not the coefficients of the underlying linear model.
random_state: the seed for the random number generator, so the same dataset can be
reproduced in case of reuse.
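A minimal sketch, again with arbitrary illustrative values:

# Generate a toy regression dataset with scikit-learn.
from sklearn.datasets import make_regression

X, y, coef = make_regression(
    n_samples=500,      # number of samples/rows
    n_features=4,       # number of features/columns
    n_informative=3,    # number of informative variables
    n_targets=1,        # one regression output per sample
    noise=0.5,          # std. dev. of the Gaussian noise on the output
    shuffle=True,       # mix the samples and the features
    coef=True,          # also return the underlying linear coefficients
    random_state=42,    # fixed seed for reproducibility
)
print(X.shape, y.shape, coef.shape)  # (500, 4) (500,) (4,)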
DATA IMPORTING
Python has various modules which help us import external data in various file
formats into a Python program. The examples below show how to import data of various
formats into a Python program.
With the csv module
The csv module enables us to read each row in the file using a comma as the delimiter.
We first open the file in read-only mode and then assign the delimiter. Finally, we use a for loop
to read each row from the CSV file.
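A minimal sketch, where the file name employees.csv is an illustrative assumption:

# Read a CSV file row by row with the csv module.
import csv

with open('employees.csv', 'r') as f:
    reader = csv.reader(f, delimiter=',')  # comma as the delimiter
    for row in reader:                     # read each row from the file
        print(row)                         # each row is a list of strings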
With pandas
The pandas library can handle most file types, including CSV files. In this
program, let us see how the pandas library handles an Excel file using the read_excel function.
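A minimal sketch, where the file name employees.xlsx and the sheet name are illustrative assumptions (read_excel also needs an Excel engine such as openpyxl installed):

# Read an Excel file into a DataFrame with pandas.
import pandas as pd

df = pd.read_excel('employees.xlsx', sheet_name='Sheet1')
print(df.head())  # inspect the first few rows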
With pyodbc
We can also connect to database servers using a module called pyodbc. This helps us
import data from relational sources using a SQL query. Of course, we also have to define the
connection details to the database before passing on the query.
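A minimal sketch, where the driver, server, database, credentials, and table name are all illustrative assumptions to be replaced with real connection details:

# Import data from a relational database with pyodbc and a SQL query.
import pyodbc
import pandas as pd

conn = pyodbc.connect(
    'DRIVER={ODBC Driver 17 for SQL Server};'
    'SERVER=localhost;DATABASE=company;UID=user;PWD=password'
)
df = pd.read_sql('SELECT * FROM employees', conn)  # run the query and load the result
conn.close()
print(df.head())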
EXPORTING DATA
Step 1: Install the Pandas package, if you haven't already done so.
Step 2: Capture the path where your text file is stored.
Step 3: Specify the path where the new CSV file will be saved.
Step 4: Convert the text file to CSV using Python (a minimal sketch follows below).
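A minimal sketch of Step 4, where the file paths and the tab delimiter are illustrative assumptions:

# Convert a text file to CSV with pandas.
import pandas as pd

df = pd.read_csv(r'C:\data\products.txt', delimiter='\t')  # path where the text file is stored
df.to_csv(r'C:\data\products.csv', index=False)            # path where the new CSV file is saved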
Before we dive into code, it's important to understand the sources of missing data. Here are
some typical reasons why data is missing:
User forgot to fill in a field.
Data was lost while transferring manually from a legacy database.
There was a programming error.
Users chose not to fill out a field tied to their beliefs about how the results would
be used or interpreted.
Whatever the cause, you will often have to deal with missing data on your own. The first common
strategy for dealing with missing data is to delete the rows with missing values. Typically,
any row which has a missing value in any cell gets deleted. However, this often means many
rows get removed, leading to a loss of information. Therefore, this method is
typically not used when there are only a few data samples.
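A minimal sketch of this strategy, using an illustrative DataFrame rather than a real dataset:

# Delete rows that contain missing values.
import pandas as pd
import numpy as np

df = pd.DataFrame({
    'State': ['Texas', 'Texas', 'California'],
    'Salary': [55000, np.nan, 72000],
})
cleaned = df.dropna()  # any row with a missing value in any cell is removed
print(cleaned)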
You can also impute the missing data. This can be based solely on information in the
column that has missing values, or it can be based on other columns present in the dataset.
Finally, you can use classification or regression models to predict missing values.
Replace it with a constant value. This can be a good approach when the value is chosen in
consultation with a domain expert for the data we are dealing with.
Replace it with the mean or median. This is a decent approach when the data size is
small—but it does add bias.
Replace it with values by using information from other columns.
In the employee dataset subset below, we have salary data missing in three rows. We also
have State and Years of Experience columns in the dataset:
The first approach is to fill the missing values with the mean of the column. Here, we are
solely using the information from the column which has missing values:
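A minimal sketch, where the small DataFrame stands in for the employee dataset:

# Fill missing salaries with the mean of the Salary column.
import pandas as pd
import numpy as np

df = pd.DataFrame({
    'State': ['Texas', 'Texas', 'California', 'California'],
    'Years of Experience': [1, 8, 2, 9],
    'Salary': [50000, np.nan, 70000, np.nan],
})
df['Salary'] = df['Salary'].fillna(df['Salary'].mean())  # column-wide average
print(df)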
With the help of a domain expert, we can do a little better by using information from other
columns in the dataset. The average salary differs from state to state, so we can use that
to fill in the values. For example, calculate the average salary of people working in Texas and
replace the missing data with the average salary of people who work in Texas:
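A minimal sketch, again with an illustrative DataFrame:

# Fill missing salaries with the average salary of the same State.
import pandas as pd
import numpy as np

df = pd.DataFrame({
    'State': ['Texas', 'Texas', 'California', 'California'],
    'Salary': [50000, np.nan, 70000, np.nan],
})
state_mean = df.groupby('State')['Salary'].transform('mean')  # per-state average
df['Salary'] = df['Salary'].fillna(state_mean)
print(df)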
What else can we do better? How about making use of the Years of Experience column as
well? Calculate the average entry-level salary of people working in Texas and replace the row
where the salary is missing for an entry-level person in Texas. Do the same for the mid-level
and high-level salaries:
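A minimal sketch, where the Level column is an assumption standing in for Years of Experience bucketed into entry/mid/high levels:

# Fill missing salaries with the average salary for the same State and level.
import pandas as pd
import numpy as np

df = pd.DataFrame({
    'State': ['Texas', 'Texas', 'Texas', 'California'],
    'Level': ['Entry', 'Entry', 'Mid', 'Entry'],
    'Salary': [50000, np.nan, 75000, 62000],
})
group_mean = df.groupby(['State', 'Level'])['Salary'].transform('mean')
df['Salary'] = df['Salary'].fillna(group_mean)
print(df)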
Note that there are some boundary conditions. For example, there might be a row that has
missing values in both the Salary and Years of Experience columns. There are multiple ways
to handle this, but the most straightforward is to replace the missing value with the average
salary in Texas.
For example, say we have a column for Education with two possible values: High School and
College. If there are more people with a college degree in the dataset, we can replace the
missing value with College.
We can tweak this further by making use of information in the other columns. For example, if
more people from Texas have a High School education in the dataset, replace the missing values
in rows for people from Texas with High School; both variants are sketched below:
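A minimal sketch of both variants, with an illustrative DataFrame:

# Fill a missing categorical value with the most frequent value,
# overall and within each State group.
import pandas as pd
import numpy as np

df = pd.DataFrame({
    'State': ['Texas', 'Texas', 'California', 'California'],
    'Education': ['High School', np.nan, 'College', 'College'],
})
# Overall most frequent value in the Education column.
overall_fill = df['Education'].fillna(df['Education'].mode()[0])
# Most frequent value within each State group.
state_mode = df.groupby('State')['Education'].transform(lambda s: s.mode()[0])
by_state_fill = df['Education'].fillna(state_mode)
print(overall_fill)
print(by_state_fill)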
One can also build a classification model, where the column to predict is Education and the
other columns in the dataset are the inputs. But the most common and popular approach is to model the
missing value in a categorical column as a new category called Unknown:
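A minimal sketch, with an illustrative DataFrame:

# Treat missing values in a categorical column as a new category, Unknown.
import pandas as pd
import numpy as np

df = pd.DataFrame({'Education': ['High School', 'College', np.nan]})
df['Education'] = df['Education'].fillna('Unknown')
print(df)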
In summary, you'll use different approaches to handle missing values during data
cleaning, depending on the type of data and the problem at hand. If you have access to a
domain expert, always incorporate their advice when filling in the missing values.