Data Science Coding Challenge

Uploaded by

JS NDONG NDONG
Copyright
© All Rights Reserved

Welcome to the Data Science Coding Challenge!
Test your skills in a real-world coding challenge. Coding Challenges are CS & DS coding
competitions with prizes and achievement badges!

CS & DS learners want to be challenged as a way to evaluate if they’re job ready. So, why not create
fun challenges and give winners something truly valuable such as complimentary access to select
Data Science courses, or the ability to receive an achievement badge on their Coursera Skills Profile
- highlighting their performance to recruiters.

Introduction
In this challenge, you'll get the opportunity to tackle one of the most industry-relevant machine
learning problems with a unique dataset that will put your modeling skills to the test. Financial loan
services are leveraged by companies across many industries, from big banks to financial institutions
to government loans. One of the primary objectives of companies with financial loan services is to
decrease payment defaults and ensure that individuals are paying back their loans as expected. In
order to do this efficiently and systematically, many companies employ machine learning to predict
which individuals are at the highest risk of defaulting on their loans, so that proper interventions can
be effectively deployed to the right audience.

In this challenge, we will tackle the loan default prediction problem for a unique and
interesting group of individuals who have taken financial loans.

Imagine that you are a new data scientist at a major financial institution and you are tasked with
building a model that can predict which individuals will default on their loan payments. We have
provided a dataset that is a sample of individuals who received loans in 2021.

This financial institution has a vested interest in understanding the likelihood of each individual to
default on their loan payments so that resources can be allocated appropriately to support these
borrowers. In this challenge, you will use your machine learning toolkit to do just that!

Understanding the Datasets


Train vs. Test
In this competition, you’ll gain access to two datasets that are samples of past borrowers of a
financial institution that contain information about the individual and the specific loan. One dataset is
titled train.csv and the other is titled test.csv.

train.csv contains 70% of the overall sample (255,347 borrowers to be exact) and importantly, will
reveal whether or not the borrower has defaulted on their loan payments (the “ground truth”).

The test.csv dataset contains the exact same information about the remaining segment of the
overall sample (109,435 borrowers to be exact), but does not disclose the “ground truth” for each
borrower. It’s your job to predict this outcome!

Using the patterns you find in the train.csv data, predict whether the borrowers in test.csv will
default on their loan payments, or not.
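
As a minimal sketch of this setup, the two files could be loaded with pandas as shown below. The helper name load_datasets and the default file paths are assumptions for illustration; the dataframe names train_df and test_df follow the notebook's later reference to test_df.

```python
import pandas as pd


def load_datasets(train_path="train.csv", test_path="test.csv"):
    """Load the labelled training data and the unlabelled test data."""
    train_df = pd.read_csv(train_path)
    test_df = pd.read_csv(test_path)

    # Only train.csv is expected to carry the ground-truth label.
    if "Default" not in train_df.columns:
        raise ValueError("train data is missing the 'Default' target column")
    if "Default" in test_df.columns:
        raise ValueError("test data should not contain the 'Default' column")

    # Apart from the target, both files should share the same columns.
    feature_cols = [c for c in train_df.columns if c != "Default"]
    if set(feature_cols) != set(test_df.columns):
        raise ValueError("train and test columns do not match")

    return train_df, test_df
```

Calling train_df, test_df = load_datasets() and inspecting train_df.shape and test_df.shape should then show the 255,347 and 109,435 rows described above.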
Dataset descriptions
Both train.csv and test.csv contain one row for each unique loan. Each loan is represented by a
single observation, identified by LoanID, taken while the loan was active.

In addition to this identifier column, the train.csv dataset also contains the target label for the
task, a binary column Default which indicates if a borrower has defaulted on payments.

Besides that column, both datasets have an identical set of features that can be used to train your
model to make predictions. Below you can see descriptions of each feature. Familiarize yourself with
them so that you can harness them most effectively for this machine learning task!

How to Submit your Predictions to Coursera


Submission Format:

In this notebook you should follow the steps below to explore the data, train a model using the data
in train.csv, and then score your model using the data in test.csv. Your final submission
should be a dataframe (call it prediction_df) with two columns and exactly 109,435 rows (plus a
header row). The first column should be LoanID so that we know which prediction belongs to which
observation. The second column should be called predicted_probability and should be a
numeric column representing the likelihood that the borrower will default.

Your submission will show an error if you have extra columns (beyond LoanID and
predicted_probability) or extra rows. The order of the rows does not matter.

The names of the dataframe and its columns are critical for our autograding, so please
make sure to use the exact names: prediction_df for the dataframe, with column
names LoanID and predicted_probability!
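
The format rules above can be checked locally before submitting. The sketch below is not the autograder, whose exact checks are not described here; the helper name check_submission is hypothetical, and the duplicate-ID and 0-to-1 range checks are assumptions based on the one-row-per-loan and probability descriptions in this notebook.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype

EXPECTED_ROWS = 109_435  # number of borrowers in test.csv, per the description
EXPECTED_COLUMNS = ["LoanID", "predicted_probability"]


def check_submission(prediction_df, expected_rows=EXPECTED_ROWS):
    """Return a list of format problems; an empty list means none were found."""
    problems = []

    if list(prediction_df.columns) != EXPECTED_COLUMNS:
        problems.append(
            f"columns are {list(prediction_df.columns)}, expected {EXPECTED_COLUMNS}"
        )
        return problems  # the remaining checks need both columns

    if len(prediction_df) != expected_rows:
        problems.append(f"found {len(prediction_df)} rows, expected {expected_rows}")

    # Assumption: each loan should appear exactly once.
    if prediction_df["LoanID"].duplicated().any():
        problems.append("LoanID contains duplicates")

    probs = prediction_df["predicted_probability"]
    if not is_numeric_dtype(probs):
        problems.append("predicted_probability is not numeric")
    elif probs.isna().any():
        problems.append("predicted_probability contains missing values")
    elif not probs.between(0, 1).all():
        problems.append("predicted_probability has values outside [0, 1]")

    return problems
```

Running check_submission(prediction_df) just before submitting and confirming that it returns an empty list catches the most common formatting mistakes early.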

To determine your final score, we will compare your predicted_probability predictions to the
source of truth labels for the observations in test.csv and calculate the ROC AUC. We chose this
metric because we want not only to predict which loans will default, but also a likelihood score
that ranks borrowers by risk, so that interventions and support can be targeted accurately.
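
To make the metric concrete, here is a small pure-Python sketch of what ROC AUC measures: the fraction of (defaulter, non-defaulter) pairs in which the defaulter receives the higher score. The function name roc_auc and the toy labels and scores are illustrative only. This pairwise loop is quadratic and meant for intuition; on real data a library routine such as sklearn.metrics.roc_auc_score computes the same quantity efficiently.

```python
def roc_auc(y_true, y_score):
    """ROC AUC as the share of (positive, negative) pairs ranked correctly.

    Tied scores count as half a correctly ranked pair.
    """
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    if not pos or not neg:
        raise ValueError("ROC AUC needs at least one example of each class")

    correct = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                correct += 1.0
            elif p == n:
                correct += 0.5
    return correct / (len(pos) * len(neg))


labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
auc = roc_auc(labels, scores)  # 3 of the 4 pairs are ordered correctly -> 0.75

# Dividing every score by 10 keeps the ranking, so the AUC does not change.
auc_rescaled = roc_auc(labels, [s / 10 for s in scores])
```

Because only the ordering of the scores matters, ROC AUC rewards models that rank risky borrowers above safe ones, whatever the absolute scale of the predicted probabilities.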

Import Python Modules


First, import the primary modules that will be used in this project. Remember, as this is an
open-ended project, please feel free to make use of any of your favorite libraries that you feel may be
useful for this challenge. For example, some of the following popular packages may be useful:

• pandas
• NumPy
• SciPy
• scikit-learn
• Keras
• matplotlib
• seaborn
• etc.
Explore, Clean, Validate, and Visualize the Data (optional)
Feel free to explore, clean, validate, and visualize the data however you see fit for this competition to
help determine or optimize your predictive model. Please note: the final autograding will score only
the predictions in prediction_df (using ROC AUC, as described above).
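
One possible starting point for this optional step is a quick data-quality summary. The sketch below assumes a labelled dataframe with LoanID and Default columns as described earlier; the helper name quick_summary and the Income column in the toy data are hypothetical stand-ins, not names taken from the real dataset.

```python
import pandas as pd


def quick_summary(df, target="Default"):
    """Collect a few basic data-quality facts about a labelled dataframe."""
    return {
        "n_rows": len(df),
        "n_duplicate_ids": int(df["LoanID"].duplicated().sum()),
        "missing_per_column": df.isna().sum().to_dict(),
        "default_rate": float(df[target].mean()),
    }


# Tiny stand-in for train_df; 'Income' is a hypothetical feature name.
toy_train = pd.DataFrame(
    {
        "LoanID": ["A1", "B2", "C3", "D4"],
        "Income": [52000.0, None, 71000.0, 38000.0],
        "Default": [0, 1, 0, 0],
    }
)
summary = quick_summary(toy_train)
```

The default rate is worth checking early: if defaults are rare, a model can look accurate while ranking borrowers poorly, which is the behavior ROC AUC is meant to expose.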

Make predictions (required)


Remember, you should create a dataframe named prediction_df with exactly 109,435 entries
plus a header row, predicting the likelihood that each borrower in test_df will default on
their loan. Your submission will throw an error if you have extra columns
(beyond LoanID and predicted_probability) or extra rows.

The file should have exactly 2 columns: LoanID (sorted in any order) and
predicted_probability (your numeric predicted probabilities between 0 and 1,
e.g. from estimator.predict_proba(X)[:, 1]).

The names of the dataframe and its columns are critical for our autograding, so please
make sure to use the exact names: prediction_df for the dataframe, with column
names LoanID and predicted_probability!
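
As a sketch of how these requirements fit together, the example below trains a scikit-learn LogisticRegression on synthetic data and builds prediction_df from predict_proba. The model choice, the synthetic data, and the feature names feature_a and feature_b are assumptions made for illustration; substitute the real columns of train_df and test_df and your own model.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Synthetic stand-ins for train_df / test_df; the feature names are hypothetical.
train_df = pd.DataFrame(
    {
        "LoanID": [f"TR{i}" for i in range(200)],
        "feature_a": rng.normal(size=200),
        "feature_b": rng.normal(size=200),
    }
)
noise = rng.normal(scale=0.5, size=200)
train_df["Default"] = (train_df["feature_a"] + noise > 1).astype(int)

test_df = pd.DataFrame(
    {
        "LoanID": [f"TE{i}" for i in range(50)],
        "feature_a": rng.normal(size=50),
        "feature_b": rng.normal(size=50),
    }
)

features = ["feature_a", "feature_b"]
model = LogisticRegression().fit(train_df[features], train_df["Default"])

# predict_proba returns one column per class, ordered as in model.classes_,
# so look up the column that belongs to the positive class (Default == 1).
positive_col = list(model.classes_).index(1)
prediction_df = pd.DataFrame(
    {
        "LoanID": test_df["LoanID"],
        "predicted_probability": model.predict_proba(test_df[features])[:, positive_col],
    }
)
```

Looking up the positive class through model.classes_ is slightly more defensive than hard-coding [:, 1]; for a 0/1 target the two are equivalent.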

Example prediction submission:


The code below is a very naive prediction method that simply predicts loan defaults using a Dummy
Classifier. This is used as just an example showing the submission format required. Please
change/alter/delete this code below and create your own improved prediction methods for
generating prediction_df.
PLEASE CHANGE CODE BELOW TO IMPLEMENT YOUR OWN PREDICTIONS
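
A naive baseline of this kind might look like the sketch below, which uses scikit-learn's DummyClassifier. The "prior" strategy, the synthetic data, and the feature name feature_a are assumptions for illustration and may differ from the notebook's own example cell.

```python
import numpy as np
import pandas as pd
from sklearn.dummy import DummyClassifier

# Synthetic stand-ins for train_df / test_df; 'feature_a' is a hypothetical feature.
train_df = pd.DataFrame(
    {
        "LoanID": [f"TR{i}" for i in range(10)],
        "feature_a": np.arange(10, dtype=float),
        "Default": [0] * 8 + [1] * 2,
    }
)
test_df = pd.DataFrame(
    {
        "LoanID": [f"TE{i}" for i in range(4)],
        "feature_a": np.arange(4, dtype=float),
    }
)

# With strategy="prior" the classifier ignores the features and predicts the
# training default rate (here 2 out of 10) for every borrower.
baseline = DummyClassifier(strategy="prior")
baseline.fit(train_df[["feature_a"]], train_df["Default"])

prediction_df = pd.DataFrame(
    {
        "LoanID": test_df["LoanID"],
        "predicted_probability": baseline.predict_proba(test_df[["feature_a"]])[:, 1],
    }
)
```

A constant prediction like this cannot rank borrowers, so it scores a ROC AUC of 0.5; it only demonstrates the required shape of prediction_df and gives a floor that any real model should beat.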

SUBMIT YOUR WORK!


Once you are happy with your prediction_df and prediction_submission.csv, you can
submit for autograding! Submit by using the blue Submit Assignment button at the top of your notebook.
Don't worry if your initial submission isn't perfect: you have multiple submission attempts and will
receive some feedback after each submission!
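
If you also want the prediction_submission.csv file mentioned above, one way to write it with pandas is sketched below. Whether and where the notebook expects this file is not specified here, so treat the path as an assumption; the example writes to a temporary directory and uses made-up rows.

```python
import os
import tempfile

import pandas as pd

# Made-up rows standing in for the real prediction_df.
prediction_df = pd.DataFrame(
    {
        "LoanID": ["TE0", "TE1", "TE2"],
        "predicted_probability": [0.12, 0.87, 0.40],
    }
)

# Written to a temporary directory here; in the notebook you would normally
# write to the working directory instead.
out_path = os.path.join(tempfile.mkdtemp(), "prediction_submission.csv")

# index=False keeps the pandas row index out of the file, so the CSV has
# exactly the two required columns plus the header row.
prediction_df.to_csv(out_path, index=False)

reloaded = pd.read_csv(out_path)
```

Reading the file back, as in the last line, is a cheap way to confirm that no extra index column slipped in.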
