Data Science Coding Challenge
Test your skills in a real-world coding challenge. Coding Challenges provide CS & DS coding
competitions with prizes and achievement badges!
CS & DS learners want to be challenged as a way to evaluate whether they're job-ready. So why not
create fun challenges and give winners something truly valuable, such as complimentary access to
select Data Science courses, or an achievement badge on their Coursera Skills Profile
highlighting their performance to recruiters?
Introduction
In this challenge, you'll get the opportunity to tackle one of the most industry-relevant machine
learning problems with a unique dataset that will put your modeling skills to the test. Financial loan
services are leveraged by organizations across many industries, from big banks and financial
institutions to government lending programs. One of the primary objectives of companies that offer loans is to
decrease payment defaults and ensure that individuals are paying back their loans as expected. In
order to do this efficiently and systematically, many companies employ machine learning to predict
which individuals are at the highest risk of defaulting on their loans, so that proper interventions can
be effectively deployed to the right audience.
In this challenge, we will be tackling the loan default prediction problem for a unique and
interesting group of individuals who have taken out financial loans.
Imagine that you are a new data scientist at a major financial institution and you are tasked with
building a model that can predict which individuals will default on their loan payments. We have
provided a dataset that is a sample of individuals who received loans in 2021.
This financial institution has a vested interest in understanding the likelihood of each individual to
default on their loan payments so that resources can be allocated appropriately to support these
borrowers. In this challenge, you will use your machine learning toolkit to do just that!
train.csv contains 70% of the overall sample (255,347 borrowers, to be exact) and, importantly,
reveals whether or not each borrower has defaulted on their loan payments (the “ground truth”).
The test.csv dataset contains exactly the same information about the remaining segment of the
overall sample (109,435 borrowers, to be exact), but does not disclose the “ground truth” for each
borrower. It’s your job to predict this outcome!
Using the patterns you find in the train.csv data, predict whether the borrowers in test.csv will
default on their loan payments, or not.
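A minimal end-to-end sketch of this workflow, shown here on a tiny synthetic frame in place of train.csv and test.csv (the feature names Income and LoanAmount are hypothetical stand-ins; the real dataset's features are described below):

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Synthetic stand-ins for pd.read_csv("train.csv") / pd.read_csv("test.csv").
train = pd.DataFrame({
    "LoanID": ["a", "b", "c", "d", "e", "f"],
    "Income": [30_000, 85_000, 42_000, 120_000, 28_000, 95_000],
    "LoanAmount": [20_000, 10_000, 35_000, 15_000, 25_000, 12_000],
    "Default": [1, 0, 1, 0, 1, 0],
})
test = pd.DataFrame({
    "LoanID": ["g", "h"],
    "Income": [31_000, 110_000],
    "LoanAmount": [22_000, 14_000],
})

# Fit a simple baseline classifier on the labeled training rows.
features = ["Income", "LoanAmount"]
model = LogisticRegression(max_iter=1000)
model.fit(train[features], train["Default"])

# Probability of the positive class (Default == 1) for each test borrower.
probs = model.predict_proba(test[features])[:, 1]
```

Any classifier that outputs probabilities can slot into the same pattern; logistic regression is just a convenient starting point.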
Dataset descriptions
Both train.csv and test.csv contain one row for each unique loan: a single observation,
identified by its LoanID, covering the period during which the loan was active.
In addition to this identifier column, the train.csv dataset also contains the target label for the
task, a binary column Default which indicates if a borrower has defaulted on payments.
Besides that column, both datasets have an identical set of features that can be used to train your
model to make predictions. Below you can see descriptions of each feature. Familiarize yourself with
them so that you can harness them most effectively for this machine learning task!
In this notebook you should follow the steps below to explore the data, train a model using the data
in train.csv, and then score your model using the data in test.csv. Your final submission
should be a dataframe (call it prediction_df) with two columns and exactly 109,435 rows (plus a
header row). The first column should be LoanID so that we know which prediction belongs to which
observation. The second column should be called predicted_probability and should be a
numeric column representing the likelihood that the borrower will default.
The naming convention of the dataframe and columns are critical for our autograding, so please
make sure to use the exact naming conventions of prediction_df with column
names LoanID and predicted_probability!
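Assuming you already have predicted probabilities for the test borrowers (the IDs and scores below are placeholders), the submission frame can be assembled like this:

```python
import pandas as pd
import numpy as np

# Placeholder values; in the real notebook these come from test.csv's LoanID
# column and your fitted model's predict_proba output.
loan_ids = ["g", "h", "i"]
probs = np.array([0.82, 0.07, 0.41])

# The dataframe and column names must match the autograder's expectations exactly.
prediction_df = pd.DataFrame({
    "LoanID": loan_ids,
    "predicted_probability": probs,
})
```

A quick sanity check before submitting: `list(prediction_df.columns) == ["LoanID", "predicted_probability"]` and `len(prediction_df) == 109435`.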
To determine your final score, we will compare your predicted_probability predictions to the
source of truth labels for the observations in test.csv and calculate the ROC AUC. We choose this
metric because we not only want to be able to predict which loans will default, but also want a well-
calibrated likelihood score that can be used to target interventions and support most accurately.
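For reference, ROC AUC can be computed locally with scikit-learn. On this toy example, three of the four positive/negative pairs are ranked correctly, giving an AUC of 0.75:

```python
from sklearn.metrics import roc_auc_score

y_true = [0, 0, 1, 1]            # ground-truth default labels
y_score = [0.1, 0.4, 0.35, 0.8]  # predicted probabilities of default

# ROC AUC: the fraction of (positive, negative) pairs where the positive
# example receives the higher score.
auc = roc_auc_score(y_true, y_score)
print(auc)  # 0.75
```

You can use the same function on a held-out split of train.csv to estimate your score before submitting.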
pandas
NumPy
SciPy
scikit-learn
Keras
matplotlib
seaborn
etc.
Explore, Clean, Validate, and Visualize the Data (optional)
Feel free to explore, clean, validate, and visualize the data however you see fit for this competition to
help determine or optimize your predictive model. Please note: the final autograding will be based
only on your prediction_df predictions.
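A few quick checks worth running before modeling, sketched here on a synthetic frame (swap it for the real train.csv; the Income column is a hypothetical stand-in):

```python
import pandas as pd
import numpy as np

# Synthetic stand-in for pd.read_csv("train.csv").
train = pd.DataFrame({
    "LoanID": ["a", "b", "c", "d"],
    "Income": [30_000, np.nan, 42_000, 120_000],
    "Default": [1, 0, 0, 0],
})

# Class balance: default-prediction problems are typically imbalanced, which
# affects both modeling choices and how you interpret your metrics.
default_rate = train["Default"].mean()

# Missing values per column, to decide on imputation or dropping.
missing = train.isna().sum()
print(default_rate, missing["Income"])
```

On the real data, `train["Default"].value_counts(normalize=True)` and `train.describe()` are similarly cheap first looks.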