Predict Home Credit Defaults

October 17, 2019 · data-science

Overview

Many people struggle to get loans due to insufficient or non-existent credit histories. Unfortunately, untrustworthy lenders often take advantage of this population.

Tonight's project examines a dataset from a real bank that focuses on lending to people with little or no credit history. Their goal is to ensure that clients capable of repayment are not rejected. You will explore the dataset and predict whether someone will default, based on their loan application.

Your Task

Your goal is to train a binary classification model on the data in default_risk_train_data.csv that optimizes the area under the ROC curve between the predicted probability and the observed target. For each SK_ID_CURR in default_risk_test_data.csv, you must predict a probability for the TARGET variable. Your deliverable to the bank will be a CSV with predictions for each SK_ID_CURR in the test set.
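
To make the metric concrete, here is a minimal sketch of what area under the ROC curve measures: the chance that a randomly chosen defaulter receives a higher predicted probability than a randomly chosen non-defaulter, with ties counted as half. The function name pairwise_auc and the toy values are illustrative and not part of the challenge; in practice, sklearn's roc_auc_score computes the same quantity far more efficiently.

```python
def pairwise_auc(y_true, y_score):
    """Area under the ROC curve via pairwise comparison (O(n^2), for illustration only)."""
    pos = [s for t, s in zip(y_true, y_score) if t == 1]
    neg = [s for t, s in zip(y_true, y_score) if t == 0]
    if not pos or not neg:
        raise ValueError("AUC is undefined unless both classes are present")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # positive correctly ranked above negative
            elif p == n:
                wins += 0.5   # ties count as half
    return wins / (len(pos) * len(neg))


# Toy example with made-up values: 1 = defaulted, 0 = repaid
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]
auc = pairwise_auc(y_true, y_score)  # 3 of the 4 positive/negative pairs are ordered correctly
```

Because only the ranking of the scores matters, a model can score well on this metric even if its probabilities are not well calibrated.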

Setup

For this challenge you will need Python 3.7, pipenv, and git installed. If you're not familiar with pipenv, it's a packaging tool for Python that effectively replaces the pip+virtualenv+requirements.txt workflow. If you already have pip installed, the easiest way to install pipenv is with pip install --user pipenv; Mac/Linux Homebrew users can instead run brew install pipenv. More options can be found in the pipenv installation documentation.

1. The project is in the ChiPy project night repo. If you do not have the repository already, run:

   git clone https://github.com/chicagopython/CodingWorkshops.git

2. Navigate to the folder for this challenge:

   cd CodingWorkshops/problems/data_science/home_credit_default_risk

3. Run pipenv install, which will install all of the libraries we have recommended for this exercise.
4. After you've installed all of the libraries, run pipenv shell, which will turn on a virtual environment running Python 3.7.
5. From within the shell, run jupyter lab default_risk.ipynb to launch the pre-started notebook.
6. To exit the pipenv shell when you are done, simply type exit.

What's in this repository?

There are three data files, one metadata file, and a Jupyter notebook.

default_risk_train_data.csv -- The data you will use to train your models. Includes all potential features and the target.
default_risk_test_data.csv -- The data you will use to test your models. Includes all potential features, but NOT the target (which theoretically reflects unknown future default status).
perfect_deliverable.csv -- The CSV with perfect predictions for each SK_ID_CURR in the test set. You should only use this at the very end to test the model and NEVER factor it into training your model. To avoid overfitting to the test set, score your models against it sparingly. Your final deliverable to the bank should use this same format.
default_risk_column_descriptions.csv -- Descriptive metadata for the columns found in the train and test datasets.
default_risk.ipynb -- The Jupyter notebook where all coding should be completed, unless you opt to work in a different environment.
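
As a rough sketch of the deliverable, the snippet below writes one predicted probability per SK_ID_CURR using only the standard library. The two-column header (SK_ID_CURR, TARGET) is an assumption based on the task description, so check it against the header of perfect_deliverable.csv. The helper name write_deliverable, the IDs, and the probabilities are made up for illustration.

```python
import csv
import io


def write_deliverable(ids, probabilities, out):
    """Write one (SK_ID_CURR, TARGET) row per test-set applicant to a file-like object."""
    if len(ids) != len(probabilities):
        raise ValueError("need exactly one probability per SK_ID_CURR")
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["SK_ID_CURR", "TARGET"])  # assumed header; verify against perfect_deliverable.csv
    for sk_id, prob in zip(ids, probabilities):
        if not 0.0 <= prob <= 1.0:
            raise ValueError(f"probability out of range for {sk_id}: {prob}")
        writer.writerow([sk_id, f"{prob:.6f}"])


# Made-up IDs and probabilities, written to an in-memory buffer for demonstration
buffer = io.StringIO()
write_deliverable([100001, 100005], [0.08, 0.73], buffer)
deliverable_text = buffer.getvalue()
```

To write to disk instead, pass an open file such as open("my_deliverable.csv", "w", newline="") in place of the buffer (the filename is hypothetical).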

This project is based on a Kaggle competition, with a subset of the data provided for the sake of download size. Note that this data has not been cleaned for you, and you should expect to deal with real-world data issues such as missing values, bad values, and class imbalances.
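
A quick way to see how much of this you are dealing with is to measure missing values per column and the balance of the target. The sketch below uses pandas on a tiny made-up frame; INCOME and CONTRACT are hypothetical column names, and in the notebook you would load default_risk_train_data.csv with pd.read_csv instead.

```python
import pandas as pd

# Tiny made-up frame standing in for default_risk_train_data.csv
df = pd.DataFrame({
    "SK_ID_CURR": [1, 2, 3, 4, 5],
    "INCOME": [50000.0, None, 72000.0, 41000.0, None],        # hypothetical numeric column
    "CONTRACT": ["cash", "cash", None, "revolving", "cash"],  # hypothetical categorical column
    "TARGET": [0, 0, 0, 1, 0],
})

# Fraction of missing values per column, worst first
missing_share = df.isna().mean().sort_values(ascending=False)

# Share of each class in the target; a small share for class 1 signals imbalance
class_share = df["TARGET"].value_counts(normalize=True)
```

Columns that are mostly missing are candidates for dropping, while a heavily skewed class_share is a reason to prefer a ranking metric like AUC over plain accuracy.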

So what should we do?

To successfully complete this challenge, you'll need to:

1. Become an expert on the data.
2. Clean the data.
3. Engineer the features for your model(s).
4. Test/validate your models.
5. Generate the deliverable the bank expects.

Here are some tips/questions to consider along the way:

- Identify which columns are numerical and which are categorical.
- Which columns are missing values, and what should be done about the missing values?
- Which features are relevant and why?
- Which features might you want to remove?
- What new features might you create?
- How will you deal with categorical data (e.g. label encoding, one-hot encoding)?
- Is there any class imbalance?
- What models will you try? sklearn has been installed in your environment, and linear regression, logistic regression, and random forest models have been imported in the given notebook. Feel free, however, to use the library/models of your choice.
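
Several of the tips above can be combined into a single sklearn pipeline. The following is one possible sketch, not the notebook's solution: it splits columns by dtype, imputes missing values, one-hot encodes categoricals, fits a class-weighted logistic regression, and estimates AUC with cross-validation. The data is synthetic, and every column name other than TARGET is hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in for the training data (hypothetical columns)
rng = np.random.default_rng(0)
n = 400
X = pd.DataFrame({
    "INCOME": rng.normal(50_000, 15_000, n),
    "CONTRACT": rng.choice(["cash", "revolving"], n),
})
# Defaults are the minority class and become likelier as income falls
p_default = 1 / (1 + np.exp((X["INCOME"].to_numpy() - 30_000) / 10_000))
y = pd.Series((rng.random(n) < p_default).astype(int), name="TARGET")
# Knock out about 10% of incomes to mimic missing values
X.loc[rng.random(n) < 0.1, "INCOME"] = np.nan

# First guess at numerical vs. categorical columns, based on dtype
numeric_cols = X.select_dtypes(include="number").columns.tolist()
categorical_cols = X.select_dtypes(exclude="number").columns.tolist()

preprocess = ColumnTransformer([
    ("num", Pipeline([
        ("impute", SimpleImputer(strategy="median")),
        ("scale", StandardScaler()),
    ]), numeric_cols),
    ("cat", Pipeline([
        ("impute", SimpleImputer(strategy="most_frequent")),
        ("onehot", OneHotEncoder(handle_unknown="ignore")),
    ]), categorical_cols),
])

model = Pipeline([
    ("preprocess", preprocess),
    # class_weight="balanced" reweights the rare default class
    ("clf", LogisticRegression(class_weight="balanced", max_iter=1000)),
])

# 5-fold cross-validated AUC on the training data only
scores = cross_val_score(model, X, y, cv=5, scoring="roc_auc")
mean_auc = scores.mean()
```

After calling model.fit(X, y), model.predict_proba(X_test)[:, 1] returns the predicted probability of default for each row, which is the value the deliverable needs. Keeping preprocessing inside the pipeline means imputation and scaling are refit within each fold, so validation rows do not leak into training.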
