Train/test distributions

In a machine learning interview, you will most certainly work with training data and test data. As discussed earlier, poor model performance can result if the distributions of training and test datasets differ.

In this exercise, you'll use functions from sklearn.model_selection as well as seaborn and matplotlib.pyplot to split loan_data into a training set and a test set, as well as visualize their distributions to spot any discrepancies.

Note that seaborn and matplotlib.pyplot have already been imported to your workspace and aliased as sns and plt, respectively.

The pipeline now includes Train/Test split:

Machine learning pipeline

Subset loan_data to only the Credit Score and Annual Income features, and the target variable Loan Status in that order.
Create an 80/20 split of loan_data and assign it to loan_data_subset.
Create pairplots of trainingSet and testSet (in that order) setting the hue argument to the target variable Loan Status.

Data Pre-processing and Visualization

Supervised Learning

Unsupervised Learning

Model Selection and Evaluation

Exercise

Train/test distributions

Instructions