XGBoost: Fit/Predict
It's time to create your first XGBoost model! As Sergey showed you in the video, you can use the scikit-learn .fit() / .predict() paradigm that you are already familiar with to build your XGBoost models, because the xgboost library has a scikit-learn compatible API.
Here, you'll be working with churn data. This dataset contains imaginary data from a ride-sharing app: user behaviors over their first month of app usage in a set of imaginary cities, as well as whether they used the service 5 months after sign-up. It has been pre-loaded for you into a DataFrame called churn_data. Explore it in the Shell!
Your goal is to use the first month's worth of data to predict whether the app's users will remain users of the service at the 5-month mark. This is a typical setup for a churn prediction problem. To do this, you'll split the data into training and test sets, fit a small xgboost model on the training set, and evaluate its performance on the test set by computing its accuracy.
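The accuracy mentioned above is simply the fraction of test-set predictions that match the true labels. A minimal sketch of that calculation follows; the two small arrays are made up for illustration and are not taken from churn_data.

```python
import numpy as np

# Hypothetical predictions and true labels (illustrative values only)
preds = np.array([1, 0, 1, 1])
y_true = np.array([1, 0, 0, 1])

# Count the positions where prediction and label agree, then divide by the total
accuracy = float(np.sum(preds == y_true)) / y_true.shape[0]
print("accuracy: %f" % accuracy)  # 3 of 4 match, so this prints accuracy: 0.750000
```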
pandas and numpy have been imported as pd and np, and train_test_split has been imported from sklearn.model_selection. Additionally, the arrays for the features and the target have been created as X and y.
This exercise is part of the course
Extreme Gradient Boosting with XGBoost
Exercise instructions
- Import xgboost as xgb.
- Create training and test sets such that 20% of the data is used for testing. Use a random_state of 123.
- Instantiate an XGBClassifier as xg_cl using xgb.XGBClassifier(). Specify n_estimators to be 10 estimators and an objective of 'binary:logistic'. Do not worry about what this means just yet; you will learn about these parameters later in this course.
- Fit xg_cl to the training set (X_train, y_train) using the .fit() method.
- Predict the labels of the test set (X_test) using the .predict() method and hit 'Submit Answer' to print the accuracy.
Hands-on interactive exercise
Have a go at this exercise by completing this sample code.
# Import xgboost
____
# Create arrays for the features and the target: X, y
X, y = churn_data.iloc[:,:-1], churn_data.iloc[:,-1]
# Create the training and test sets
X_train, X_test, y_train, y_test = ____(____, ____, test_size=____, random_state=123)
# Instantiate the XGBClassifier: xg_cl
xg_cl = ____.____(____='____', ____=____, seed=123)
# Fit the classifier to the training set
____
# Predict the labels of the test set: preds
preds = ____
# Compute the accuracy: accuracy
accuracy = float(np.sum(preds==y_test))/y_test.shape[0]
print("accuracy: %f" % (accuracy))