Logistic regression for breast cancer
In the last exercise, we took a first look at the data. In this exercise, you will define a training and testing split for a logistic regression model on a breast cancer dataset. Holding out test data is an important first step before fitting any supervised machine learning model, because it lets you evaluate the model on samples it was not trained on.
The breast cancer dataset is a sample dataset from sklearn with various features measured for each patient, and a target value indicating whether the tumor is malignant or benign. The data comes as a dictionary-like object, where the features are stored in an array called data and the target values are stored in an array called target. Hence, cancer_data.data holds the features and cancer_data.target holds the targets. The sample data is loaded as cancer_data, and pandas is loaded as pd. LogisticRegression is available via sklearn.linear_model.
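To make the layout of the dataset concrete, here is a minimal sketch for inspecting it outside the exercise environment. It assumes that cancer_data is the object returned by sklearn.datasets.load_breast_cancer(); the exercise preloads it for you, so the loading call is an assumption rather than something the exercise shows.

```python
from sklearn.datasets import load_breast_cancer

# Assumption: the preloaded cancer_data corresponds to this sklearn sample dataset.
cancer_data = load_breast_cancer()

# Features: one row per sample, one column per measured feature.
print(cancer_data.data.shape)

# Targets: one binary label per sample.
print(cancer_data.target.shape)

# Human-readable names for the two target classes.
print(cancer_data.target_names)

# The object is dictionary-like, so key access and attribute access
# return the same underlying array.
print(cancer_data["data"] is cancer_data.data)
```

Checking the shapes first is a cheap way to confirm that the features and targets line up row for row before slicing them.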
This exercise is part of the course
Predicting CTR with Machine Learning in Python
Exercise instructions
- Define both X and y using data and target, respectively.
- Make X_train and y_train the first 300 samples of X and y, respectively, using X[:300] for X_train.
- Make X_test and y_test the remainder of X and y, respectively (excluding those first 300 samples), using X[300:] for X_test.
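The slicing pattern in the instructions can be illustrated on a small toy array. This is only a sketch: the names X_toy, y_toy, and cut are hypothetical and are not part of the exercise, and the array is synthetic rather than the real dataset.

```python
import numpy as np

# Hypothetical toy data: 10 samples with 2 features each, plus 10 labels.
X_toy = np.arange(20).reshape(10, 2)
y_toy = np.arange(10) % 2

# Hypothetical cut point; the exercise itself uses 300.
cut = 6

# [:cut] keeps rows 0..cut-1, [cut:] keeps everything from row cut onward.
X_head, X_tail = X_toy[:cut], X_toy[cut:]
y_head, y_tail = y_toy[:cut], y_toy[cut:]

print(X_head.shape, X_tail.shape)

# The two slices do not overlap and together cover every row.
print(np.array_equal(np.vstack([X_head, X_tail]), X_toy))
```

Using the same cut point for both the feature array and the target array keeps each sample paired with its own label.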
Hands-on interactive exercise
Have a go at this exercise by completing this sample code.
# Define X and y
X = cancer_data.____
y = cancer_data.____
# Define training and testing data
X_train = X[____]
X_test = X[____]
y_train = y[____]
y_test = y[____]
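Once the split is defined, it can be passed to a logistic regression model. The following is a sketch of one possible next step, not the exercise's official solution. It assumes cancer_data comes from load_breast_cancer(), and the max_iter value and the variable names split, clf, and accuracy are choices made here for illustration.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression

# Assumption: the preloaded cancer_data corresponds to this sklearn sample dataset.
cancer_data = load_breast_cancer()
X, y = cancer_data.data, cancer_data.target

# Ordered split as described in the instructions: first 300 rows for
# training, the remainder for testing.
split = 300
X_train, X_test = X[:split], X[split:]
y_train, y_test = y[:split], y[split:]

# max_iter is raised above the default because the features are unscaled
# and the solver may otherwise stop before converging (illustrative choice).
clf = LogisticRegression(max_iter=10000)
clf.fit(X_train, y_train)

# Mean accuracy on the held-out samples.
accuracy = clf.score(X_test, y_test)
print(f"Held-out samples: {len(y_test)}, accuracy: {accuracy:.3f}")
```

An ordered split like this is simple but depends on how the rows happen to be ordered; a shuffled split, for example with sklearn.model_selection.train_test_split, is a common alternative when row order might carry information.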