Logistic regression for breast cancer
In the last exercise, we did a first evaluation of the data. In this exercise, you will define a training and testing split for a logistic regression model on a breast cancer dataset. Splitting the data this way is an important first step before fitting any machine learning model.
The breast cancer dataset is a sample dataset from sklearn with various features from patients, and a target value of whether or not the patient has breast cancer. The data comes in a dictionary-like format, where the main data is stored in an array called data, and the target values are stored in an array called target. Hence, cancer_data.data holds the features and cancer_data.target holds the targets. The sample data is loaded as cancer_data, along with pandas as pd. LogisticRegression is available via sklearn.linear_model.
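To reproduce this setup outside the exercise environment, the following sketch may help. It assumes that cancer_data was created with sklearn.datasets.load_breast_cancer, which the exercise does not state explicitly, so treat that loader call as an assumption.

```python
# Assumption: the exercise's cancer_data comes from load_breast_cancer().
from sklearn.datasets import load_breast_cancer

cancer_data = load_breast_cancer()

# Features: one row per sample, one column per measurement.
print(cancer_data.data.shape)    # (569, 30)

# Targets: one label per sample.
print(cancer_data.target.shape)  # (569,)

# Names of the classes encoded in the target array.
print(cancer_data.target_names)
```

Both attributes are NumPy arrays, so they can be sliced by row index in the usual way.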
This exercise is part of the course
Predicting CTR with Machine Learning in Python
Exercise instructions
- Define both X and y using data and target, respectively.
- Make X_train and y_train the first 300 samples of X and y, respectively, using X[:300] for X_train.
- Make X_test and y_test the remainder of X and y, respectively (excluding those first 300 samples), using X[300:] for X_test.
Hands-on interactive exercise
Have a go at this exercise by completing this sample code.
# Define X and y
X = cancer_data.____
y = cancer_data.____
# Define training and testing data
X_train = X[____]
X_test = X[____]
y_train = y[____]
y_test = y[____]
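One possible way to complete the template is sketched below. It follows the slicing described in the instructions and again assumes the data was loaded with sklearn.datasets.load_breast_cancer; it is an illustration, not the official solution.

```python
# Assumption: cancer_data is loaded with load_breast_cancer().
from sklearn.datasets import load_breast_cancer

cancer_data = load_breast_cancer()

# Define X and y
X = cancer_data.data
y = cancer_data.target

# Define training and testing data:
# the first 300 samples for training, the remainder for testing.
X_train = X[:300]
X_test = X[300:]
y_train = y[:300]
y_test = y[300:]

print(X_train.shape, X_test.shape)
```

With 569 samples in total, this leaves 300 rows for training and 269 for testing.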