Scaling the data
For ML algorithms that use distance-based metrics, it is crucial to scale your data, because features measured on different scales will distort your results. K-means uses the Euclidean distance to measure how far each point is from the cluster centroids, so you need to scale your data before implementing the algorithm. Let's do that first.
The dataframe df from the previous exercise is available, with some minor data preparation done so that it is ready to use with sklearn. The fraud labels are stored separately under labels; you can use those to check the results later. numpy has been imported as np.
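To make the point about scale concrete, here is a minimal sketch using made-up numbers (the three points and the helper euclid are hypothetical, not part of the exercise data). It compares Euclidean distances before and after a hand-written min-max rescaling.

```python
import numpy as np

# Hypothetical toy points: column 0 is a large-valued feature (e.g. an amount),
# column 1 is a small-valued feature (e.g. a ratio between 0 and 1).
a = np.array([1000.0, 0.1])
b = np.array([1010.0, 0.9])   # close to a in column 0, far from a in column 1
c = np.array([2000.0, 0.1])   # far from a in column 0, identical in column 1
X = np.vstack([a, b, c])


def euclid(u, v):
    """Euclidean distance between two 1-D arrays."""
    return float(np.linalg.norm(u - v))


# On the raw data, column 0 dominates the distance almost entirely.
d_ab_raw = euclid(a, b)
d_ac_raw = euclid(a, c)

# Min-max scaling by hand: map each column onto the 0-1 range.
col_min = X.min(axis=0)
col_max = X.max(axis=0)
X_scaled = (X - col_min) / (col_max - col_min)

# After scaling, both columns contribute on a comparable footing.
d_ab_scaled = euclid(X_scaled[0], X_scaled[1])
d_ac_scaled = euclid(X_scaled[0], X_scaled[2])

print("raw:   ", d_ab_raw, d_ac_raw)
print("scaled:", d_ab_scaled, d_ac_scaled)
```

On the raw values, the difference in the second feature barely registers; after rescaling, a full-range difference in either feature counts about equally.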
This exercise is part of the course
Fraud Detection in Python
Exercise instructions
- Import the MinMaxScaler.
- Transform your dataframe df into a numpy array X by taking only the values of df, and make sure you have all float values.
- Apply the defined scaler to X to obtain the scaled values X_scaled, forcing all your features onto a 0-1 scale.
Hands-on interactive exercise
Try this exercise by completing this sample code.
# Import the scaler
from sklearn.preprocessing import ____
# Take the float values of df for X
X = df.values.astype(np.____)
# Define the scaler and apply to the data
scaler = ____()
X_scaled = scaler.____(X)
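
For reference, one way the three steps could fit together is sketched below. This is not the course's official solution: the small df here is a hypothetical stand-in for the exercise's dataframe, and np.float64 is an assumption for the float type (the older np.float alias was removed in NumPy 1.24).

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Hypothetical stand-in for the exercise's df (the column names are made up).
df = pd.DataFrame({"amount": [10, 250, 4000], "age": [0, 3, 5]})

# Take the values of df as a numpy array of floats.
X = df.values.astype(np.float64)

# Define the scaler, then fit it to X and transform X in a single step.
scaler = MinMaxScaler()
X_scaled = scaler.fit_transform(X)

# Each column of X_scaled now runs from 0 to 1.
print(X_scaled.min(axis=0), X_scaled.max(axis=0))
```

fit_transform learns each column's minimum and maximum and applies the rescaling in one call; calling fit and then transform separately would give the same result here.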