Dealing with stray characters (II)
In the last exercise, you could quickly tell from the df.head() call which characters were causing an issue. In many cases this will not be so apparent. There will often be values deep within a column that prevent you from casting it to a numeric type so that it can be used in a model or in further feature engineering.
One approach to finding these values is to force the column to the desired data type using pd.to_numeric(), coercing any values that cause issues to NaN, and then filtering the DataFrame down to just the rows containing the NaN values.
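As a rough illustration of this coerce-then-filter approach, here is a minimal sketch. The DataFrame `df` and its values are made up for the example (they are not the course's so_survey_df data), and `problem_rows` is a hypothetical name.

```python
import pandas as pd

# Hypothetical data for illustration only, not the course dataset
df = pd.DataFrame({"RawSalary": ["1000.50", "2500", "37#00", "420.75", "abc"]})

# Entries that cannot be parsed become NaN instead of raising an error
numeric_vals = pd.to_numeric(df["RawSalary"], errors="coerce")

# Boolean mask marking the rows that failed to convert
idx = numeric_vals.isna()

# Look up the original strings for those rows to see what went wrong
problem_rows = df.loc[idx, "RawSalary"]
print(problem_rows)
```

Note that values which were already missing before the conversion will also appear in this filter, so it may be worth checking the original column for missing values separately.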
Try to cast the RawSalary column as a float and it will fail, as an additional character can now be found in it. Find the character and remove it so the column can be cast as a float.
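Once the offending character is known, removing it and casting might look like the sketch below. The Series `salaries` and the '#' character are assumptions chosen for illustration; substitute whichever character you actually find in the column.

```python
import pandas as pd

# Hypothetical column; '#' stands in for whatever stray character you find
salaries = pd.Series(["1000.50", "2500", "#3700"], name="RawSalary")

# Remove the character literally (regex=False avoids regex interpretation)
cleaned = salaries.str.replace("#", "", regex=False)

# With the stray character gone, the cast to float should succeed
as_float = cleaned.astype("float")
print(as_float)
```

Passing regex=False is a deliberate choice here: characters such as '$' or '.' have special meaning in regular expressions, so a literal replacement is the safer default when stripping a single symbol.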
This exercise is part of the course
Feature Engineering for Machine Learning in Python
Hands-on interactive exercise
Have a go at this exercise by completing this sample code.
# Attempt to convert the column to numeric values
numeric_vals = ____(so_survey_df['RawSalary'], errors='coerce')
# Find the indexes of missing values
idx = ____
# Print the relevant rows
print(so_survey_df['RawSalary']____)