1. Dealing with other data issues
Up to this point you have used multiple approaches to creating and updating features when missing values are present in the data, but data issues are of course not limited to just this. In some instances, you will come across features that need to be updated in some other way.
Take for example the case of a column containing a monetary value. If this dataset has been imported from excel it may contain characters such as currency signs or commas that prevents pandas from reading it as numeric values.
2. Bad characters
For example, lets look at the data type of the RawSalary column. It's an object, although intuitively, you know that it should be numeric. So why is that?
3. Bad characters
Let's take a quick peek at the data. Numeric columns should not contain any non-numeric characters. So you need to remove these commas.
4. Dealing with bad characters
Although you want the column to be a numeric column, it is of type object, which means you can use string methods to fix this column. In this case, we want to remove all occurrences of comma.
We can easily achieve this by accessing the str accessor and using the replace() method. The first argument is the string you want to replace, which is the comma, and the second argument is the string you want to replace it with, which here is an empty string, which simply means you want to remove all the commas.
However, the data type of this column is still object. Now you can convert your column to the relevant type as shown here.
5. Finding other stray characters
But what if attempting to change the data type raises an error? This may indicate that there are additional stray characters which you didn't account for. Instead of manually searching for values with other stray characters you can use the to_numeric() function from pandas along with the errors argument. If you set the errors argument to 'coerce', Pandas will convert the column to numeric, but all values that can't be converted to numeric will be changed to NaNs, that is missing values.
6. Finding other stray characters
You can now use the isna() method like you did earlier to find out which values failed to parse. So it looks like we also have dollar signs. You can again use the replace() method as before to remove the dollar signs.
7. Chaining methods
Before you get going onto trying these for yourself, it will be useful to delve a little deeper into method chaining. If you are applying different methods or in fact the same method several times on a column, instead of assigning the result back to the column after each iteration, you can simply chain the methods, that is, call one method after the other to obtain the desired result.
For example, cleaning up characters, changing the data type, normalizing the values etc. can all be achieved by simply calling the methods one after the other as seen here.
8. Go ahead and fix bad characters!
Now that you know how to deal with stray characters, let's put it into practice.