1. Numeric variables
As mentioned in the previous lesson, most machine learning models will require your data to be in numeric format.
However, even if your raw data is all numeric, there is still a lot you can do to improve your features.
2. Types of numeric features
Numeric features can be used to represent a huge array of different characteristics and measurements. Pretty much anything that can be quantitatively measured can be recorded as numeric data. Examples include age, the price of an item, counts, and even spatial data such as coordinates.
Depending on the use case, numeric features can be treated in several different ways. We will work through a few of the considerations and possible feature engineering steps to keep in mind when dealing with numeric data.
3. Does size matter?
One of the first questions you should ask when working with numeric features is whether the magnitude of the feature is its most important trait, or just its direction. For example, if you had a dataset of restaurant health and safety ratings containing the number of times a restaurant had major violations, you might care far more about whether the restaurant had any major violations at all (as you would rather not take any chances) than about whether it was a repeat offender.
Looking at this toy dataset containing restaurant IDs and the number of times they had major violations, we can see that some restaurants have no major violations but many have one or more.
We will be creating a new binary column representing whether or not a restaurant committed any violation.
4. Binarizing numeric variables
Here we first create a new column Binary_Violation and set it to zero. Then, we use the dot loc notation to find all rows where Number_of_Violations is greater than zero and set the Binary_Violation column to 1.
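The step just described can be sketched as follows. This is a minimal illustration, assuming a pandas DataFrame named df; the Restaurant_ID column name and all of the values are made up for the example, and only Number_of_Violations and Binary_Violation come from the lesson.

```python
import pandas as pd

# Hypothetical toy data: the IDs and counts are illustrative only
df = pd.DataFrame({
    "Restaurant_ID": ["RS_1", "RS_2", "RS_3", "RS_4", "RS_5"],
    "Number_of_Violations": [0, 0, 2, 1, 0],
})

# Start with every row marked as "no violation"
df["Binary_Violation"] = 0

# Use .loc to select rows with at least one violation and set the flag to 1
df.loc[df["Number_of_Violations"] > 0, "Binary_Violation"] = 1

print(df)
```

Creating the column first and then overwriting the matching rows keeps the two steps explicit, which mirrors the order in which they are described.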
5. Binarizing numeric variables
As you can see here, all rows where Number_of_Violations is equal to 0 are also zero in Binary_Violation, whereas all rows where Number_of_Violations is greater than zero have a 1 in Binary_Violation.
6. Binning numeric variables
As an extension of this, you may wish to group a numeric variable into more than two bins. This is often useful for variables such as age or wage brackets, where exact numbers are less relevant than the general magnitude of the value.
Consider the same dataset of restaurant health and safety ratings containing the number of times a restaurant has had major violations. This time we will create three groups: Group 1 for restaurants with no offenses, Group 2 for restaurants with one or two offenses, and Group 3 for all restaurants with three or more offenses.
Bins are created using the pandas cut() function. You can define the intervals using the bins argument as shown here, which in this case is a list of 4 values that mark the edges of the three intervals. You can also pass a list of labels like so.
7. Binning numeric variables
Note that because we want to include 0 in the first bin, we must set the leftmost edge to a value lower than that. By default, each bin excludes its left edge and includes its right edge, so all values between negative infinity and 0 are labeled as 1, all values equal to 1 or 2 are labeled as 2, and values greater than 2 are labeled as 3.
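The binning into three groups can be sketched like this. It is a minimal illustration, assuming a pandas DataFrame named df; the Restaurant_ID and Binned_Group column names and all of the values are hypothetical, chosen only for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: the IDs and counts are illustrative only
df = pd.DataFrame({
    "Restaurant_ID": ["RS_1", "RS_2", "RS_3", "RS_4", "RS_5", "RS_6"],
    "Number_of_Violations": [0, 1, 2, 3, 0, 5],
})

# Four edges define three intervals: (-inf, 0], (0, 2], (2, inf)
# The leftmost edge is below 0 so that 0 falls into the first bin
df["Binned_Group"] = pd.cut(
    df["Number_of_Violations"],
    bins=[-np.inf, 0, 2, np.inf],
    labels=[1, 2, 3],
)

print(df)
```

The resulting column is categorical, so the labels 1, 2, and 3 are group names, not quantities to do arithmetic on.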
8. Let's start practicing!
Now that you know how to binarize and bin numeric columns, it's time for you to put this into practice.