Get startedGet started for free

Why generate features?

1. Why generate features?

Hello and welcome to Feature Engineering for Machine Learning in Python. My name is Robert O’Callaghan and I am a Data Scientist.

2. Feature Engineering

Feature engineering is the act of taking raw data and extracting features from it that are suitable for tasks like machine learning. Most machine learning algorithms work with tabular data. When we talk about features, we are referring to the information stored in the columns of these tables. For example, if we were looking at information on houses, the features would be things like square foot, number of rooms, etc. This course is designed for data scientists who want to expand their knowledge of how to incorporate feature engineering into their data science workflow.

3. Different types of data

Most machine learning algorithms require their input data to be represented as a vector or a matrix, and many assume that the data is distributed normally. In the real world, more often than not you will receive data that is not in this format. You will also need to work with many different types of data, some data types you will often encounter are: continuous variables, categorical data, ordinal data, boolean values, and dates and times. Dealing with these is manageable, but requires a well thought out approach. Feature engineering is often overlooked in machine learning discussions, but any real-world practitioner will confirm that data manipulation and feature engineering is the most important aspect of the project.

4. Course structure

Over the span of this course, we will be addressing how to deal with many different types of data and how to convert them into a format that can be easily used for machine learning. In the first chapter, you will ingest and create basic features from tabular data. In the second chapter, you will learn how to deal with data that has missing values. You will then move on to transforming your data so that it conforms to statistical assumptions often necessary for machine learning models, and finally, you will convert free form text into tabular data so it can be used with machine learning models.

5. Pandas

Now lets jump straight in with some examples. During this course we will be leveraging the pandas package substantially as it is very useful when working with data in tabular form. It is a common practice to import pandas using the pd alias. You can use the read_csv() function to import a CSV file and use the head() method to quickly look at the first few rows of the DataFrame.

6. Dataset

For the first three chapters of this course, you will be working with a modified subset of the Stackoverflow survey response data. This dataset records the details and preferences of hundreds of users of the StackOverflow website.

7. Column names

To see the features used in this subset, you can use the DataFrame columns attribute to print the names of all the columns in the DataFrame.

8. Column types

To print the data type of each column, you can use the dtypes attribute. Here you can see three different data types - integers, floats and objects - in pandas objects are columns that contain strings.

9. Selecting specific data types

Knowing the types of each column can be very useful if you are performing analysis based on a subset of specific data types. To do this, you can use the select_dtypes() method and pass a list of relevant data types to the include argument. For example, if you want to select only the integer columns, call the select_dtypes() method on df and set the include argument to 'int'.

10. Lets get going!

Lets get right into it and start practicing.