Get startedGet started for free

Feature engineering

1. Feature engineering

Well done on the exercises! We will now look into feature engineering.

2. Feature engineering

After the design phase, feature engineering is the next step in the machine learning development process.

3. Feature engineering

Feature engineering is the process of selecting, manipulating, and transforming raw data into features. A feature is a variable, such as a column in a table. How we create features from the data is an important part of machine learning development. We can use the features as they appear in the raw data, but we can also build our own. Let's look at an example.

4. Customer data

In this example dataset, we have a set of customers and some information about their orders. The number of orders and total expenditure of each customer are two different features. To build a new feature, we could calculate the average expenditure per customer.

5. Customer data

In this way, we have engineered a new feature. Apart from the order data in this example, there is often much more data available in actual machine learning projects, and thus a lot of possibilities to engineer features.

6. Feature engineering

The goal of feature engineering is to enhance model performance by identifying the most informative features. Rather than increasing the number of features indiscriminately, it's about selecting the most relevant ones. Tools and techniques like feature selection, a feature store, and data version control, help automate and structure the feature engineering process.

7. Feature selection

Leveraging domain-specific knowledge is essential for creating and selecting features that are both relevant and impactful. This expertise helps ensure that the features used in the model are not only statistically significant but also meaningful in the context of the problem. Additionally, we can examine correlations to identify and eliminate redundant features. We can also evaluate feature importance to understand which features significantly influence the model’s predictions. Other techniques such as univariate selection, Principal Component Analysis (PCA), and Recursive Feature Elimination (RFE) can further refine the feature set, enhancing model performance by focusing on the most valuable features.

8. The feature store

A feature store is a centralized repository for features, allowing data scientists to discover, define, and reuse features across projects. Feature stores are essential in large teams where multiple projects need consistent and reusable features. However, they're not always necessary. If features are already in an optimal format or if projects are less complex, a feature store may not add significant value.

9. Data version control

Data version control is a tool that can be useful for tracking dataset changes and maintaining consistency throughout the machine learning lifecycle. It works in similar fashion as Git, which is a tool that allows us to track changes in code over time. Data version control helps to track versions of datasets such that we could easily find and rollback to earlier versions.

10. Let's practice!

Now that we've learned about feature engineering, let's dive into some exercises.