In this chapter, you will explore what feature engineering is and how to start applying it to real-world data. You will load, explore, and visualize a survey response dataset, and in doing so you will learn about its underlying data types and why they influence how you should engineer your features. Using the pandas package, you will create new features from both categorical and continuous columns.
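As a rough illustration of the ideas in this chapter, the sketch below groups uncommon categories, one-hot encodes a categorical column, and then binarizes and bins a continuous one with pandas. The data, the column names (`Country`, `Age`), the rarity threshold, and the bin edges are all made up for the example and are not taken from the course dataset.

```python
import pandas as pd

# Hypothetical survey-style data; names and values are illustrative only.
df = pd.DataFrame({
    "Country": ["USA", "India", "USA", "France", "India", "Kenya"],
    "Age": [23, 35, 41, 29, 52, 18],
})

# Group uncommon categories (seen fewer than 2 times) under "Other".
counts = df["Country"].value_counts()
rare = counts[counts < 2].index
df["Country"] = df["Country"].where(~df["Country"].isin(rare), "Other")

# One-hot encode the categorical column (one indicator column per category).
encoded = pd.get_dummies(df, columns=["Country"], prefix="C")

# Binarize the continuous column: 1 if older than 30, else 0.
encoded["Over_30"] = (encoded["Age"] > 30).astype(int)

# Bin the continuous column into labeled ranges: (0, 25], (25, 40], (40, 120].
encoded["Age_group"] = pd.cut(
    encoded["Age"], bins=[0, 25, 40, 120], labels=["young", "mid", "senior"]
)

print(encoded)
```

Passing `drop_first=True` to `pd.get_dummies` would produce dummy variables with one reference category dropped instead of a full one-hot encoding.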

Why generate features?

Getting to know your data

Selecting specific data types

Dealing with categorical features

One-hot encoding and dummy variables

Dealing with uncommon categories

Numeric variables

Binarizing columns

Binning values

Creating Features

This chapter introduces you to the reality of messy and incomplete data. You will learn how to find where your data has missing values and explore several approaches to dealing with them. You will also use string manipulation techniques to remove unwanted characters from your dataset.
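The following is a minimal sketch of the techniques this chapter names: counting missing values, listwise deletion, filling with a constant, stripping stray characters through method chaining, and filling a continuous column with its mean. The data and the column names (`Gender`, `Salary`) are invented for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical messy data; names and values are illustrative only.
df = pd.DataFrame({
    "Gender": ["Male", np.nan, "Female", "Male"],
    "Salary": ["$50,000", "$62,500", np.nan, "$48,000"],
})

# How sparse is the data? Count the missing values in each column.
missing_per_column = df.isnull().sum()

# Listwise deletion: keep only rows with no missing values at all.
complete_rows = df.dropna(how="any")

# Replace missing categorical values with a constant.
df["Gender"] = df["Gender"].fillna("Not Given")

# Strip stray characters with method chaining, then convert to float.
df["Salary"] = (
    df["Salary"]
    .str.replace("$", "", regex=False)
    .str.replace(",", "", regex=False)
    .astype(float)
)

# Fill the remaining continuous gap with the column mean.
df["Salary"] = df["Salary"].fillna(df["Salary"].mean())

print(df)
```

In a predictive model, the fill value would normally be computed on the training set only and then reused on the test set, so that no information leaks from the test data.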

Why do missing values exist?

How sparse is my data?

Finding the missing values

Dealing with missing values (I)

Listwise deletion

Replacing missing values with constants

Dealing with missing values (II)

Filling continuous missing values

Imputing values in predictive models

Dealing with other data issues

Dealing with stray characters (I)

Dealing with stray characters (II)

Method chaining

Dealing with Messy Data

In this chapter, you will focus on analyzing the underlying distribution of your data and whether it will affect your machine learning pipeline. You will learn how to deal with skewed data and with situations where outliers may be distorting your analysis.
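One way to picture these steps is the sketch below, which applies a log transformation, removes outliers above a percentile cutoff, and standardizes a column using statistics learned from the training data only. The numbers, the `Salary` column, and the 95th-percentile cutoff are assumptions made for illustration. It uses plain pandas and NumPy rather than a scaler class.

```python
import numpy as np
import pandas as pd

# Hypothetical train/test split; values are illustrative only.
train = pd.DataFrame({"Salary": [30_000.0, 42_000.0, 55_000.0, 61_000.0, 900_000.0]})
test = pd.DataFrame({"Salary": [38_000.0, 70_000.0]})

# Log transformation: compresses the long right tail of a skewed column.
train["Salary_log"] = np.log(train["Salary"])

# Percentage-based outlier removal: drop rows above the 95th percentile.
cutoff = train["Salary"].quantile(0.95)
trimmed = train[train["Salary"] < cutoff].copy()

# Standardization: learn the mean and standard deviation on the training data...
mean = trimmed["Salary"].mean()
std = trimmed["Salary"].std()  # sample standard deviation (ddof=1)
trimmed["Salary_std"] = (trimmed["Salary"] - mean) / std

# ...and reuse the same statistics on the test data, never refitting on it.
test["Salary_std"] = (test["Salary"] - mean) / std

print(trimmed)
print(test)
```

Note that `Series.std()` in pandas defaults to the sample standard deviation, so the results differ slightly from tools that use the population standard deviation.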

Data distributions

What does your data look like? (I)

What does your data look like? (II)

When don't you have to transform your data?

Scaling and transformations

Normalization

Standardization

Log transformation

When can you use normalization?

Removing outliers

Percentage-based outlier removal

Statistical outlier removal

Scaling and transforming new data

Train and testing transformations (I)

Train and testing transformations (II)

Conforming to Statistical Assumptions

Finally, in this chapter, you will work with unstructured text data and learn ways to engineer columnar features out of a text corpus. You will compare how different approaches affect how much context is extracted from a text, and how to balance the need for context against creating too many features.
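A small sketch of the kind of text features described here: cleaning the text, deriving high-level character and word counts, and counting bigrams. The two sentences and the helper `ngrams` are invented for the example. It uses only pandas and the Python standard library, not a dedicated vectorizer, so it should be read as an illustration of the idea rather than the method used in the course.

```python
from collections import Counter

import pandas as pd

# Hypothetical two-document corpus; the text is illustrative only.
speeches = pd.DataFrame({"text": [
    "Fellow citizens, we the people!",
    "We the people of the United States...",
]})

# Clean the text: keep letters and whitespace only, then lowercase.
speeches["text_clean"] = (
    speeches["text"]
    .str.replace(r"[^a-zA-Z\s]", "", regex=True)
    .str.lower()
)

# High-level text features: character count and word count per document.
speeches["char_cnt"] = speeches["text_clean"].str.len()
speeches["word_cnt"] = speeches["text_clean"].str.split().str.len()


def ngrams(text, n):
    """Return the list of n-word sequences in a whitespace-tokenized text."""
    tokens = text.split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


# Count bigrams across the whole corpus to find the most common ones.
bigram_counts = Counter(
    bigram for doc in speeches["text_clean"] for bigram in ngrams(doc, 2)
)

print(speeches[["char_cnt", "word_cnt"]])
print(bigram_counts.most_common(2))
```

Longer n-grams capture more context but produce many more distinct features, which is the trade-off the chapter description refers to.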

Encoding text

Cleaning up your text

High-level text features

Word counts

Counting words (I)

Counting words (II)

Limiting your features

Text to DataFrame

Term frequency-inverse document frequency

Tf-idf

Inspecting Tf-idf values

Transforming unseen data

N-grams

Using longer n-grams

Finding the most common words

Wrap-up

Dealing with Text Data

Stack Overflow Survey Responses (Modified)

US Presidential Inauguration Addresses

Every day you read about amazing breakthroughs and how the latest applications of machine learning are changing the world. What is often overlooked is that, before those sophisticated models can be used, a great deal of data cleaning and feature engineering has to be done. In this course, you will learn how to do that. You will work with the Stack Overflow developer survey and historic US presidential inauguration addresses to understand how to preprocess and engineer features from categorical, continuous, and unstructured data. This course will give you hands-on experience preparing any data for your own machine learning models.

Supervised Learning with scikit-learn

Learn how to prepare data for machine learning models through feature engineering.

Feature Engineering for Machine Learning in Python

Create new features to improve the performance of your machine learning models.

