Limiting your features

As you have seen, using CountVectorizer with its default settings creates one feature for every word in your corpus. This can produce far too many features, many of which carry very little analytical value.

To address this, CountVectorizer provides parameters that reduce the number of features:

min_df: use only words that occur in at least this proportion of documents (a float between 0 and 1; an integer is read as an absolute document count). This removes rare words that do not generalize across texts.
max_df: use only words that occur in at most this proportion of documents (same float or integer convention). This is useful for dropping very frequent words that appear in nearly every document without adding value, such as "and" or "the".
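
To make the effect of the two parameters concrete, here is a minimal sketch on a hypothetical five-document toy corpus (not the course data). The thresholds 0.3 and 0.8 are chosen only so that the filtering is visible on such a small corpus; they are assumptions for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus of five short documents (illustration only)
docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "the cat chased the dog",
    "the bird sang",
    "the fish swam",
]

# Default settings: one feature per distinct token
cv_all = CountVectorizer()
cv_all.fit(docs)

# Keep only tokens found in at least 30% and at most 80% of documents
cv_limited = CountVectorizer(min_df=0.3, max_df=0.8)
cv_limited.fit(docs)

full_vocab = sorted(cv_all.vocabulary_)
limited_vocab = sorted(cv_limited.vocabulary_)

print(len(full_vocab), full_vocab)
print(len(limited_vocab), limited_vocab)
```

In this toy corpus "the" appears in every document, so max_df drops it, while words that appear in a single document fall below min_df. Only the words shared by two documents remain.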

This exercise is part of the course

<cours>Feature Engineering for Machine Learning in Python</cours>

Exercise instructions

Limit the number of features in CountVectorizer by setting the minimum share of documents in which a word must appear to 20% and the maximum to 80%.
Fit and apply the vectorizer to the text_clean column in a single step.
Convert the transformed (sparse) matrix into a NumPy array of counts.
Print the dimensions of the new, reduced array.
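
The same sequence of steps can be sketched on a hypothetical stand-in DataFrame. Here `toy_df` and its contents are invented for illustration, and the 0.3 / 0.8 thresholds are assumptions chosen for a five-row example rather than the values the exercise asks for.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical stand-in for speech_df with a text_clean column
toy_df = pd.DataFrame({"text_clean": [
    "we the people seek peace",
    "the people want peace and work",
    "the nation needs work",
    "the nation and the people",
    "the future is bright",
]})

# Thresholds picked for this tiny example only
cv = CountVectorizer(min_df=0.3, max_df=0.8)

# Fit and transform in one step; the result is a SciPy sparse matrix
counts_sparse = cv.fit_transform(toy_df["text_clean"])

# Convert the sparse matrix to a dense NumPy array of counts
counts_array = counts_sparse.toarray()

# Rows are documents, columns are the retained words
print(counts_array.shape)
print(sorted(cv.vocabulary_))
```

The dense array has one row per document and one column per retained word, so its shape shows directly how many features survived the filtering.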

Hands-on interactive exercise

Try this exercise by completing the sample code below.

# Import CountVectorizer
from sklearn.feature_extraction.text import CountVectorizer

# Specify arguments to limit the number of features generated
cv = ____

# Fit, transform, and convert into array
cv_transformed = ____(speech_df['text_clean'])
cv_array = ____

# Print the array shape
print(____)

In this chapter, you will explore what feature engineering is and how to get started with applying it to real-world data. You will load, explore and visualize a survey response dataset, and in doing so you will learn about its underlying data types and why they have an influence on how you should engineer your features. Using the pandas package you will create new features from both categorical and continuous columns.

Exercise 1: Why generate features?
Exercise 2: Getting to know your data
Exercise 3: Selecting specific data types
Exercise 4: Dealing with categorical features
Exercise 5: One-hot encoding and dummy variables
Exercise 6: Dealing with uncommon categories
Exercise 7: Numeric variables
Exercise 8: Binarizing columns
Exercise 9: Binning values

This chapter introduces you to the reality of messy and incomplete data. You will learn how to find where your data has missing values and explore multiple approaches on how to deal with them. You will also use string manipulation techniques to deal with unwanted characters in your dataset.

Exercise 1: Why do missing values exist?
Exercise 2: How sparse is my data?
Exercise 3: Finding the missing values
Exercise 4: Dealing with missing values (I)
Exercise 5: Listwise deletion
Exercise 6: Replacing missing values with constants
Exercise 7: Dealing with missing values (II)
Exercise 8: Filling continuous missing values
Exercise 9: Imputing values in predictive models
Exercise 10: Dealing with other data issues
Exercise 11: Dealing with stray characters (I)
Exercise 12: Dealing with stray characters (II)
Exercise 13: Method chaining

In this chapter, you will focus on analyzing the underlying distribution of your data and whether it will impact your machine learning pipeline. You will learn how to deal with skewed data and situations where outliers may be negatively impacting your analysis.

Exercise 1: Data distributions
Exercise 2: What does your data look like? (I)
Exercise 3: What does your data look like? (II)
Exercise 4: When don't you have to transform your data?
Exercise 5: Scaling and transformations
Exercise 6: Normalization
Exercise 7: Standardization
Exercise 8: Log transformation
Exercise 9: When can you use normalization?
Exercise 10: Removing outliers
Exercise 11: Percentage based outlier removal
Exercise 12: Statistical outlier removal
Exercise 13: Scaling and transforming new data
Exercise 14: Train and testing transformations (I)
Exercise 15: Train and testing transformations (II)

Finally, in this chapter, you will work with unstructured text data, understanding ways in which you can engineer columnar features out of a text corpus. You will compare how different approaches may impact how much context is being extracted from a text, and how to balance the need for context, without too many features being created.

Exercise 1: Encoding text
Exercise 2: Cleaning your text
Exercise 3: High-level text features
Exercise 4: Word counts
Exercise 5: Counting words (I)
Exercise 6: Counting words (II)
Exercise 7: Limiting your features (current exercise)
Exercise 8: Text to DataFrame
Exercise 9: Term frequency-inverse document frequency
Exercise 10: Tf-idf
Exercise 11: Inspecting Tf-idf values
Exercise 12: Transforming unseen data
Exercise 13: N-grams
Exercise 14: Using longer n-grams
Exercise 15: Finding the most common words
Exercise 16: Wrap-up