Tf-idf

Although counting word occurrences can be useful for building models, words that occur many times can skew the results in undesirable ways. To keep these common words from dominating your model, you can apply a form of normalization. In this lesson you will use term frequency-inverse document frequency (Tf-idf), as explained in the video. Tf-idf reduces the weight of words that are common across documents and increases the weight of words that appear in only a few documents.
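To make the weighting concrete, here is a minimal sketch on a made-up three-sentence corpus (the sentences and variable names are illustrative assumptions, not course data). It relies on scikit-learn's default settings, where the inverse document frequency is smoothed as idf(t) = ln((1 + n) / (1 + df(t))) + 1 and each row is then L2-normalized.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus: "the" appears in every document,
# while "cat" appears in only one.
docs = [
    "the cat sat",
    "the dog barked",
    "the bird sang",
]

tv = TfidfVectorizer()
weights = tv.fit_transform(docs).toarray()
vocab = tv.vocabulary_  # maps each term to its column index

# Compare the two words within the first document, where each occurs once.
the_w = weights[0, vocab["the"]]
cat_w = weights[0, vocab["cat"]]

print(f"weight of 'the': {the_w:.3f}")
print(f"weight of 'cat': {cat_w:.3f}")
```

Both words have the same raw count in the first sentence, yet "cat" ends up with the larger weight because it occurs in fewer documents.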

This exercise is part of the course

Feature Engineering for Machine Learning in Python

Exercise instructions

Import TfidfVectorizer from sklearn.feature_extraction.text.
Instantiate TfidfVectorizer, limiting the number of features to 100 and removing English stop words.
Fit and apply the vectorizer to the text_clean column in a single step.
Create a DataFrame tv_df containing the word weights, using the feature names as the column names.

Hands-on interactive exercise

Try to solve this exercise by completing the sample code.

# Import TfidfVectorizer
____

# Instantiate TfidfVectorizer
tv = ____

# Fit the vectorizer and transform the data
tv_transformed = ____(speech_df['text_clean'])

# Create a DataFrame with these features
tv_df = pd.DataFrame(tv_transformed.____, 
                     columns=tv.____).add_prefix('TFIDF_')
print(tv_df.head())

In this chapter, you will explore what feature engineering is and how to get started with applying it to real-world data. You will load, explore and visualize a survey response dataset, and in doing so you will learn about its underlying data types and why they have an influence on how you should engineer your features. Using the pandas package you will create new features from both categorical and continuous columns.

Exercise 1: Why generate features?
Exercise 2: Getting to know your data
Exercise 3: Selecting specific data types
Exercise 4: Dealing with categorical features
Exercise 5: One-hot encoding and dummy variables
Exercise 6: Dealing with uncommon categories
Exercise 7: Numeric variables
Exercise 8: Binarizing columns
Exercise 9: Binning values

This chapter introduces you to the reality of messy and incomplete data. You will learn how to find where your data has missing values and explore multiple approaches on how to deal with them. You will also use string manipulation techniques to deal with unwanted characters in your dataset.

Exercise 1: Why do missing values exist?
Exercise 2: How sparse is my data?
Exercise 3: Finding the missing values
Exercise 4: Dealing with missing values (I)
Exercise 5: Listwise deletion
Exercise 6: Replacing missing values with constants
Exercise 7: Dealing with missing values (II)
Exercise 8: Filling continuous missing values
Exercise 9: Imputing values in predictive models
Exercise 10: Dealing with other data issues
Exercise 11: Dealing with stray characters (I)
Exercise 12: Dealing with stray characters (II)
Exercise 13: Method chaining

In this chapter, you will focus on analyzing the underlying distribution of your data and whether it will impact your machine learning pipeline. You will learn how to deal with skewed data and situations where outliers may be negatively impacting your analysis.

Exercise 1: Data distributions
Exercise 2: What does your data look like? (I)
Exercise 3: What does your data look like? (II)
Exercise 4: When don't you have to transform your data?
Exercise 5: Scaling and transformations
Exercise 6: Normalization
Exercise 7: Standardization
Exercise 8: Log transformation
Exercise 9: When can you use normalization?
Exercise 10: Removing outliers
Exercise 11: Percentage based outlier removal
Exercise 12: Statistical outlier removal
Exercise 13: Scaling and transforming new data
Exercise 14: Train and testing transformations (I)
Exercise 15: Train and testing transformations (II)

Finally, in this chapter, you will work with unstructured text data, understanding ways in which you can engineer columnar features out of a text corpus. You will compare how different approaches may impact how much context is being extracted from a text, and how to balance the need for context, without too many features being created.

Exercise 1: Encoding text
Exercise 2: Cleaning text
Exercise 3: High-level text features
Exercise 4: Word counts
Exercise 5: Counting words (I)
Exercise 6: Counting words (II)
Exercise 7: Limiting your features
Exercise 8: Text to DataFrame
Exercise 9: Term frequency-inverse document frequency
Exercise 10: Tf-idf (current exercise)
Exercise 11: Inspecting Tf-idf values
Exercise 12: Transforming unseen data
Exercise 13: N-grams
Exercise 14: Using longer n-grams
Exercise 15: Finding the most common words
Exercise 16: Recap