Inspecting the vectors
To get a better idea of how the vectors work, you'll investigate them by converting them into pandas
DataFrames.
Here, you'll use the same data structures you created in the previous two exercises (count_train, count_vectorizer, tfidf_train, tfidf_vectorizer) as well as pandas, which is imported as pd.
This exercise is part of the course
Introduction to Natural Language Processing in Python
Exercise instructions
- Create the DataFrames count_df and tfidf_df by using pd.DataFrame() and specifying the values as the first argument and the columns (or features) as the second argument.
  - The values can be accessed by using the .A attribute of count_train and tfidf_train, respectively.
  - The columns can be accessed using the .get_feature_names() methods of count_vectorizer and tfidf_vectorizer.
- Print the head of each DataFrame to investigate their structure. This has been done for you.
- Test if the column names are the same for each DataFrame by creating a new object called difference that holds the columns count_df has but tfidf_df does not. Columns can be accessed using the .columns attribute of a DataFrame. Subtract the set of tfidf_df.columns from the set of count_df.columns.
- Test if the two DataFrames are equivalent by using the .equals() method on count_df with tfidf_df as the argument.
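To make the first step concrete, here is a minimal sketch of turning vectorizer output into DataFrames. It is not the course's dataset: the three-sentence corpus docs and the helper feature_names are hypothetical, introduced only for illustration. The sketch uses .toarray(), which yields the same dense array as the .A shorthand, and it falls back to .get_feature_names() only when .get_feature_names_out() is unavailable, since newer scikit-learn releases (1.2 and later) no longer provide the older method.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Hypothetical toy corpus standing in for the course's training text
docs = ["the cat sat", "the dog sat", "the cat ran"]

# Fit each vectorizer and keep the resulting sparse matrices
count_vectorizer = CountVectorizer()
count_train = count_vectorizer.fit_transform(docs)

tfidf_vectorizer = TfidfVectorizer()
tfidf_train = tfidf_vectorizer.fit_transform(docs)


def feature_names(vectorizer):
    """Return the vocabulary in column order (hypothetical helper)."""
    if hasattr(vectorizer, "get_feature_names_out"):
        return list(vectorizer.get_feature_names_out())
    # Older scikit-learn releases only offer get_feature_names()
    return vectorizer.get_feature_names()


# Values first, column labels second: one row per document, one column per token
count_df = pd.DataFrame(count_train.toarray(), columns=feature_names(count_vectorizer))
tfidf_df = pd.DataFrame(tfidf_train.toarray(), columns=feature_names(tfidf_vectorizer))

print(count_df.head())
print(tfidf_df.head())
```

Both frames share the same shape and column labels here because both vectorizers learn their vocabulary from the same text with default settings; only the cell values differ (integer counts versus weighted floats).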
Hands-on interactive exercise
Have a go at this exercise by completing this sample code.
# Create the CountVectorizer DataFrame: count_df
count_df = ____(____, columns=____)
# Create the TfidfVectorizer DataFrame: tfidf_df
tfidf_df = ____
# Print the head of count_df
print(count_df.head())
# Print the head of tfidf_df
print(tfidf_df.head())
# Calculate the difference in columns: difference
difference = set(____) - set(____)
print(difference)
# Check whether the DataFrames are equal
print(____)
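
For intuition about the last two checks, the following is a small sketch on hand-made frames rather than the exercise's data. The values are invented, and the names count_df and tfidf_df are reused here only as hypothetical stand-ins. It shows that two DataFrames can have identical column labels, giving an empty set difference, while still failing .equals() because their values differ.

```python
import pandas as pd

# Hypothetical stand-ins: same column labels, different values (made-up numbers)
count_df = pd.DataFrame([[1, 0, 1], [0, 1, 1]], columns=["cat", "dog", "the"])
tfidf_df = pd.DataFrame(
    [[0.79, 0.0, 0.61], [0.0, 0.79, 0.61]], columns=["cat", "dog", "the"]
)

# Columns present in count_df but missing from tfidf_df
difference = set(count_df.columns) - set(tfidf_df.columns)
print(difference)  # set()

# .equals() compares shape, labels, dtypes, and every cell value
print(count_df.equals(tfidf_df))  # False
```

Note that the set difference is one-directional: it would not reveal columns that exist only in tfidf_df. Swapping the operands checks the other direction.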