Selecting the ideal dataset
Now it's time to get rid of some of the unnecessary features in the ufo dataset. Because the country column has been encoded as country_enc, you can select it and drop the other columns related to location: city, country, lat, long, and state.
You've engineered the month and year columns, so you no longer need the date or recorded columns. You also standardized the seconds column as seconds_log, so you can drop seconds and minutes.
You vectorized desc, so it can be removed. For now you'll keep type.
You can also get rid of the length_of_time column, which is unnecessary after extracting minutes.
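The column selection described above can be sketched on a toy frame. This is only an illustration: the one-row ufo DataFrame below is invented, and its column names are assumed to match the ones mentioned in the text. The drop itself uses the standard pandas DataFrame.drop method with the columns keyword.

```python
import pandas as pd

# Toy stand-in for the ufo DataFrame; the values are invented for illustration.
ufo = pd.DataFrame({
    "date": ["2011-11-03 19:21:00"],
    "city": ["woodville"],
    "state": ["wi"],
    "country": ["us"],
    "type": ["unknown"],
    "seconds": [1209.0],
    "length_of_time": ["20 minutes"],
    "desc": ["Red blinking objects"],
    "recorded": ["12/12/2011"],
    "lat": [44.95],
    "long": [-92.29],
    "minutes": [20.0],
    "seconds_log": [7.097],
    "country_enc": [1],
    "month": [11],
    "year": [2011],
})

# Columns made redundant by the earlier encoding and feature engineering steps
to_drop = ["city", "country", "lat", "long", "state",
           "date", "recorded", "seconds", "minutes",
           "desc", "length_of_time"]

# drop() returns a new DataFrame; the original ufo is left unchanged
ufo_dropped = ufo.drop(columns=to_drop)

print(list(ufo_dropped.columns))
```

Passing columns=to_drop is equivalent to ufo.drop(to_drop, axis=1); either form leaves type, seconds_log, country_enc, month, and year in place.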
This exercise is part of the course Preprocessing for Machine Learning in Python.
Exercise instructions
- Make a list of all the columns to drop, to_drop.
- Drop these columns from ufo.
- Use the words_to_filter() function you created previously; pass in vocab, vec.vocabulary_, desc_tfidf, and keep the top 4 words as the last parameter.
Hands-on interactive exercise
Have a go at this exercise by completing this sample code.
# Make a list of features to drop
to_drop = [____]
# Drop those features
ufo_dropped = ufo.____
# Let's also filter some words out of the text vector we created
filtered_words = ____(____, ____, ____, ____)
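The word-filtering step can be sketched as follows. The exercise does not show the body of words_to_filter(), so the version below is an assumed reimplementation: it keeps, for each description, the column indices of the top_n words with the highest tf-idf weight. The three sample descriptions are invented, and vocab is assumed to be the index-to-word inverse of vec.vocabulary_.

```python
from sklearn.feature_extraction.text import TfidfVectorizer


def words_to_filter(vocab, original_vocab, vector, top_n):
    """Assumed sketch: collect the column indices of the top_n
    highest-weighted words in every row of a tf-idf matrix."""
    vector = vector.tocsr()
    keep = set()
    for i in range(vector.shape[0]):
        # Slice the CSR arrays to get this row's nonzero columns and weights
        start, end = vector.indptr[i], vector.indptr[i + 1]
        weights = {int(col): float(w) for col, w in
                   zip(vector.indices[start:end], vector.data[start:end])}
        # Column indices sorted by weight, highest first
        top_cols = sorted(weights, key=weights.get, reverse=True)[:top_n]
        # Map index -> word -> index so both vocabularies are exercised
        keep.update(int(original_vocab[vocab[col]]) for col in top_cols)
    return keep


# Invented sample descriptions standing in for ufo["desc"]
desc = ["bright light hovering over the lake",
        "red light moving fast across the sky",
        "silent triangle hovering low"]

vec = TfidfVectorizer()
desc_tfidf = vec.fit_transform(desc)

# Inverse lookup: column index -> word
vocab = {idx: word for word, idx in vec.vocabulary_.items()}

filtered_words = words_to_filter(vocab, vec.vocabulary_, desc_tfidf, 4)

# Keep only the selected columns of the text vector
filtered_text = desc_tfidf[:, sorted(filtered_words)]
print(filtered_text.shape)
```

Returning a set removes duplicates when the same word ranks highly in several descriptions, so the filtered matrix has at most top_n columns per row of input.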