Selecting the ideal dataset
Now it's time to get rid of some of the unnecessary features in the ufo dataset. Because the country column has been encoded as country_enc, you can keep country_enc and drop the other location-related columns: city, country, lat, long, and state.
You've engineered the month and year columns, so you no longer need the date or recorded columns. You also standardized the seconds column as seconds_log, so you can drop seconds and minutes.
You vectorized desc, so it can be removed. For now you'll keep type.
You can also get rid of the length_of_time column, which is unnecessary after extracting minutes.
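The column choices described above can be sketched on a toy DataFrame. This is only an illustration: the two rows of data are invented, and the real ufo dataset may contain additional columns. The column names are the ones mentioned in the text.

```python
import pandas as pd

# Toy stand-in for the ufo DataFrame; every value here is invented for illustration.
ufo = pd.DataFrame({
    "city": ["seattle", "austin"],
    "country": ["us", "us"],
    "state": ["wa", "tx"],
    "lat": [47.6, 30.3],
    "long": [-122.3, -97.7],
    "date": ["2011-11-03 19:21:00", "2004-06-12 22:00:00"],
    "recorded": ["12/12/2011", "6/23/2004"],
    "seconds": [1209600.0, 30.0],
    "minutes": [20160.0, 0.5],
    "length_of_time": ["2 weeks", "30sec."],
    "desc": ["red lights in the sky", "bright orb hovering"],
    "type": ["light", "sphere"],
    "country_enc": [1, 1],
    "month": [11, 6],
    "year": [2011, 2004],
    "seconds_log": [14.0, 3.4],
})

# Columns made redundant by the encoded, engineered, and vectorized features
to_drop = [
    "city", "country", "lat", "long", "state",   # location, replaced by country_enc
    "date", "recorded",                          # replaced by month and year
    "seconds", "minutes",                        # replaced by seconds_log
    "desc",                                      # replaced by the tf-idf vector
    "length_of_time",                            # replaced by minutes
]

# drop() returns a new DataFrame and leaves ufo itself unchanged
ufo_dropped = ufo.drop(columns=to_drop)
print(ufo_dropped.columns.tolist())
```

Passing `columns=to_drop` is equivalent to `ufo.drop(to_drop, axis=1)`; either form fits the blank in the exercise template.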
This exercise is part of the course
Preprocessing for Machine Learning in Python
Exercise instructions
- Make a list of all the columns to drop, to_drop.
- Drop these columns from ufo.
- Use the words_to_filter() function you created previously; pass in vocab, vec.vocabulary_, desc_tfidf, and keep the top 4 words as the last parameter.
Hands-on interactive exercise
Have a go at this exercise by completing this sample code.
# Make a list of features to drop
to_drop = [____]
# Drop those features
ufo_dropped = ufo.____
# Let's also filter some words out of the text vector we created
filtered_words = ____(____, ____, ____, ____)
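For readers without the earlier exercises at hand, here is a minimal sketch of how the word-filtering step might look end to end. The body of words_to_filter() below is an assumed reconstruction, not the course's own definition: it is taken to return the set of column indices of the top-weighted words in each row. The sample descriptions and the way vocab is built (an index-to-word inversion of vec.vocabulary_) are also assumptions.

```python
from sklearn.feature_extraction.text import TfidfVectorizer


def words_to_filter(vocab, original_vocab, vector, top_n):
    """Hypothetical reconstruction: collect the column indices of the
    top_n highest-weighted words in each row of a CSR tf-idf matrix."""
    keep = set()
    for i in range(vector.shape[0]):
        # Slice out the non-zero entries of row i from the CSR arrays
        start, end = vector.indptr[i], vector.indptr[i + 1]
        cols = vector.indices[start:end]
        weights = vector.data[start:end]
        # Rank this row's words by tf-idf weight, highest first
        ranked = sorted(zip(cols, weights), key=lambda cw: cw[1], reverse=True)
        for col, _ in ranked[:top_n]:
            word = vocab[int(col)]          # column index -> word
            keep.add(original_vocab[word])  # word -> column index
    return keep


# Invented sample descriptions standing in for the desc column
desc = [
    "red lights moving in the sky",
    "bright orb hovering over the lake",
    "triangle of lights in formation",
]

vec = TfidfVectorizer()
desc_tfidf = vec.fit_transform(desc)

# Assumed definition of vocab: the inverse of vec.vocabulary_
vocab = {idx: word for word, idx in vec.vocabulary_.items()}

filtered_words = words_to_filter(vocab, vec.vocabulary_, desc_tfidf, 4)

# Keep only the selected columns of the text vector
filtered_matrix = desc_tfidf[:, sorted(filtered_words)]
print(filtered_matrix.shape)
```

The returned set can be used to subset the tf-idf matrix, as in the last lines, so that only the most informative words per description are passed on to a model.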