Selecting the ideal dataset

Now it's time to get rid of some of the unnecessary features in the ufo dataset. Because the country column has been encoded as country_enc, you can keep country_enc and drop the other location-related columns: city, country, lat, long, and state.

You've engineered the month and year columns, so you no longer need the date or recorded columns. You also log-normalized the seconds column as seconds_log, so you can drop seconds and minutes.

You vectorized desc, so it can be removed. For now you'll keep type.

You can also get rid of the length_of_time column, which is unnecessary after extracting minutes.
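The column selection described above can be sketched on a toy DataFrame. This is only an illustration: the stand-in ufo frame below contains a handful of the columns with made-up values, not the course data, and it uses the standard pandas `DataFrame.drop` call.

```python
import pandas as pd

# Toy stand-in for the ufo DataFrame; the values are invented for illustration.
ufo = pd.DataFrame({
    "city": ["seattle", "austin"],
    "country": ["us", "us"],
    "country_enc": [1, 1],
    "lat": [47.6, 30.3],
    "long": [-122.3, -97.7],
    "state": ["wa", "tx"],
    "seconds": [60.0, 300.0],
    "seconds_log": [4.09, 5.70],
    "type": ["light", "disk"],
})

# Columns made redundant by the encoded and transformed features.
to_drop = ["city", "country", "lat", "long", "state", "seconds"]

# drop() returns a new DataFrame; the original ufo is left unchanged.
ufo_dropped = ufo.drop(columns=to_drop)

print(list(ufo_dropped.columns))
```

Because `drop` does not modify the frame in place by default, the result is assigned to a new name, which keeps the original columns available if you need to revisit a decision.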

This exercise is part of the course

Preprocessing for Machine Learning in Python

Exercise instructions

  • Make a list of all the columns to drop, to_drop.
  • Drop these columns from ufo.
  • Use the words_to_filter() function you created previously; pass in vocab, vec.vocabulary_, and desc_tfidf, with 4 as the last argument to keep the top 4 words.
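To make the word-filtering idea concrete, here is a minimal sketch of keeping only the highest-weighted tf-idf terms per document. The helper `top_word_indices` is hypothetical and is not the words_to_filter() function from the course, whose definition is not shown here; the example texts are also made up. It assumes scikit-learn's `TfidfVectorizer`, which returns a sparse matrix in CSR layout.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

def top_word_indices(tfidf_matrix, top_n):
    """Hypothetical helper: collect the column indices of the top_n
    highest-weighted terms in each row of a sparse tf-idf matrix."""
    csr = tfidf_matrix.tocsr()
    keep = set()
    for row in range(csr.shape[0]):
        # Nonzero entries of this row live in indices/data[start:end].
        start, end = csr.indptr[row], csr.indptr[row + 1]
        pairs = zip(csr.indices[start:end], csr.data[start:end])
        ranked = sorted(pairs, key=lambda p: p[1], reverse=True)
        keep.update(int(col) for col, _ in ranked[:top_n])
    return keep

# Made-up descriptions standing in for the desc column.
docs = [
    "bright light in the sky",
    "light over the lake",
    "silent disk hovering",
]

vec = TfidfVectorizer()
desc_tfidf = vec.fit_transform(docs)

# Keep only the single strongest term per description.
filtered = top_word_indices(desc_tfidf, top_n=1)

# Subset the text vector to the retained columns.
reduced = desc_tfidf[:, sorted(filtered)]
print(reduced.shape)
```

Returning column indices rather than the words themselves makes it straightforward to slice the sparse matrix directly, which is one reasonable design for reducing the text vector before modeling.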

Hands-on interactive exercise

Have a go at this exercise by completing this sample code.

# Make a list of features to drop
to_drop = [____]

# Drop those features
ufo_dropped = ufo.____

# Let's also filter some words out of the text vector we created
filtered_words = ____(____, ____, ____, ____)