Exercise

Selecting the ideal dataset

Now it's time to get rid of some of the unnecessary features in the ufo dataset. Because the country column has been encoded as country_enc, you can keep country_enc and drop the other columns related to location: city, country, lat, long, and state.

You've engineered the month and year columns, so you no longer need the date or recorded columns. You also standardized the seconds column as seconds_log, so you can drop seconds and minutes.

You vectorized desc, so it can be removed. For now you'll keep type.

You can also get rid of the length_of_time column, which is unnecessary after extracting minutes.
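The column choices described above can be sketched with pandas as follows. The one-row ufo frame is an invented stand-in for the real dataset (its values are placeholders, and the real column order may differ), and ufo_dropped is a hypothetical name for the result.

```python
import pandas as pd

# Toy stand-in for the ufo DataFrame; every value here is invented.
ufo = pd.DataFrame({
    "date": ["2011-11-03 19:21:00"],
    "city": ["woodville"],
    "state": ["wi"],
    "country": ["us"],
    "type": ["light"],
    "seconds": [120.0],
    "length_of_time": ["2 minutes"],
    "desc": ["bright light hovering over the lake"],
    "recorded": ["2011-11-06"],
    "lat": [44.95],
    "long": [-92.29],
    "minutes": [2.0],
    "seconds_log": [4.79],
    "country_enc": [1],
    "month": [11],
    "year": [2011],
})

# Location columns, raw date columns, raw duration columns,
# the vectorized text column, and the original duration string.
to_drop = [
    "city", "country", "lat", "long", "state",
    "date", "recorded",
    "seconds", "minutes",
    "desc",
    "length_of_time",
]

# drop() returns a new DataFrame; ufo itself is left unchanged.
ufo_dropped = ufo.drop(columns=to_drop)
print(ufo_dropped.columns.tolist())
```

Only type, seconds_log, country_enc, month, and year remain in the toy frame after the drop.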

Instructions

  • Make a list of all the columns to drop, to_drop.
  • Drop these columns from ufo.
  • Use the words_to_filter() function you created previously: pass in vocab, vec.vocabulary_, and desc_tfidf, then 4 as the last argument to keep the top 4 words.
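The words_to_filter() function comes from an earlier exercise that is not shown here, so the version below is only a hypothetical re-creation that matches the argument order in the last instruction. It assumes vec is a fitted scikit-learn TfidfVectorizer, that vocab maps column index to word (the inverse of vec.vocabulary_), and that desc_tfidf is the sparse matrix returned by fit_transform. The three descriptions are invented sample text.

```python
from sklearn.feature_extraction.text import TfidfVectorizer


def words_to_filter(vocab, original_vocab, vector, top_n):
    """Hypothetical sketch: return the column indices of the top_n
    highest-weighted words in each row of a tf-idf matrix."""
    vector = vector.tocsr()
    keep = set()
    for i in range(vector.shape[0]):
        # Slice out the non-zero entries that belong to row i.
        start, end = vector.indptr[i], vector.indptr[i + 1]
        weights = dict(zip(vector.indices[start:end], vector.data[start:end]))
        # Column indices of the top_n largest weights in this row.
        top = sorted(weights, key=weights.get, reverse=True)[:top_n]
        # Go index -> word -> index so both vocabularies are used.
        keep.update(int(original_vocab[vocab[j]]) for j in top)
    return keep


# Invented sample descriptions standing in for the desc column.
docs = [
    "bright light hovering over the lake",
    "silent triangle moving fast across the sky",
    "bright orange fireball in the sky",
]

vec = TfidfVectorizer()
desc_tfidf = vec.fit_transform(docs)

# Inverse vocabulary: column index -> word.
vocab = {idx: word for word, idx in vec.vocabulary_.items()}

# Keep the top 4 words per description.
filtered_words = words_to_filter(vocab, vec.vocabulary_, desc_tfidf, 4)

# Use the returned indices to keep only those columns of the matrix.
filtered_text = desc_tfidf[:, sorted(filtered_words)]
print(filtered_text.shape)
```

Returning a set means a word that ranks highly in several descriptions is kept only once, so the filtered matrix can have fewer than 4 columns per row in total.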