Exercise

Build the winning model

You have arrived! This is where all of your hard work pays off. It's time to build the model that won DrivenData's competition.

You've constructed a robust, powerful pipeline capable of processing training and testing data. Now that you understand the data and know all of the tools you need, you can solve the whole problem in a relatively small number of lines of code. Wow!

All you need to do is add the HashingVectorizer step to the pipeline to replace the CountVectorizer step.

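The parameters non_negative=True, norm=None, and binary=False make the HashingVectorizer behave much like a CountVectorizer with default settings, so you can simply swap one for the other. Note that non_negative was removed in scikit-learn 0.21; on newer versions, alternate_sign=False is the closest replacement.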
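To make the "perform similarly" claim concrete, here is a minimal sketch that runs both vectorizers on a few made-up strings. The sample documents are hypothetical, and the sketch assumes a recent scikit-learn release, so it passes alternate_sign=False where this exercise uses non_negative=True.

```python
from sklearn.feature_extraction.text import CountVectorizer, HashingVectorizer

# Hypothetical sample documents, not taken from the competition data
docs = ["teacher salary", "teacher supplies and books", "bus transportation"]

# CountVectorizer learns a vocabulary, so it must be fit first
count_vec = CountVectorizer()
count_matrix = count_vec.fit_transform(docs)

# HashingVectorizer is stateless: tokens are hashed straight to column indices.
# alternate_sign=False stands in for non_negative=True on newer scikit-learn.
hash_vec = HashingVectorizer(norm=None, binary=False, alternate_sign=False)
hash_matrix = hash_vec.transform(docs)

# Column counts differ: vocabulary size versus a fixed number of hash buckets
print(count_matrix.shape)
print(hash_matrix.shape)

# Token counts per document agree, because both store raw counts
print(count_matrix.sum(axis=1).tolist())
print(hash_matrix.sum(axis=1).tolist())
```

The two matrices have different widths, but each row holds the same raw token counts, which is why one step can stand in for the other in a pipeline.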

Instructions

  • Import HashingVectorizer from sklearn.feature_extraction.text.
  • Add a HashingVectorizer step to the pipeline.
    • Name the step 'vectorizer'.
    • Use the TOKENS_ALPHANUMERIC token pattern.
    • Specify the ngram_range to be (1, 2).
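The steps above can be sketched roughly as follows. This is a simplified stand-in, not the full exercise pipeline: the value of TOKENS_ALPHANUMERIC, the toy texts and labels, and the plain LogisticRegression classifier are all assumptions made for illustration, and alternate_sign=False is used in place of non_negative=True so the sketch runs on recent scikit-learn versions.

```python
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Assumed pattern: alphanumeric tokens that are followed by whitespace
TOKENS_ALPHANUMERIC = r'[A-Za-z0-9]+(?=\s+)'

# Hypothetical toy data standing in for the real training set
texts = [
    "teacher salary regular instruction ",
    "teacher salary special education ",
    "bus fuel student transportation ",
    "bus driver wages transportation ",
]
labels = [0, 0, 1, 1]

pl = Pipeline([
    # The step is named 'vectorizer', as the instructions ask
    ('vectorizer', HashingVectorizer(token_pattern=TOKENS_ALPHANUMERIC,
                                     norm=None,
                                     binary=False,
                                     alternate_sign=False,
                                     ngram_range=(1, 2))),
    # Stand-in classifier; the exercise pipeline has more steps than this
    ('clf', LogisticRegression()),
])

pl.fit(texts, labels)
preds = pl.predict(texts)
print(preds)
```

Because the hashing step needs no vocabulary, the same fitted pipeline can be applied to unseen text without any out-of-vocabulary handling.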