N-gram range in scikit-learn

In this exercise, you'll insert a CountVectorizer instance into your pipeline for the main dataset and compute multiple n-gram features to use in the model.

To look for n-gram relationships at multiple scales, you will use the ngram_range parameter, as Peter discussed in the video.
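To see what ngram_range does, here is a minimal standalone sketch (the two-document corpus is made up for illustration; get_feature_names() is the course-era API, renamed get_feature_names_out() in newer scikit-learn releases):

from sklearn.feature_extraction.text import CountVectorizer

# Tiny made-up corpus, just to show the effect of ngram_range
docs = ["teacher salary bonus", "annual teacher salary"]

# Unigrams only: one feature per distinct token
unigrams = CountVectorizer(ngram_range=(1, 1)).fit(docs)
print(unigrams.get_feature_names())
# ['annual', 'bonus', 'salary', 'teacher']

# Unigrams and bigrams: adjacent token pairs become features too
uni_bi = CountVectorizer(ngram_range=(1, 2)).fit(docs)
print(uni_bi.get_feature_names())
# ['annual', 'annual teacher', 'bonus', 'salary',
#  'salary bonus', 'teacher', 'teacher salary']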

Special functions: You'll notice a couple of new steps in the pipeline in this and many of the remaining exercises. Specifically, the dim_red step following the vectorizer step, and the scale step preceding the clf (classification) step.

These have been added to account for the fact that you're using a reduced-size sample of the full dataset in this course. To make sure the models perform as the expert competition winner intended, we have to apply a dimensionality reduction technique, which is what the dim_red step does, and we have to scale the features to lie between -1 and 1, which is what the scale step does.

The dim_red step uses a scikit-learn transformer called SelectKBest(), which applies the chi-squared test to select the K "best" features. The scale step uses a scikit-learn transformer called MaxAbsScaler() to squash the features into the interval -1 to 1.
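To build intuition for these two steps in isolation, here is a minimal sketch on made-up data (the matrix X and labels y below are illustrative, not from the budget dataset):

import numpy as np
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.preprocessing import MaxAbsScaler

# Made-up non-negative feature matrix (chi2 requires non-negative input)
X = np.array([[1.0, 0.0, 3.0],
              [2.0, 0.0, 0.0],
              [0.0, 4.0, 1.0]])
y = np.array([0, 0, 1])

# Keep the 2 features most associated with y by the chi-squared test
X_reduced = SelectKBest(chi2, k=2).fit_transform(X, y)

# Divide each column by its maximum absolute value,
# so every value lands in the interval [-1, 1]
X_scaled = MaxAbsScaler().fit_transform(X_reduced)
print(X_scaled)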

You won't need to do anything extra with these steps here; just complete the vectorizing pipeline steps below. Notice, however, how easy it is to add more processing steps to the pipeline!

Exercise instructions

  • Import CountVectorizer from sklearn.feature_extraction.text.
  • Add a CountVectorizer step to the pipeline with the name 'vectorizer'.
    • Set the token pattern to be TOKENS_ALPHANUMERIC.
    • Set the ngram_range to be (1, 2).

Hands-on interactive exercise

Have a go at this exercise by completing this sample code.

# Import pipeline
from sklearn.pipeline import Pipeline

# Import classifiers
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

# Import CountVectorizer
from sklearn.feature_extraction.text import CountVectorizer

# Import other preprocessing modules
from sklearn.preprocessing import Imputer
from sklearn.feature_selection import chi2, SelectKBest

# Select 300 best features
chi_k = 300

# Import functional utilities
from sklearn.preprocessing import FunctionTransformer, MaxAbsScaler
from sklearn.pipeline import FeatureUnion

# Perform preprocessing
get_text_data = FunctionTransformer(combine_text_columns, validate=False)
get_numeric_data = FunctionTransformer(lambda x: x[NUMERIC_COLUMNS], validate=False)

# Create the token pattern: TOKENS_ALPHANUMERIC
TOKENS_ALPHANUMERIC = '[A-Za-z0-9]+(?=\\s+)'

# Instantiate pipeline: pl
pl = Pipeline([
        ('union', FeatureUnion(
            transformer_list = [
                ('numeric_features', Pipeline([
                    ('selector', get_numeric_data),
                    ('imputer', Imputer())
                ])),
                ('text_features', Pipeline([
                    ('selector', get_text_data),
                    ('vectorizer', CountVectorizer(token_pattern=TOKENS_ALPHANUMERIC,
                                                   ngram_range=(1, 2))),
                    ('dim_red', SelectKBest(chi2, k=chi_k))
                ]))
             ]
        )),
        ('scale', MaxAbsScaler()),
        ('clf', OneVsRestClassifier(LogisticRegression()))
    ])
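
Once the blanks are filled in, the whole pipeline fits and scores like a single estimator. As a hedged usage sketch, assuming train/test splits named X_train, y_train, X_test, and y_test exist in the exercise environment (they are not defined above):

# Fit every step (selection, vectorizing, dimensionality reduction,
# scaling, classification) in one call, then evaluate
pl.fit(X_train, y_train)
print("Accuracy on budget dataset:", pl.score(X_test, y_test))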

This exercise is part of the course

Case Study: School Budgeting with Machine Learning in Python


Learn how to build a model to automatically classify items in a school budget.

In this chapter, you will learn the tricks used by the competition winner, and implement them yourself using scikit-learn. Enjoy!

  • Exercise 1: Learning from the expert: processing
  • Exercise 2: How many tokens?
  • Exercise 3: Deciding what's a word
  • Exercise 4: N-gram range in scikit-learn
  • Exercise 5: Learning from the expert: a stats trick
  • Exercise 6: Which models of the data include interaction terms?
  • Exercise 7: Implement interaction modeling in scikit-learn
  • Exercise 8: Learning from the expert: the winning model
  • Exercise 9: Why is hashing a useful trick?
  • Exercise 10: Implementing the hashing trick in scikit-learn
  • Exercise 11: Build the winning model
  • Exercise 12: What tactics got the winner the best score?
  • Exercise 13: Next steps and the social impact of your work
