N-gram range in scikit-learn
In this exercise you'll insert a CountVectorizer instance into your pipeline for the main dataset, and compute multiple n-gram features to be used in the model.
In order to look for n-gram relationships at multiple scales, you will use the ngram_range parameter as Peter discussed in the video.
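As a quick illustration of what ngram_range controls, here is a minimal sketch on a made-up two-document corpus (not the budget data): with (1, 1) the vectorizer keeps only single tokens, while with (1, 2) it also keeps adjacent token pairs as features. On older scikit-learn versions, use get_feature_names() instead of get_feature_names_out().
# Minimal sketch: how ngram_range changes the features CountVectorizer builds.
# The two documents below are made up for illustration only.
from sklearn.feature_extraction.text import CountVectorizer

docs = ["supplies for art teacher", "art supplies for classroom"]

# ngram_range=(1, 1): unigrams only
unigrams = CountVectorizer(ngram_range=(1, 1)).fit(docs)
print(unigrams.get_feature_names_out())
# ['art' 'classroom' 'for' 'supplies' 'teacher']

# ngram_range=(1, 2): unigrams and bigrams together
uni_bi = CountVectorizer(ngram_range=(1, 2)).fit(docs)
print(uni_bi.get_feature_names_out())
# now also includes bigrams such as 'art teacher' and 'supplies for'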
Special functions: You'll notice a couple of new steps provided in the pipeline in this and many of the remaining exercises. Specifically, the dim_red step following the vectorizer step, and the scale step preceding the clf (classification) step.
These have been added in order to account for the fact that you're using a reduced-size sample of the full dataset in this course. To make sure the models perform as the expert competition winner intended, we have to apply a dimensionality reduction technique, which is what the dim_red step does, and we have to scale the features to lie between -1 and 1, which is what the scale step does.
The dim_red step uses scikit-learn's SelectKBest(), applying the chi-squared test to select the K "best" features. The scale step uses scikit-learn's MaxAbsScaler() to squash the relevant features into the interval -1 to 1.
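For intuition, here is a minimal sketch of those two steps on a tiny made-up count matrix (not the budget data): SelectKBest(chi2, k=2) keeps the two features with the highest chi-squared scores against the labels, and MaxAbsScaler() divides each remaining column by its maximum absolute value so every value lands in the interval -1 to 1.
# Minimal sketch of the dim_red and scale steps, on toy data.
import numpy as np
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.preprocessing import MaxAbsScaler

X = np.array([[1, 0, 3],
              [0, 2, 1],
              [4, 0, 2]])   # toy non-negative count features
y = np.array([0, 1, 0])     # toy labels

# Keep the 2 features with the highest chi-squared scores
X_reduced = SelectKBest(chi2, k=2).fit_transform(X, y)

# Rescale each kept feature by its maximum absolute value, into [-1, 1]
X_scaled = MaxAbsScaler().fit_transform(X_reduced)
print(X_scaled)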
You won't need to do anything extra with these functions here; just complete the vectorizing pipeline steps below. However, notice how easy it was to add more processing steps to our pipeline!
This exercise is part of the course “Case Study: School Budgeting with Machine Learning in Python”.
Exercise instructions
- Import CountVectorizer from sklearn.feature_extraction.text.
- Add a CountVectorizer step to the pipeline with the name 'vectorizer'.
  - Set the token pattern to be TOKENS_ALPHANUMERIC.
  - Set the ngram_range to be (1, 2).
Hands-on interactive exercise
Have a go at this exercise by completing this sample code.
# Import pipeline
from sklearn.pipeline import Pipeline
# Import classifiers
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
# Import CountVectorizer
____
# Import other preprocessing modules
from sklearn.preprocessing import Imputer
from sklearn.feature_selection import chi2, SelectKBest
# Select 300 best features
chi_k = 300
# Import functional utilities
from sklearn.preprocessing import FunctionTransformer, MaxAbsScaler
from sklearn.pipeline import FeatureUnion
# Perform preprocessing
get_text_data = FunctionTransformer(combine_text_columns, validate=False)
get_numeric_data = FunctionTransformer(lambda x: x[NUMERIC_COLUMNS], validate=False)
# Create the token pattern: TOKENS_ALPHANUMERIC
TOKENS_ALPHANUMERIC = '[A-Za-z0-9]+(?=\\s+)'
# Instantiate pipeline: pl
pl = Pipeline([
('union', FeatureUnion(
transformer_list = [
('numeric_features', Pipeline([
('selector', get_numeric_data),
('imputer', Imputer())
])),
('text_features', Pipeline([
('selector', get_text_data),
('vectorizer', ____(____=____,
____=____)),
('dim_red', SelectKBest(chi2, chi_k))
]))
]
)),
('scale', MaxAbsScaler()),
('clf', OneVsRestClassifier(LogisticRegression()))
])
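For reference, one possible completion of the blanks, consistent with the instructions above, is the import and the 'vectorizer' step shown here:
# Import CountVectorizer
from sklearn.feature_extraction.text import CountVectorizer

# ... and, inside the text_features sub-pipeline:
('vectorizer', CountVectorizer(token_pattern=TOKENS_ALPHANUMERIC,
                               ngram_range=(1, 2))),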