1
Exploring the raw data
Gratuito
In this chapter, you'll be introduced to the problem you'll be solving in this course. How do you accurately classify line-items in a school budget based on what that money is being used for? You will explore the raw text and numeric values in the dataset, both quantitatively and visually. And you'll learn how to measure success when trying to predict class labels for each row of the dataset.
2
Creating a simple first model
In this chapter, you'll build a first-pass model. You'll use numeric data only to train the model. Spoiler alert - throwing out all of the text data is bad for performance! But you'll learn how to format your predictions. Then, you'll be introduced to natural language processing (NLP) in order to start working with the large amounts of text in the data.
3
Improving your model
Here, you'll improve on your benchmark model using pipelines. Because the budget consists of both text and numeric data, you'll learn to how build pipielines that process multiple types of data. You'll also explore how the flexibility of the pipeline workflow makes testing different approaches efficient, even in complicated problems like this one!
4
Learning from the experts
In this chapter, you will learn the tricks used by the competition winner, and implement them yourself using scikit-learn. Enjoy!

Initializing

Preprocessing text features

Here, you'll perform a similar preprocessing pipeline step, only this time you'll use the text column from the sample data.

To preprocess the text, you'll turn to CountVectorizer() to generate a bag-of-words representation of the data, as in Chapter 2. Using the default arguments, add a (step, transform) tuple to the steps list in your pipeline.

Make sure you select only the text column for splitting your training and test sets.

As usual, your sample_df is ready and waiting in the workspace.

Import CountVectorizer from sklearn.feature_extraction.text.
Create training and test sets by selecting the correct subset of sample_df: 'text'.
Add the CountVectorizer step (with the name 'vec') to the correct position in the pipeline.
Fit the pipeline to the training data and compute its accuracy.