In this chapter, you'll be introduced to the problem you'll be solving in this course. How do you accurately classify line-items in a school budget based on what that money is being used for? You will explore the raw text and numeric values in the dataset, both quantitatively and visually. And you'll learn how to measure success when trying to predict class labels for each row of the dataset.
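
To make the idea of measuring success concrete, here is a minimal sketch of log loss for a single binary label, written with NumPy. The function name `compute_log_loss`, the clipping constant `eps`, and the toy arrays are assumptions made for illustration; they are not taken from the course exercises.

```python
import numpy as np


def compute_log_loss(predicted, actual, eps=1e-14):
    """Binary log loss averaged over all rows.

    eps is an assumed clipping constant that keeps log() away from 0 and 1.
    """
    predicted = np.clip(np.asarray(predicted, dtype=float), eps, 1 - eps)
    actual = np.asarray(actual, dtype=float)
    return -np.mean(
        actual * np.log(predicted) + (1 - actual) * np.log(1 - predicted)
    )


# Toy labels (made up for this example)
actual = np.array([1, 0, 1, 0])

confident_right = compute_log_loss([0.95, 0.05, 0.95, 0.05], actual)
unsure = compute_log_loss([0.5, 0.5, 0.5, 0.5], actual)
confident_wrong = compute_log_loss([0.05, 0.95, 0.05, 0.95], actual)

print(confident_right, unsure, confident_wrong)
```

Lower is better: confident and correct predictions score close to zero, while confident and wrong predictions are penalized far more heavily than unsure ones.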

Introducing the challenge

What category of problem is this?

What is the goal of the algorithm?

Exploring the data

Loading the data

Summarizing the data

Looking at the datatypes

Exploring datatypes in pandas

Encode the labels as categorical variables

Counting unique labels

How do we measure success?

Penalizing highly confident wrong answers

Computing log loss with NumPy

Exploring the raw data

In this chapter, you'll build a first-pass model. You'll use numeric data only to train the model. Spoiler alert - throwing out all of the text data is bad for performance! But you'll learn how to format your predictions. Then, you'll be introduced to natural language processing (NLP) in order to start working with the large amounts of text in the data.
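
As a rough sketch of what a numeric-only first pass could look like, the example below splits a small synthetic dataset, fits a classifier, and formats the predicted probabilities as one column per class. The column names (`amount`, `hours`), the labels, and the choice of `LogisticRegression` are assumptions for illustration, not the course's actual data or model.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 200

# Synthetic stand-in for budget rows; column names and labels are hypothetical.
df = pd.DataFrame({
    "amount": rng.normal(1000, 300, n),
    "hours": rng.normal(0.5, 0.2, n),
})
df["label"] = np.where(df["amount"] > 1000, "salary", "supplies")

# Hold out 20% of the rows to check the model on data it has not seen
X_train, X_test, y_train, y_test = train_test_split(
    df[["amount", "hours"]], df["label"], test_size=0.2, random_state=0
)

clf = LogisticRegression(max_iter=1000)
clf.fit(X_train, y_train)

# One probability column per class label, indexed like the holdout rows
probs = pd.DataFrame(
    clf.predict_proba(X_test), columns=clf.classes_, index=X_test.index
)

# to_csv() with no path returns the CSV text instead of writing a file
csv_text = probs.to_csv()
print(csv_text.splitlines()[0])
```

Passing a file path to `to_csv` would write the same table to disk for submission.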

It's time to build a model

Setting up a train-test split in scikit-learn

Training a model

Making predictions

Use your model to predict values on holdout data

Writing out your results to a csv for submission

A very brief introduction to NLP

Tokenizing text

Testing your NLP credentials with n-grams

Representing text numerically

Creating a bag-of-words in scikit-learn

Combining text columns for tokenization

What's in a token?

Creating a simple first model

Here, you'll improve on your benchmark model using pipelines. Because the budget consists of both text and numeric data, you'll learn how to build pipelines that process multiple types of data. You'll also explore how the flexibility of the pipeline workflow makes testing different approaches efficient, even in complicated problems like this one!
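
One way to process text and numeric columns in a single scikit-learn pipeline is sketched below, using `FunctionTransformer` to select columns and `FeatureUnion` to join the results. The toy data, the column names, the helper functions `select_text` and `select_numeric`, and the choice of classifier are all assumptions for illustration.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.preprocessing import FunctionTransformer

# Toy data; column names and labels are hypothetical.
df = pd.DataFrame({
    "description": [
        "teacher salary", "classroom supplies", "teacher salary bonus",
        "office supplies", "substitute teacher pay", "paper supplies",
    ],
    "amount": [50000.0, 200.0, None, 150.0, 3000.0, 75.0],
    "label": ["salary", "supplies", "salary", "supplies", "salary", "supplies"],
})


def select_text(frame):
    """Return the text column as a 1-D series of strings."""
    return frame["description"]


def select_numeric(frame):
    """Return the numeric column as a 2-D frame."""
    return frame[["amount"]]


pipeline = Pipeline([
    ("union", FeatureUnion([
        # Numeric branch: pick the column, then fill missing values
        ("numeric", Pipeline([
            ("select", FunctionTransformer(select_numeric, validate=False)),
            ("impute", SimpleImputer()),
        ])),
        # Text branch: pick the column, then build a bag-of-words
        ("text", Pipeline([
            ("select", FunctionTransformer(select_text, validate=False)),
            ("vectorize", CountVectorizer()),
        ])),
    ])),
    ("clf", LogisticRegression(max_iter=1000)),
])

X = df[["description", "amount"]]
pipeline.fit(X, df["label"])
predictions = pipeline.predict(X)
print(predictions)
```

Because each step has a name, swapping the classifier or the vectorizer means changing one entry in the list rather than rewriting the preprocessing.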

Pipelines, feature & text preprocessing

Instantiate pipeline

Preprocessing numeric features

Text features and feature unions

Preprocessing text features

Multiple types of processing: FunctionTransformer

Multiple types of processing: FeatureUnion

Choosing a classification model

Using FunctionTransformer on the main dataset

Add a model to the pipeline

Try a different class of model

Can you adjust the model or parameters to improve accuracy?

Improving your model

In this chapter, you will learn the tricks used by the competition winner, and implement them yourself using scikit-learn. Enjoy!
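
The lesson titles in this chapter mention n-gram ranges, interaction terms, and the hashing trick. The sketch below shows one way each idea can be expressed with scikit-learn; it is an illustration on made-up inputs with assumed parameter values, not the winner's actual code.

```python
import numpy as np
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.preprocessing import PolynomialFeatures

# Made-up text for illustration
texts = ["teacher salary", "classroom supplies", "substitute teacher pay"]

# N-grams plus hashing: unigrams and bigrams are mapped into a fixed
# number of columns, so no vocabulary has to be stored in memory.
hasher = HashingVectorizer(
    ngram_range=(1, 2), n_features=2**8, alternate_sign=False, norm=None
)
hashed = hasher.transform(texts)
print(hashed.shape)

# Interaction terms: add a column for every pairwise product of features.
toy = np.array([[1.0, 0.0, 1.0],
                [0.0, 1.0, 1.0]])
interactions = PolynomialFeatures(
    degree=2, interaction_only=True, include_bias=False
)
expanded = interactions.fit_transform(toy)
print(expanded)
```

With three input features, the expanded array has the three original columns followed by the three pairwise products.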

Learning from the expert: processing

How many tokens?

Deciding what's a word

N-gram range in scikit-learn

Learning from the expert: a stats trick

Which models of the data include interaction terms?

Implement interaction modeling in scikit-learn

Learning from the expert: the winning model

Why is hashing a useful trick?

Implementing the hashing trick in scikit-learn

Build the winning model

What tactics got the winner the best score?

Next steps and the social impact of your work

Learning from the experts

Data science isn't just for predicting ad clicks; it's also useful for social impact! This course is a case study from a machine learning competition on DrivenData. You'll explore a problem related to school district budgeting. Building a model to automatically classify items in a school's budget makes it easier and faster for schools to compare their spending with that of other schools. In this course, you'll begin by building a baseline model that is a simple, first-pass approach. In particular, you'll do some natural language processing to prepare the budgets for modeling. Next, you'll have the opportunity to try your own techniques and see how they compare to those of participants in the competition. Finally, you'll see how the winner was able to combine a number of expert techniques to build the most accurate model.

Case Study: School Budgeting with Machine Learning in Python
