1. Machine learning workflow
Welcome back!
2. Machine learning workflow
So far, we know that training data is used to let a model learn, and that the trained model can then be used to make predictions. But what are the steps in between? In this video, we'll introduce the machine learning workflow: the four steps that go into building a model.
3. Our scenario
We'll follow the steps with a scenario. New York City releases monthly records of all the apartments sold in the city.
It includes information on each sale, such as the square footage of the apartment, its neighborhood, the year it was built, and the price it sold for, to name a few.
We want to predict the price apartments will sell at, making our target the sale price. Since we have this labeled in our dataset, this is a supervised learning problem.
4. Step 1: Extract features
The first step is to extract features. Datasets rarely arrive with clear, ready-to-use features, so there's work to be done in reformatting the dataset. Additionally, you need to decide which features you want to begin with. In our case, we mentioned a few, such as square feet and neighborhood, but there are more that could affect our target, like distance to the nearest subway station!
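To make this step concrete, here is a minimal sketch of feature extraction in Python. The raw records, their field names, and the helper functions `parse_number` and `extract_features` are all invented for illustration; the real New York City data may be laid out quite differently.

```python
# Hypothetical raw sale records: field names and values are invented for
# illustration and do not come from the actual New York City dataset.
raw_sales = [
    {"address": "1 Example St, Apt 2A", "gross_square_feet": "850",
     "neighborhood": "Chelsea", "year_built": "1962", "sale_price": "910,000"},
    {"address": "20 Sample Ave, Apt 5C", "gross_square_feet": "1,200",
     "neighborhood": "Astoria", "year_built": "2005", "sale_price": "780,000"},
]


def parse_number(text):
    """Convert a string such as '1,200' into the float 1200.0."""
    return float(text.replace(",", ""))


def extract_features(record):
    """Turn one raw record into a (features, target) pair."""
    # Keep only the columns we decided to start with; the address is dropped.
    features = {
        "square_feet": parse_number(record["gross_square_feet"]),
        "neighborhood": record["neighborhood"],
        "year_built": int(record["year_built"]),
    }
    # The target is the value we want to predict: the sale price.
    target = parse_number(record["sale_price"])
    return features, target


dataset = [extract_features(record) for record in raw_sales]
for features, target in dataset:
    print(features, target)
```

Notice that the sketch does two things at once: it cleans the values (text becomes numbers) and it chooses which columns count as features.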
5. Step 2: Split dataset
After that, we need to split the dataset into two: the train dataset and the test dataset. The reason for doing this will become clear in the last step. For now, keep in mind that there are two datasets!
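Here is a rough sketch of what the split could look like in Python. The function name `split_dataset`, the 20% test share, and the fixed seed are assumptions made for this example, not values prescribed by the course.

```python
import random


def split_dataset(rows, test_fraction=0.2, seed=0):
    """Shuffle the rows, then return (train_rows, test_rows)."""
    rows = list(rows)                   # copy so the caller's data is untouched
    random.Random(seed).shuffle(rows)   # fixed seed makes the split repeatable
    n_test = max(1, round(len(rows) * test_fraction))
    return rows[n_test:], rows[:n_test]


# Stand-in data: ten row identifiers instead of real apartment records.
rows = list(range(10))
train_rows, test_rows = split_dataset(rows)
print(len(train_rows), "rows for training,", len(test_rows), "rows for testing")
```

Shuffling before splitting is a common choice so that neither dataset ends up with, say, only the most recent sales.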
6. Step 3: Train model
The third step is training the model.
7. Step 3: Train model
To do this, the train dataset is fed into a chosen machine learning model.
There are many different machine learning models to choose from, with different use cases and levels of complexity. You may have heard of some examples, such as neural networks and logistic regression.
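As a small illustration of training, the sketch below fits a simple linear regression by ordinary least squares. This model is chosen here only because it is short enough to write by hand; it is not one of the models named above, and the training numbers are made up.

```python
def fit_line(xs, ys):
    """Fit price = slope * square_feet + intercept by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict(model, square_feet):
    """Use the fitted (slope, intercept) pair to predict a sale price."""
    slope, intercept = model
    return slope * square_feet + intercept


# Made-up train dataset: apartment sizes and the prices they sold for.
train_square_feet = [500.0, 750.0, 1000.0, 1250.0]
train_prices = [500_000.0, 700_000.0, 900_000.0, 1_100_000.0]

model = fit_line(train_square_feet, train_prices)
print("Predicted price for 900 square feet:", predict(model, 900.0))
```

"Training" here simply means finding the slope and intercept that best match the train dataset. More complex models learn many more numbers, but the idea is the same.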
8. Step 4: Evaluate
Now we have a model and it needs to be evaluated! We can't assume the resulting model is going to be usable.
What would be the best way to evaluate the model? In our case, we would want to put the features of apartments with known sale prices into the model and see how accurately it predicts those prices. We don't want to use any data used to train the model, because the model has already seen that data. Luckily, this is exactly what the test dataset is for!
9. Step 4: Evaluate
We put the test dataset, often called "unseen data", into the model to get the model's predictions.
There are many ways we could evaluate the performance of our model. For example, we could calculate the average error of the predictions, or the percentage of apartment sale prices that were predicted within a 10% margin.
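Both of those metrics are easy to sketch in Python. The function names and the prices below are invented for illustration, and "average error" is interpreted here as the mean absolute error, which is one common reading.

```python
def mean_absolute_error(actual, predicted):
    """Average size of the prediction error, ignoring its direction."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def fraction_within_margin(actual, predicted, margin=0.10):
    """Share of predictions that land within `margin` of the true price."""
    hits = sum(1 for a, p in zip(actual, predicted) if abs(a - p) <= margin * a)
    return hits / len(actual)


# Made-up test dataset results: true sale prices and the model's predictions.
actual_prices = [500_000.0, 800_000.0, 1_000_000.0, 650_000.0]
predicted_prices = [520_000.0, 700_000.0, 1_050_000.0, 640_000.0]

mae = mean_absolute_error(actual_prices, predicted_prices)
within_10 = fraction_within_margin(actual_prices, predicted_prices)
print("Mean absolute error:", mae)            # 45000.0
print("Share within 10% margin:", within_10)  # 0.75
```

In this toy example, three of the four predictions fall within 10% of the true price, so the second metric comes out at 75%.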
10. Step 4: Evaluate
Whatever metric is chosen, a performance threshold needs to be decided. For example, let's say our model is predicting 80% of the apartments accurately. Is that good enough?
11. Step 4: Evaluate
If yes, our model is ready to use!
12. Step 4: Evaluate
If not, we return to training the model, except this time we "tune" it. Tuning can mean a few different things, such as tweaking the model's options or its features. We'll get more into that in chapter 2.
Tuning the model can take a while, and if performance isn't improving, it often means you don't have enough data.
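The train, evaluate, and tune loop can be sketched as follows. Everything here is an assumption made for illustration: the model (a tiny nearest-neighbors average), the option being tuned (`k`, the number of neighbors), the candidate values, the 80% threshold, and the data.

```python
def knn_predict(train, square_feet, k):
    """Average the prices of the k training apartments closest in size."""
    nearest = sorted(train, key=lambda row: abs(row[0] - square_feet))[:k]
    return sum(price for _, price in nearest) / k


def fraction_within_margin(actual, predicted, margin=0.10):
    """Share of predictions that land within `margin` of the true price."""
    hits = sum(1 for a, p in zip(actual, predicted) if abs(a - p) <= margin * a)
    return hits / len(actual)


# Made-up (square_feet, sale_price) pairs for the train and test datasets.
train = [(500, 500_000), (600, 580_000), (750, 700_000),
         (900, 820_000), (1000, 900_000), (1200, 1_060_000)]
test = [(550, 560_000), (800, 740_000), (1100, 980_000)]

THRESHOLD = 0.8    # hypothetical "good enough" performance level
chosen_k = None
for k in (5, 3, 1):  # candidate option values to try, one after another
    predictions = [knn_predict(train, sqft, k) for sqft, _ in test]
    score = fraction_within_margin([price for _, price in test], predictions)
    print(f"k={k}: {score:.0%} of test prices predicted within 10%")
    if score >= THRESHOLD:
        chosen_k = k   # good enough, so stop tuning
        break

print("Chosen k:", chosen_k)
```

If no candidate reaches the threshold, `chosen_k` stays `None`, which is the situation where gathering more data may help more than further tuning.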
13. Machine learning workflow
And that's the workflow! Don't worry, you don't have to have all the details memorized. These topics will be revisited throughout the course.
14. Summary of steps
In summary, we start by extracting features we want from our data. We then split the dataset into two for training and testing. Next, we train our model using the train dataset and a machine learning model.
Finally, we evaluate the model! If the performance isn't good enough, we tune and go back to step 3.
15. Let's practice!
Let's practice!