1. Machine learning basics
Now we'll cover the basics of machine learning.
This should be a recap of material that you've
already covered in previous DataCamp courses.
We'll start with how to fit a model and make predictions with it using scikit-learn.
2. Always begin by looking at your data
Before performing any data analysis, you should always take a look at your raw data.
This gives you a quick, high-level sense of the kind and quality of your data. In NumPy,
you can do this by printing the first few rows of the array.
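As a minimal sketch of this first look, assuming the data is already loaded as a two-dimensional NumPy array (the array below is made-up random data, not the course dataset):

```python
import numpy as np

# Hypothetical data: 10 samples (rows) by 3 features (columns).
rng = np.random.default_rng(0)
data = rng.normal(size=(10, 3))

# Slice the first five rows to get a quick look at the raw values.
first_rows = data[:5]
print(first_rows)

# The shape tells you how many rows and columns you are dealing with.
print(data.shape)
```

Printing the shape alongside the first rows is a cheap way to confirm that the array was loaded the way you expected.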
3. Always begin by looking at your data
In pandas, you can do this with the dot-head method, which returns the first five rows
and all columns by default.
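A small sketch of the same idea in pandas. The DataFrame and its column names are invented for illustration:

```python
import pandas as pd

# Hypothetical table; the column names are made up for this example.
df = pd.DataFrame({
    "feature_a": [0.1, 0.4, 0.35, 0.8, 0.2, 0.9, 0.55],
    "feature_b": [1.0, 2.0, 1.5, 3.0, 2.5, 0.5, 1.2],
    "label": [0, 0, 0, 1, 0, 1, 1],
})

# .head() returns the first five rows by default.
preview = df.head()
print(preview)

# Pass a number to see a different count of rows.
print(df.head(3))
```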
4. Always visualize your data
It is also crucial to visualize your data. The proper visualization will depend on
the kind of data you've got, though histograms and scatterplots are a good place to start.
Look at the distribution of your data. Does it seem reasonable? Are there any outliers? Are you missing data?
Each of these questions is important to answer before doing any analysis.
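One way to start answering these questions is sketched below, assuming Matplotlib is available. The data is synthetic, with one outlier and one missing value injected on purpose so that both show up in the checks:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np

# Hypothetical data: two related features.
rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = 2 * x + rng.normal(scale=0.5, size=200)
x[0] = 15.0      # an obvious outlier
y[1] = np.nan    # a missing value

# Count missing values before plotting.
n_missing = int(np.isnan(x).sum() + np.isnan(y).sum())
print("missing values:", n_missing)

fig, (ax_hist, ax_scatter) = plt.subplots(1, 2, figsize=(8, 3))
ax_hist.hist(x, bins=30)          # distribution of one feature
ax_hist.set_title("Histogram of x")
ax_scatter.scatter(x, y, s=10)    # relationship between two features
ax_scatter.set_title("x vs. y")
fig.tight_layout()
# In an interactive session, call plt.show() to display the figure.
```

In the histogram, the injected outlier appears as a lone bar far to the right of the rest of the distribution.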
5. Scikit-learn
Once you've gotten to know your data, it's time to start modeling it. The most popular
library for machine learning in Python is called "scikit-learn". It has a standardized
API so that you can fit many different models with a similar code structure. Here, we
import a support vector machine classifier to classify data points.
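To illustrate what a standardized API means in practice, here is a rough sketch that runs two different classifiers through identical code. The dataset is generated with make_classification, and LogisticRegression is included only as a second model for comparison; neither comes from the course material:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

# Hypothetical dataset: 100 samples, 4 features, 2 classes.
X, y = make_classification(n_samples=100, n_features=4, random_state=0)

# Two unrelated models, driven by exactly the same calls.
models = [SVC(), LogisticRegression()]
predictions = {}
for model in models:
    model.fit(X, y)                                        # train
    predictions[type(model).__name__] = model.predict(X)   # predict

for name, predicted in predictions.items():
    print(name, predicted[:10])
```

Swapping in another estimator only changes the line that builds the list of models.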
6. Preparing data for scikit-learn
scikit-learn expects data to have a particular shape. Before you pass data to scikit-learn, it should be two-dimensional.
The first axis should correspond to sample number, and the second should correspond to feature number. This pattern is used in almost all scikit-learn functions. If your data is not in this shape, there are a few options for reshaping it so that you can use it with scikit-learn.
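A quick way to check this convention before modeling is sketched below. The arrays are made-up placeholders:

```python
import numpy as np

# Hypothetical data: 4 samples (rows), 3 features (columns).
X = np.arange(12).reshape(4, 3)
# One label per sample.
y = np.array([0, 1, 0, 1])

n_samples, n_features = X.shape
print("samples:", n_samples, "features:", n_features)

# Sanity checks: X is two-dimensional and has one row per label.
assert X.ndim == 2
assert len(y) == n_samples
```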
7. If your data is not shaped properly
The most common approach is to "transpose" your data, which reverses the order of
its axes. This is most useful when your data is two-dimensional, where it simply swaps rows and columns.
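For example, if an array arrives with features along the first axis, transposing it gives the samples-by-features layout. The values below are invented for illustration:

```python
import numpy as np

# Hypothetical array stored as (2 features, 4 samples).
features_first = np.array([[1.0, 2.0, 3.0, 4.0],
                           [10.0, 20.0, 30.0, 40.0]])

# .T swaps the two axes, giving (4 samples, 2 features).
X = features_first.T

print(features_first.shape)
print(X.shape)
```

Each row of X now holds both feature values for a single sample.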
8. If your data is not shaped properly
Another option is to use the dot-reshape method, which lets you specify the shape you want.
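A common case, sketched here with made-up values, is a one-dimensional array holding a single feature. Reshaping it with -1 lets NumPy infer the number of samples:

```python
import numpy as np

# Hypothetical single feature stored as a flat array of shape (4,).
values = np.array([3.0, 1.5, 4.2, 0.7])

# -1 means "infer this axis"; 1 means one feature column.
X = values.reshape(-1, 1)

print(values.shape)
print(X.shape)
```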
9. Fitting a model with scikit-learn
Now that your data has the correct shape, it's time to fit a model. First we must
create an instance of the model we've imported (in this case, a support-vector classifier).
You can call the method dot-fit on this instance to train the model.
Here we show how you can input X (training data) and y (labels for each datapoint) to
fit the model.
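The steps above might look like the following sketch. The tiny dataset is invented for illustration, and SVC is used with its default settings as one possible support-vector classifier:

```python
import numpy as np
from sklearn.svm import SVC

# Hypothetical training data: 4 samples, 2 features.
X = np.array([[0.0, 0.0],
              [0.2, 0.1],
              [1.0, 1.0],
              [0.9, 1.1]])
# One label per sample.
y = np.array([0, 0, 1, 1])

# Create an instance of the model, then train it on X and y.
model = SVC()
model.fit(X, y)

# After fitting, the model knows which classes it has seen.
print(model.classes_)
```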
10. Investigating the model
It is often useful to investigate what kind of pattern the model has found. Most models will store this information in attributes that are created after calling dot-fit.
Here we show the coefficients the model has assigned to each feature. For a support-vector classifier, these are only available when the kernel is linear.
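As a sketch of inspecting a fitted model, the example below assumes a support-vector classifier with a linear kernel, since scikit-learn's SVC only exposes coef_ in that case. The data is made up:

```python
import numpy as np
from sklearn.svm import SVC

# Hypothetical training data: 6 samples, 2 features.
X = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
              [1.0, 1.0], [0.9, 1.1], [1.1, 0.9]])
y = np.array([0, 0, 0, 1, 1, 1])

# Assumption: a linear kernel, so that coef_ is defined.
model = SVC(kernel="linear")
model.fit(X, y)

# Attributes ending in an underscore are created by .fit().
print(model.coef_)       # one weight per feature
print(model.intercept_)  # the offset of the decision boundary
```

For a two-class problem, coef_ has one row with one weight per feature; a larger absolute weight means the feature has more influence on the decision.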
11. Predicting with a fit model
Once your model is fit, you can call the dot-predict method on the model to determine labels for unseen datapoints.
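A short sketch of prediction on new data follows. Both the training set and the unseen points are invented, and the linear kernel is an arbitrary choice for this example:

```python
import numpy as np
from sklearn.svm import SVC

# Hypothetical training data: two well-separated clusters.
X_train = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
                    [1.0, 1.0], [0.9, 1.1], [1.1, 0.9]])
y_train = np.array([0, 0, 0, 1, 1, 1])

model = SVC(kernel="linear")
model.fit(X_train, y_train)

# Unseen points must have the same number of features as the training data.
X_new = np.array([[0.05, 0.05],
                  [1.05, 0.95]])
predicted = model.predict(X_new)
print(predicted)
```

Note that X_new is two-dimensional even though it holds only two points; a single sample would still need shape (1, 2).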
12. Let's practice
Now that we've covered loading and preparing data, and fitting a model, it's time to put this into practice.