
Modeling Real Data

1. Modeling Real Data

Previously, most of our lessons have used very simple or simulated data, so that we could focus on the code, the concepts, and the math. But linear models become truly useful when applied to real data sets in science, medicine, engineering, and society. In this lesson, you will build linear models in specific contexts, with meaningful data, while applying some of the more powerful modules from the Python data science ecosystem. For example, we will see more of statsmodels, and introduce the `LinearRegression()` class from scikit-learn. Finally, you'll see that some of the same tools we've already used to find the optimal model parameters can also be used to quantify the uncertainty in models, because even our best model is never perfect.

2. Scikit-Learn

Scikit-learn is a very powerful Python library for machine learning, including linear models. We start by importing `LinearRegression()` from `sklearn.linear_model`. Then we initialize the general form of the model by calling `LinearRegression()` with `fit_intercept=True`: this is like specifying the general form of the model as `y = a0 + a1*x`, but without specific values for a0 or a1. Next, we load the data and prepare it by reshaping. This reshape is needed because scikit-learn is designed for consistent use with more general modeling; our current case is the simplest form of a more general workflow. Finally, we call the `fit()` method of the model object, passing in our data. This finds optimal values for a0 and a1 so that the model fits the data.
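That workflow might be sketched like this; the data values here are illustrative stand-ins, since the lesson's actual data set isn't shown:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Initialize the general form of the model, y = a0 + a1*x,
# without specific values for a0 or a1
model = LinearRegression(fit_intercept=True)

# Illustrative stand-in data (assumed values, not the lesson's data)
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.0, 3.1, 4.9, 7.2, 8.8])

# Reshape the 1-D array into the 2-D column form that
# scikit-learn's more general workflow expects
x_reshaped = x.reshape(-1, 1)

# Find optimal values for a0 and a1 so the model fits the data
model.fit(x_reshaped, y)
```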

3. Predictions and Parameters

Next, although it's not needed to make predictions, we can access the fit model parameters using array-style indexing of the model attributes `coef_` and `intercept_`. This direct access, while a bit awkward, is the easiest way to extract these attributes for comparison with previous lessons. More importantly, scikit-learn has a very consistent interface for making predictions without referencing the parameters explicitly. Here we call the `predict()` method to demonstrate.
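A minimal sketch of both styles, assuming a small made-up data set that happens to lie on the line y = 1 + 2x:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Illustrative data lying exactly on y = 1 + 2*x
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0]).reshape(-1, 1)
y = np.array([1.0, 3.0, 5.0, 7.0, 9.0])

model = LinearRegression(fit_intercept=True).fit(x, y)

# Array-style access to the fit parameters
a0 = model.intercept_   # intercept
a1 = model.coef_[0]     # slope

# Predictions without referencing the parameters explicitly
x_new = np.array([[5.0], [6.0]])
y_pred = model.predict(x_new)
```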

4. statsmodels

Another fantastic tool for linear models in Python is statsmodels. Let's review how we've used it in previous lessons and then take an extra step. We start by loading the data into numpy arrays, and then repacking them into a pandas DataFrame. Notice that a pandas DataFrame has plotting methods built in, wrapping the matplotlib library, so visualizing your data is very accessible: you can call the `plot()` method of the DataFrame object to preview the data. Finally, as before, we use the `ols()` and `fit()` methods to build the model from the data.
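The same steps, sketched with illustrative stand-in data and column names of our own choosing:

```python
import numpy as np
import pandas as pd
from statsmodels.formula.api import ols

# Illustrative stand-in data (assumed values, not the lesson's data)
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.0, 3.0, 5.0, 7.0, 9.0])

# Repack the numpy arrays into a pandas DataFrame
df = pd.DataFrame({'x_column': x, 'y_column': y})

# DataFrame plotting wraps matplotlib; uncomment to preview the data
# df.plot(x='x_column', y='y_column', kind='scatter')

# Build and fit the model with a formula naming the DataFrame columns
model_fit = ols(formula="y_column ~ x_column", data=df).fit()
```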

5. Uncertainty

In the context of real data, it becomes more important to consider the uncertainty in our model parameters and predictions. We can use statsmodels to obtain estimates of uncertainty for the model parameters. Here, as before, we extract the optimal values for the parameters, a0 and a1, the intercept and slope, respectively. But this time, we also extract an "error" or to be more precise, an "uncertainty" value for each parameter. In this way we can quantify the uncertainty to expect from our model. We'll see more of these later in the course.
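One way this extraction might look, using simulated noisy data as a stand-in for the lesson's real data set; the statsmodels results object exposes the optimal values in `params` and the standard errors in `bse`:

```python
import numpy as np
import pandas as pd
from statsmodels.formula.api import ols

# Simulated noisy data around y = 1 + 2*x (a stand-in for real data)
rng = np.random.default_rng(42)
x = np.linspace(0, 10, 50)
y = 1.0 + 2.0 * x + rng.normal(0, 0.5, size=x.size)
df = pd.DataFrame({'x_column': x, 'y_column': y})

model_fit = ols(formula="y_column ~ x_column", data=df).fit()

# Optimal parameter values: a0 (intercept) and a1 (slope)
a0 = model_fit.params['Intercept']
a1 = model_fit.params['x_column']

# Standard errors: one "uncertainty" value for each parameter
e0 = model_fit.bse['Intercept']
e1 = model_fit.bse['x_column']

print(f"a0 = {a0:.2f} +/- {e0:.2f}")
print(f"a1 = {a1:.2f} +/- {e1:.2f}")
```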

6. Let's practice!

In some cases, scikit-learn will be more convenient; in others, statsmodels will be your tool of choice. No one tool offers the ideal solution for every problem, so it's great to know how to use several of the best. Now that you've seen demonstrations of several Python tools for modeling, the following exercises will help you get comfortable using them.