
1. Moving forward when model assumptions are violated

Before we discuss what to do when a model's technical assumptions are violated, remember that you won't get accurate models or analyses if you remove outlying data points on a whim. If you want a model that describes a population, you have to use all the data. In the previous example, however, we removed values that were outside the range of interest: we considered only points with fiber less than 15g, and in your hypothetical scenario you removed points below 4.6, effectively subsetting the explanatory region. It is important to explain any such subsetting in the final analysis report.

2. Linear model

Recall the linear model we've been working with: X and Y have a linear relationship, and a noise term is added on. Recall also that the noise values come from a normal distribution centered at zero. Transforming either the explanatory variable (X) or the response (Y) changes the form of the model. To perform an inferential analysis relating the variables, the technical conditions must be met; in particular, we need a linear relationship between X and Y. Sometimes no linear relationship exists between X and Y, but one does exist between Y and a function of X, or between a function of Y and X.
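The model above can be sketched in a short simulation. The parameter values (intercept 2, slope 0.5, noise standard deviation 1) are made up for illustration, not taken from the course data:

```python
import numpy as np

rng = np.random.default_rng(42)
n = 200
beta0, beta1, sigma = 2.0, 0.5, 1.0   # hypothetical parameter values

x = rng.uniform(0, 10, n)
eps = rng.normal(0, sigma, n)          # noise: normal, centered at zero
y = beta0 + beta1 * x + eps            # linear model: y = beta0 + beta1 * x + noise

# A least-squares fit recovers the coefficients, up to sampling error.
slope, intercept = np.polyfit(x, y, 1)
print(intercept, slope)
```

Because the noise is centered at zero with constant variance, the fitted slope and intercept land close to the true values used in the simulation.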

3. Transforming the explanatory variable

When the data come from a model that is a function of the explanatory variable, the model stays linear, but it is linear in a function of X rather than in X itself. We say: Y is a linear function of X and X-squared together, or Y is a linear function of the log of X, or of the square root of X. Importantly, transforming X does not change the structure of the residuals: they are still modeled as normally distributed, centered at zero, with constant variance.
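As a minimal sketch of "linear in a function of X", suppose the truth is Y = 1 + 3·log(X) plus normal noise (these numbers are assumptions for illustration). Regressing Y on log(X) is still an ordinary linear fit:

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(1, 50, 300)
# Assumed truth: y is linear in log(x), with constant-variance normal noise.
y = 1.0 + 3.0 * np.log(x) + rng.normal(0, 0.5, 300)

# Fit y against log(x): the model is linear in a function of x.
slope, intercept = np.polyfit(np.log(x), y, 1)
print(intercept, slope)
```

The fit is performed exactly as before; only the column handed to the regression has changed, so all the usual residual conditions still apply.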

4. Squaring the explanatory variable

When the true model relates Y to the square root of X, we can linearize the model by squaring X, the explanatory variable. Note that the scatterplot on the right uses the square of the explanatory variable. Squaring the input variable works well for data that are better fit by a curved line.
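A sketch of the squaring idea, assuming a curved truth in which Y is linear in X² (the coefficients are invented for the example). The fit on X² explains far more variance than the fit on X:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 5, 300)
# Assumed curved truth: y is linear in x squared.
y = 1.0 + 2.0 * x**2 + rng.normal(0, 1.0, 300)

def r_squared(pred, obs):
    ss_res = np.sum((obs - pred) ** 2)
    ss_tot = np.sum((obs - obs.mean()) ** 2)
    return 1 - ss_res / ss_tot

# Linear in x: misses the curvature.
fit_x = np.polyval(np.polyfit(x, y, 1), x)
# Linear in x squared: the transformed model is linear again.
fit_x2 = np.polyval(np.polyfit(x**2, y, 1), x**2)

print(r_squared(fit_x, y), r_squared(fit_x2, y))
```

Plotting Y against X² (as in the scatterplot on the right) straightens the curve, which is why the transformed fit has the higher R².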

5. Transforming the response variable

Unlike transformations of the explanatory variable, transforming the response variable also changes the relationship between the linear model and the error terms.

6. A natural log transformation

Notice that the scatterplot on the right uses the natural log of the response variable. When the model is fit after taking the log of the response, both the shape of the relationship and the increasing-noise problem have been fixed! The log transform on the response works well for data where the variance is unequal.
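A sketch of why the log transform fixes unequal variance, assuming a multiplicative-error model (all parameter values here are made up): on the original scale the spread of Y grows with X, but after taking log(Y) the errors are additive, normal, and roughly constant across X:

```python
import numpy as np

rng = np.random.default_rng(7)
x = rng.uniform(0, 4, 400)
# Assumed multiplicative-error truth: noise enters inside the exponential,
# so on the original scale the spread of y increases with x.
y = np.exp(0.5 + 0.8 * x + rng.normal(0, 0.3, 400))

# Fit the linear model to log(y) instead of y.
slope, intercept = np.polyfit(x, np.log(y), 1)
resid = np.log(y) - (intercept + slope * x)

# Residual spread in the lower and upper halves of x is now about equal.
lo = resid[x < 2].std()
hi = resid[x >= 2].std()
print(slope, lo, hi)
```

Comparing the residual spread across the two halves of the X range is a quick numerical stand-in for the residual plots you will check in the exercises.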

7. Let's practice!

With many different possible transformations, there is not always just one right way to transform the data. For now, you will practice transforming variables and checking residual plots to determine whether the technical conditions are met.
