The logistic distribution
1. The logistic distribution
In order to understand logistic regression, you need to know about the logistic distribution.
2. Gaussian probability density function (PDF)
Before we get to the logistic distribution, let's look at the Gaussian, or normal, distribution. Hopefully, you are familiar with the famous "bell curve" of its probability density function, made with the dnorm function. For the purposes of regression, we care more about the area under this curve. By integrating the dnorm function - calculating the area underneath it - we get another curve, known as the cumulative distribution function.
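As a rough illustration (the x grid here is just an arbitrary choice, not the course's dataset), you can draw the bell curve with dnorm and then use integrate to find the area underneath it up to some value.
  x <- seq(-4, 4, 0.01)
  plot(x, dnorm(x), type = "l")   # the Gaussian "bell curve" PDF
  # Area under the PDF from minus infinity up to 1
  integrate(dnorm, -Inf, 1)       # roughly 0.84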
3. Gaussian cumulative distribution function (CDF)
To get the cumulative distribution function, or CDF, you call pnorm instead of dnorm. The y-axis is near zero on the far left of the plot, and near one on the far right of the plot. This is a feature of the CDF curve for all distributions. When x has its minimum possible value, in this case minus infinity, y will be zero. When x has its maximum possible value, in this case infinity, y will be one. You can think of the CDF as a transformation from the values of x to probabilities. When x is one, the CDF curve is at zero-point-eight-four. That means that for a variable x with a standard normal distribution, the probability that x is less than one is eighty-four percent.
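That eighty-four percent figure comes straight from pnorm; here is a quick check (the printed values are approximate).
  pnorm(1)      # about 0.841: P(x < 1) for a standard normal
  pnorm(-Inf)   # 0: the far left of the CDF curve
  pnorm(Inf)    # 1: the far right of the CDF curve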
4. Gaussian inverse CDF
Since the CDF transforms from x-values to probabilities, you also need a way to get back from probabilities to x-values. This is the inverse CDF. Here we have a new dataset with probabilities from nearly zero to nearly one. The inverse CDF is calculated with qnorm. The line plot you see is the same as the CDF plot from the previous slide, but with the x and y axes flipped.
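A minimal sketch of that flipped plot, assuming an illustrative vector p of probabilities as a stand-in for the dataset on the slide:
  p <- seq(0.001, 0.999, 0.001)   # probabilities from nearly 0 to nearly 1
  plot(p, qnorm(p), type = "l")   # inverse CDF: probabilities back to x-values
  qnorm(pnorm(1))                 # the round trip recovers 1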
5. Distribution function names
The function names for distribution curves all follow a pattern. The PDF function starts with "d", the CDF function starts with "p", and the inverse CDF starts with "q". Then the names end with an abbreviation of the distribution name. dlogis, plogis, and qlogis follow the same naming convention as the normal distribution functions.
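For example, the logistic functions are called exactly like their normal counterparts:
  dnorm(0); dlogis(0)       # "d": PDFs
  pnorm(0); plogis(0)       # "p": CDFs (both equal 0.5 at zero)
  qnorm(0.5); qlogis(0.5)   # "q": inverse CDFs (both equal 0 at one half)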
6. glm()'s family argument
Performing linear regression with lm is the same as performing it with glm and setting the error distribution family to gaussian. Switching from linear regression to logistic regression is done by changing the family argument to binomial. So, what are these family arguments?
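Schematically, the three calls look like this; the built-in mtcars data and the vs ~ mpg formula are just stand-ins, not the course's example.
  lm(vs ~ mpg, data = mtcars)
  glm(vs ~ mpg, data = mtcars, family = gaussian)   # same coefficients as lm()
  glm(vs ~ mpg, data = mtcars, family = binomial)   # logistic regression (vs is 0/1)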
7. gaussian()
gaussian is a function. Calling it and wrapping the result in str shows the structure - it returns an object that contains several other functions. Between them, these functions contain all the details for turning a generalized regression into a specific type of regression like linear or logistic regression.
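You can try this yourself; the exact set of elements may vary a little between R versions.
  str(gaussian())
  # A "family" object: a list whose elements include functions such as
  # linkfun, linkinv, variance, and dev.resids.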
8. linkfun and linkinv
Two elements of the family object are especially important. The linkfun element provides a transformation of the response variable, and the linkinv element undoes that transformation. For the gaussian case, it is boring because each transformation is just the identity function: it returns the same value that you put into it, since no special transformation is needed. For logistic regression, however, it becomes more exciting.
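A quick comparison of the two cases (the input values are arbitrary):
  gaussian()$linkfun(3)      # identity: returns 3
  gaussian()$linkinv(3)      # identity: returns 3
  binomial()$linkfun(0.75)   # logit: log(0.75 / 0.25), about 1.1
  binomial()$linkinv(1.1)    # inverse logit: back to about 0.75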
9. Logistic PDF
Here's the logistic distribution PDF. It looks a little bit like the Gaussian PDF, but the tails at the extreme left and right of the plot are fatter.
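One way to see the fatter tails is to overlay the two PDFs on an arbitrary grid:
  x <- seq(-6, 6, 0.01)
  plot(x, dlogis(x), type = "l")   # logistic PDF
  lines(x, dnorm(x), lty = 2)      # Gaussian PDF, dashed, for comparison
  dlogis(4); dnorm(4)              # the logistic density is far larger out in the tail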
10. Logistic distribution
The CDF for the logistic distribution is also known as the logistic function. The two terms are interchangeable. It has a fairly simple equation: one divided by one plus e to the minus x. The inverse CDF is sometimes called the logit function; again, the terms are interchangeable. Its equation is the logarithm of p divided by one minus p. In order to see what these curves look like, you'll have to try the exercises.
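A quick sanity check of both equations, using an arbitrary x:
  x <- 2
  plogis(x)           # logistic function (CDF)
  1 / (1 + exp(-x))   # same value, from the equation
  p <- plogis(x)
  qlogis(p)           # logit function (inverse CDF): recovers 2
  log(p / (1 - p))    # same value, from the equation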
11. Let's practice!
Let's look at those curves!