# Using bins

One of the difficulties in working with a binary response variable is understanding how it "changes." The response itself (\(y\)) is *either* 0 or 1, while the fitted values (\(\hat{y}\))—which are interpreted as probabilities—are *between* 0 and 1. But if every medical school applicant is either admitted or not, what does it mean to talk about the *probability* of being accepted?

What we'd like is a larger sample of students, so that for each GPA value (e.g., 3.54) we would have many observations (say \(n\)), and we could then take the average of those \(n\) observations to estimate the probability of acceptance. Unfortunately, since the explanatory variable is continuous, this is hopeless: it would take an infinite amount of data to make these estimates robust.

Instead, what we can do is put the observations into *bins* based on their GPA value. Within each bin, we can compute the proportion of accepted students, and we can visualize our model as a smooth logistic curve through those binned values.
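A minimal sketch of how such binning might be done, assuming the original data frame is called `MedGPA` with columns `GPA` and `Acceptance` (0 = rejected, 1 = accepted); the break points and variable names here are assumptions:

```r
library(dplyr)

# Cut GPA into intervals of width 0.25, then compute the mean GPA
# and the proportion of accepted students within each bin.
MedGPA_binned <- MedGPA %>%
  mutate(bin = cut(GPA, breaks = seq(2.5, 4, by = 0.25))) %>%
  group_by(bin) %>%
  summarize(
    mean_GPA = mean(GPA),
    acceptance_rate = mean(Acceptance)
  )
```

Each row of the result is one bin, so plotting `acceptance_rate` against `mean_GPA` gives an empirical view of how the probability of acceptance changes with GPA.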

We have created a `data.frame` called `MedGPA_binned` that aggregates the original data into separate bins for each 0.25 of GPA. It also contains the fitted values from the logistic regression model.

Here we are plotting \(\hat{y}\) as a function of \(x\), where that function is $$ \hat{y} = \frac{\exp( \hat{\beta}_0 + \hat{\beta}_1 \cdot x )}{1 + \exp( \hat{\beta}_0 + \hat{\beta}_1 \cdot x )} \,. $$ Note that the left-hand side, \(\hat{y}\), is the fitted probability of being accepted to medical school.
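The right-hand side of this formula is the inverse logit (logistic) function applied to the linear predictor, which R provides as `plogis()`. A quick illustration with made-up coefficients (the values below are assumptions, not the fitted model):

```r
# plogis(z) computes exp(z) / (1 + exp(z)), the inverse logit
beta0 <- -19.2   # hypothetical intercept, for illustration only
beta1 <- 5.45    # hypothetical slope
x <- 3.5         # a GPA value

plogis(beta0 + beta1 * x)
exp(beta0 + beta1 * x) / (1 + exp(beta0 + beta1 * x))  # same value, by hand
```

Whatever the coefficients, the output is always between 0 and 1, which is what lets us interpret the fitted values as probabilities.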

Instructions

**100 XP**

- Create a scatterplot called `data_space` for `acceptance_rate` as a function of `mean_GPA` using the binned data in `MedGPA_binned`. Use `geom_line()` to connect the points.
- Augment the model `mod`. Create predictions on the scale of the response variable by using the `type.predict` argument.
- Use `geom_line()` to illustrate the model through the fitted values.
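One way these steps might be sketched, assuming `mod` is a `glm(..., family = binomial)` fit on the original data and that the original GPA column is named `GPA` (an assumption carried over from the exercise setup):

```r
library(ggplot2)
library(broom)

# Scatterplot of binned acceptance rates, with a line connecting the points
data_space <- ggplot(MedGPA_binned, aes(x = mean_GPA, y = acceptance_rate)) +
  geom_point() +
  geom_line()

# Augment the model; type.predict = "response" puts the predictions
# on the probability scale rather than the log-odds scale
MedGPA_plus <- augment(mod, type.predict = "response")

# Overlay the smooth logistic curve through the fitted values
data_space +
  geom_line(data = MedGPA_plus, aes(x = GPA, y = .fitted), color = "red")
```

Requesting predictions on the response scale matters here: without it, `augment()` returns fitted values on the log-odds scale, which would not line up with the binned proportions on the plot.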