
Bernoulli versus binomial distribution

1. Bernoulli versus binomial distribution

During this section, you'll learn about two closely related distributions: The Bernoulli and binomial. Understanding these distributions will help you understand the input options for binomial GLMs in R.

2. Foundation of GLM

These two distributions form the basis of logistic regression. The choice of distribution is closely tied to the structure of your data. I personally use the Bernoulli format most of the time, but use a binomial format on occasions when my data structure guides me toward it.

3. Bernoulli distribution

The Bernoulli distribution models a single event, such as one coin flip. The probability of the outcome k, which is either 0 or 1, depends only on the probability of success, p. For example, flipping a fair coin produces the expected outcomes shown in this figure.
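As a minimal sketch, we can evaluate the Bernoulli probabilities for a fair coin with dbinom(), setting size = 1:

```r
# Bernoulli PMF via dbinom() with size = 1
# For a fair coin (p = 0.5), both outcomes are equally likely.
p <- 0.5
dbinom(0, size = 1, prob = p)  # probability of tails: 0.5
dbinom(1, size = 1, prob = p)  # probability of heads: 0.5
```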

4. Binomial distribution

The binomial distribution is closely related to the Bernoulli, but models multiple events occurring at the same time. For example, we might model the number of heads expected if we flip multiple coins at the same time. The binomial includes the same inputs as the Bernoulli, but also includes the number of trials, n. This example plot illustrates the expected number of heads if we were to flip 4 coins at once.
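The expected outcomes for the 4-coin example can be computed directly with dbinom():

```r
# Probability of each possible number of heads from 4 fair coin flips
dbinom(0:4, size = 4, prob = 0.5)
# 0.0625 0.2500 0.3750 0.2500 0.0625 -- 2 heads is the most likely outcome
```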

5. Simulating in R

In R, the binom family of functions corresponds to the binomial distribution, and rbinom() is its random number generator. I use this function to simulate data to check my models and my scientific study designs. We will explore this more later in this chapter. The inputs include n, the number of random numbers to generate; size, the number of trials; and prob, the probability of success. If size equals one, then you are simulating a Bernoulli distribution.
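A short sketch of both cases, with a seed set so the draws are reproducible:

```r
set.seed(42)  # make the random draws reproducible

# Ten Bernoulli draws: each is a single coin flip, so values are 0 or 1
rbinom(n = 10, size = 1, prob = 0.5)

# Ten binomial draws: each is the number of heads from 4 flips, so 0 to 4
rbinom(n = 10, size = 4, prob = 0.5)
```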

6. GLM inputs options

In R, there are three formats for inputting data into a binomial GLM. The first is the long format, which corresponds to the Bernoulli format and allows predictor variables for each observation. The second and third are wide formats, which correspond to binomial formats. The second uses an input matrix of successes and failures, which can easily be created using cbind(). The third uses the proportion of successes and corresponding weights, the number of observations for each line. The wide inputs correspond to group-level observations rather than individuals.
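A minimal sketch of all three formats, using simulated data (the variable names dose and dead are hypothetical). All three calls fit the same model and recover identical coefficients:

```r
set.seed(1)
# Simulated long-format data: 25 individuals at each of 4 doses
df <- data.frame(dose = rep(1:4, each = 25))
df$dead <- rbinom(100, size = 1, prob = plogis(-2 + 0.5 * df$dose))

# 1. Long (Bernoulli) format: one 0/1 outcome per row
m_long <- glm(dead ~ dose, data = df, family = binomial)

# 2. Wide format: matrix of successes and failures built with cbind()
agg <- aggregate(dead ~ dose, data = df, FUN = sum)
agg$alive <- 25 - agg$dead
m_cbind <- glm(cbind(dead, alive) ~ dose, data = agg, family = binomial)

# 3. Wide format: proportion of successes plus weights
agg$prop <- agg$dead / 25
m_prop <- glm(prop ~ dose, data = agg, weights = rep(25, 4),
              family = binomial)

# All three produce the same coefficient estimates
coef(m_long); coef(m_cbind); coef(m_prop)
```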

7. Example

As a toxicology example of wide versus long data, we might conduct a survival experiment where test organisms are either dead or alive at the end of the trial. Our data might be in long format if we have one entry per row and predictors for each individual. For example, we might have a data frame with the status of individuals at the end of the study, their treatment group, and individual length. Conversely, we could also have our data in wide format. Wide data would have one group per line and predictors for each group. As an example, the data frame might have the group, number dead, number alive, total, and group tank temperature. The tidyverse or data.table packages have tools for converting between wide and long formats.
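The two shapes can be sketched in base R; the data values here are hypothetical, and base tapply() stands in for the tidyverse or data.table conversion tools:

```r
# Hypothetical long-format data: one individual per row
long_df <- data.frame(
  status    = c(1, 0, 1, 1, 0, 0),              # 1 = alive, 0 = dead
  treatment = rep(c("control", "dosed"), each = 3),
  length    = c(4.2, 3.9, 4.5, 4.1, 3.8, 4.4)
)

# The same outcomes summarized in wide format: one group per row
wide_df <- data.frame(
  group = c("control", "dosed"),
  alive = tapply(long_df$status, long_df$treatment, sum),
  dead  = tapply(1 - long_df$status, long_df$treatment, sum)
)
wide_df$total <- wide_df$alive + wide_df$dead
wide_df
```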

8. Which input method to use?

Now you're probably wondering, which input option should I use? This largely depends upon the structure of your data. Is it long, with observations per individual? Or wide, with observations per group? Also, are your observations at the group level or the individual level? Last, are you interested in groups or individuals? For example, if we were studying the probability of eggs hatching in a nest, are we interested in nest success or the success of individual eggs? Data structure usually drives my choice.

9. Let's practice!

Now, let's examine these three methods with R!