1. What are the chances?
People talk about chance pretty frequently, like what are the chances of closing a sale, of rain tomorrow, or of winning a game? But how exactly do we measure chance?
2. Measuring chance
We can measure the chances of an event using probability. We can calculate the probability of some event by taking the number of ways the event can happen and dividing it by the total number of possible outcomes.
For example, if we flip a coin, it can land on either heads or tails. To get the probability of the coin landing on heads, we divide the 1 way to get heads by the two possible outcomes, heads and tails. This gives us one half, or a fifty percent chance of getting heads.
Probability is always between 0% and 100%. If the probability of something is 0%, it's impossible, and if the probability of something is 100%, it will certainly happen.
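As a rough illustration of the "ways the event can happen divided by total outcomes" rule, here is a minimal sketch of the coin flip in base R. The variable names are made up for this example.

```r
# Probability = (ways the event can happen) / (total possible outcomes)
outcomes <- c("heads", "tails")   # every possible result of one coin flip
event <- "heads"                  # the event we care about

# Count the outcomes that match the event, then divide by the total
p_heads <- sum(outcomes == event) / length(outcomes)
p_heads
```

The same division works for any situation where every outcome is equally likely.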
3. Assigning salespeople
Let's look at a more complex scenario. There's a meeting coming up with a potential client, and we want to send someone from the sales team to the meeting.
We'll put each of the four salespeople's names on a ticket in a box and pull one out at random to decide who goes to the meeting.
4. Assigning salespeople
Brian's name gets pulled out. The probability of Brian being selected is one out of four, or 25%.
5. Sampling from a data frame
We can recreate this scenario in R using dplyr's sample_n function. It takes a data frame and the number of rows we want to pull out, which is 1 in this case.
However, if we run the same thing again, we may get a different row since sample_n chooses randomly. If we want to show the team how we picked Brian, this won't work well.
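Here is a minimal sketch of what that call might look like. The sales_team data frame is an assumption made up for this example: Brian and Claire come from the scenario, while the other two names are placeholders.

```r
library(dplyr)

# Hypothetical sales team. Brian and Claire appear in the scenario;
# the other two names are placeholders for this sketch.
sales_team <- data.frame(
  name = c("Amir", "Brian", "Claire", "Damian"),
  stringsAsFactors = FALSE
)

# Pull one row at random; each of the 4 rows has a 1-in-4 chance
picked <- sample_n(sales_team, 1)
picked
```

Because the pick is random, the row that comes back can differ from one run to the next.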
6. Setting a random seed
To ensure we get the same results when we run the script in front of the team, we'll set the random seed using set.seed. The seed is a number that R's random number generator uses as a starting point, so if we start it from the same seed number, it will generate the same random values each time. The number itself doesn't matter. We could use 5, 139, or 3 million. The only thing that matters is that we use the same seed the next time we run the script. Now, we, or one of the sales-team members, can run this code over and over and get Brian every time.
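To see the effect of the seed, here is a small sketch that samples twice from the same starting point. The sales_team data frame and the seed value 5 are assumptions for illustration, and which name comes back depends on the seed and the R version.

```r
library(dplyr)

# Hypothetical sales team (only Brian and Claire come from the scenario)
sales_team <- data.frame(
  name = c("Amir", "Brian", "Claire", "Damian"),
  stringsAsFactors = FALSE
)

# Set the seed, then sample
set.seed(5)
first_run <- sample_n(sales_team, 1)

# Reset to the same seed and sample again
set.seed(5)
second_run <- sample_n(sales_team, 1)

# Both runs started from the same point, so they pick the same name
same_pick <- identical(first_run$name, second_run$name)
same_pick
```

Calling set.seed once at the top of a script is usually enough to make the whole script repeatable.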
7. A second meeting
Now there's another potential client who wants to meet at the same time, so we need to pick another salesperson. Brian has already been picked and he can't be in two meetings at once, so we'll pick from the remaining three. This is called sampling without replacement, since we aren't replacing the name we already pulled out.
8. A second meeting
This time, Claire is picked, and the probability of this is one out of three, or about 33%.
9. Sampling twice in R
To recreate this in R, we can pass 2 into sample_n, which will give us 2 rows. By default, sample_n samples without replacement, so the same row can't come back twice.
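A minimal sketch of picking two people for simultaneous meetings, again using a made-up sales_team data frame:

```r
library(dplyr)

# Hypothetical sales team (only Brian and Claire come from the scenario)
sales_team <- data.frame(
  name = c("Amir", "Brian", "Claire", "Damian"),
  stringsAsFactors = FALSE
)

# Pull 2 rows; replace defaults to FALSE, so the two rows are different people
two_picks <- sample_n(sales_team, 2)
two_picks
```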
10. Sampling with replacement
Now let's say the two meetings are happening on different days, so the same person could attend both. In this scenario, we need to return Brian's name to the box after picking it. This is called sampling with replacement.
11. Sampling with replacement
Claire gets picked for the second meeting, but this time, the probability of picking her is 25%, since all four names are back in the box.
12. Sampling with replacement in R
To sample with replacement, set the replace argument of sample_n to TRUE.
If there were 5 meetings, all at different times, some rows would have to be picked more than once, since there are only four names and each one goes back in the box after it's picked.
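The five-meeting case might look like the sketch below. The sales_team data frame is an assumption for this example.

```r
library(dplyr)

# Hypothetical sales team (only Brian and Claire come from the scenario)
sales_team <- data.frame(
  name = c("Amir", "Brian", "Claire", "Damian"),
  stringsAsFactors = FALSE
)

# Pull 5 rows, putting each name back in the box after it is drawn
five_picks <- sample_n(sales_team, 5, replace = TRUE)
five_picks

# With 5 picks from 4 names, at least one name must repeat
has_repeat <- anyDuplicated(five_picks$name) > 0
has_repeat
```

Without replace = TRUE, asking for more rows than the data frame contains would fail, since there would be no names left to draw.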
13. Independent events
Let's quickly talk about independence. Two events are independent if the probability of the second event isn't affected by the outcome of the first event. For example, if we're sampling with replacement, the probability that Claire is picked second is 25%, no matter who gets picked first.
14. Independent events
In general, when sampling with replacement, each pick is independent.
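One way to check this is to list every equally likely pair of picks and compute the chance that Claire is second for each possible first pick. This is a sketch in base R, and the team vector uses placeholder names alongside Brian and Claire.

```r
# Hypothetical team (only Brian and Claire come from the scenario)
team <- c("Amir", "Brian", "Claire", "Damian")

# With replacement, every (first, second) combination is possible: 16 pairs
pairs_with <- expand.grid(first = team, second = team,
                          stringsAsFactors = FALSE)

# Share of pairs where Claire is second, grouped by who was picked first
p_claire_with <- tapply(pairs_with$second == "Claire", pairs_with$first, mean)
p_claire_with
```

Every entry comes out to 0.25, so knowing the first pick tells us nothing about the second.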
15. Dependent events
In contrast, events are considered dependent when the outcome of the first changes the probability of the second.
If we sample without replacement, the probability that Claire is picked second depends on who gets picked first.
16. Dependent events
If Claire is picked first, there's a 0% probability that Claire will be picked second.
17. Dependent events
If someone else is picked first, there's about a 33% probability that Claire will be picked second.
In general, when sampling without replacement, each pick depends on the picks that came before it.
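The same kind of check works for sampling without replacement: list every possible pair of picks, drop the pairs where one person is picked twice, and compare the chance that Claire is second for each first pick. This is a sketch in base R with placeholder names alongside Brian and Claire.

```r
# Hypothetical team (only Brian and Claire come from the scenario)
team <- c("Amir", "Brian", "Claire", "Damian")

# Start from all 16 pairs, then remove pairs that pick the same person twice
all_pairs <- expand.grid(first = team, second = team,
                         stringsAsFactors = FALSE)
pairs_without <- all_pairs[all_pairs$first != all_pairs$second, ]

# Share of pairs where Claire is second, grouped by who was picked first
p_claire_without <- tapply(pairs_without$second == "Claire",
                           pairs_without$first, mean)
p_claire_without
```

The entry for Claire is 0 and the others are one third, so the second pick's probability changes with the first pick.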
18. Let's practice!
Head over to the exercises!