Random sampling

1. Random sampling

We will now look at random sampling for survey data. Random sampling is important because it helps cancel out the effects of unobserved factors.

2. Sampling in survey data analysis

Sampling is a method in which we choose a subset of data from a larger population. In survey analysis, this means selecting the subset of respondents who participate in our survey. This lets us make inferences about the larger population under study. Effective sampling also lets us work with more manageable numbers, since it is often not feasible to survey a whole population. The sample we choose must represent the larger population; otherwise, we risk sampling error. One of the ways we minimize this error is through random sampling.

3. Random sampling

A random sample is a subset of a population dataset in which each member of the population has an equal chance of being selected. Using a random sampling method in survey data analysis reduces sampling bias while still letting us work with a manageable sample size. Randomization supports high internal validity, meaning we can be more confident that a relationship we find between variables is causal, and high external validity, meaning we can apply our findings to a broader population.

4. .sample() method

Random sampling in pandas is carried out using the .sample() method. It is called on a DataFrame and has several parameters, but we will focus on three. The n parameter selects n random rows of a survey dataset, the frac parameter returns a fraction of the dataset as our sample, and the random_state parameter seeds the random number generator so that the sample is reproducible. We use either n or frac, not both: pandas raises an error if both are given. All three parameters are optional, and we will walk through each one.
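As a minimal sketch of the three parameters, the snippet below uses a small made-up DataFrame. The name survey, its columns, and its values are assumptions for illustration and are not part of the course dataset.

```python
import pandas as pd

# Hypothetical survey data: 10 respondents with a made-up satisfaction score.
survey = pd.DataFrame({
    "respondent_id": range(1, 11),
    "satisfaction": [3, 5, 4, 2, 5, 1, 4, 3, 5, 2],
})

# n: draw an exact number of rows.
sample_n = survey.sample(n=3)

# frac: draw a fraction of the rows (0.5 of 10 rows gives 5 rows).
sample_frac = survey.sample(frac=0.5)

# random_state: fix the seed so the same rows come back on every run.
sample_seeded = survey.sample(n=3, random_state=42)

print(len(sample_n), len(sample_frac), len(sample_seeded))

# Passing both n and frac is rejected by pandas.
try:
    survey.sample(n=3, frac=0.5)
except ValueError as err:
    print("Cannot combine n and frac:", err)
```

Without random_state, the rows returned by the first two calls will generally differ from run to run.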

5. Random sampling example

Suppose we have a survey sent out to employees of firm ABC. In the survey, 1000 employees have shown interest in on-site work, and 100 of them have to be selected. One way firm ABC can choose its 100 on-site candidates is through random sampling. This way each employee has an equal chance of being selected. This can be done in two ways in pandas. First, we can call the .sample() method on our entire survey dataset and specify n equals 100.
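The first approach might look like the sketch below. The employees DataFrame is a hypothetical stand-in for the firm ABC survey data; its column names and values are invented for illustration.

```python
import pandas as pd

# Hypothetical stand-in for the firm ABC survey: 1000 interested employees.
employees = pd.DataFrame({
    "employee_id": range(1, 1001),
    "gender": ["F", "M"] * 500,
})

# Draw 100 employees at random; every row is equally likely to be picked.
# By default pandas samples without replacement, so no employee repeats.
onsite_n = employees.sample(n=100)

print(onsite_n.shape)  # (100, 2)
print(onsite_n.head())
```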

6. Random sampling example

Or, we can call the .sample() method on our entire survey dataset and specify frac equals zero-point-one, since 100 employees is ten percent of 1000 employees. When we specify the fraction we want, pandas automatically calculates how many rows to return. Either way, we get a sample of 100 employees for the on-site work position. We can tell that the sample has been randomized because the index values on the left-hand side of the sample dataset are out of order. It is possible that a given sample has more women than men, for example, even when the full dataset is balanced. This is an example of sampling error when producing a sample from a dataset. Some sampling error is inevitable.
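A sketch of the frac approach follows, again on a hypothetical employees DataFrame with invented columns. The gender column is balanced on purpose, so any imbalance in the sample comes from the random draw alone.

```python
import pandas as pd

# Hypothetical stand-in for the firm ABC survey: 500 women and 500 men.
employees = pd.DataFrame({
    "employee_id": range(1, 1001),
    "gender": ["F", "M"] * 500,
})

# Ask for ten percent of the rows; pandas works out that this is 100 rows.
onsite_frac = employees.sample(frac=0.1)

print(len(onsite_frac))  # 100

# The index labels come back shuffled, a sign the rows were drawn at random.
print(onsite_frac.index[:5].tolist())

# The split is rarely exactly 50/50, which illustrates sampling error.
print(onsite_frac["gender"].value_counts())
```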

7. Random sampling example

If we want to make sure our sample dataset is reproducible, we can assign a number to the random_state parameter and use the same number each time we sample the dataset. For example, if we sample our survey data again, using either the n equals 100 parameter or the frac equals zero-point-one parameter, and indicate random_state equals 123 for each, the index values of each sample show that the samples are exactly the same.
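One way to check this reproducibility is sketched below, using the same kind of hypothetical employees DataFrame (invented columns and values). It compares the index labels of repeated draws that share random_state equals 123.

```python
import pandas as pd

# Hypothetical stand-in for the firm ABC survey data.
employees = pd.DataFrame({
    "employee_id": range(1, 1001),
    "gender": ["F", "M"] * 500,
})

# Two separate draws with the same seed.
first = employees.sample(n=100, random_state=123)
second = employees.sample(n=100, random_state=123)

# Same seed, same call: the selected rows and their order match.
print(first.index.equals(second.index))  # True

# The frac version with the same seed, compared against the n version.
by_frac = employees.sample(frac=0.1, random_state=123)
print(first.index.equals(by_frac.index))
```

Changing the seed, or leaving random_state out, would generally give a different set of 100 employees.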

8. Let's practice!

Let's practice what you know so far!
