Statistical inference and random sampling

1. Statistical inference and random sampling

We often know how to do things like compute averages and make graphs to describe our data, but what should happen next? How do we go from descriptive statistics to confident decision-making? How can we apply hypothesis tests to solve real-world problems?

2. Descriptive statistics

It all starts with a sample, as we rarely have access to the entire population. Given a sample, we likely want to compute some statistic to summarize our data. For example, consider this small set of daily S&P 500 closing prices. We could compare each day's close to the prior day's to get the daily change. We could then average those changes to see an average daily change of negative $9.14. The purpose of descriptive statistics like this is to summarize our sample.
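As a rough sketch of the calculation just described, the snippet below computes daily changes and their average with pandas. The closing prices are invented for illustration and are not the course's data, so the result will not match the figure quoted above.

```python
import pandas as pd

# Hypothetical closing prices, invented for illustration (not the course data).
closes = pd.Series([4500.0, 4480.5, 4492.25, 4470.0, 4463.5], name="close")

# Daily change: each day's close minus the prior day's close.
# The first day has no prior day, so its change is NaN.
daily_change = closes.diff()

# mean() skips NaN by default, so this averages the four available changes.
average_daily_change = daily_change.mean()
print(f"Average daily change: {average_daily_change:.2f}")
```

The average here only describes these five closes; on its own it says nothing about other trading days.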

3. Inference

On the other hand, statistical inference is the process of using a sample to infer something about our population. For example, we might use this sample statistic to conclude that, for our entire population of all S&P 500 trading days, an average daily drop of $9.14 is a reasonable estimate. Note that descriptive statistics only attempt to describe the data, whereas inference attempts to draw conclusions and make decisions from the data. This is an important distinction.

4. Statistical inference process

The process follows the order: sample, statistic, inference. We start with a sample, compute some statistic, and use that to infer the corresponding population parameter. For example, starting with a sample of customers, we might ask them how they feel about our new product. We would then use their answers to infer how the entire population of possible customers would feel about our product. This process is called statistical inference.
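The three steps can be illustrated with a small simulation. This is only a sketch under stated assumptions: the population of 10,000 customers and their 1-to-5 satisfaction scores are simulated, which is the only reason the population mean can be checked at the end.

```python
import random
import statistics

random.seed(42)  # fixed seed so the sketch is reproducible

# Hypothetical population: satisfaction scores (1-5) for 10,000 possible customers.
population = [random.choice([1, 2, 3, 4, 5]) for _ in range(10_000)]

# Step 1, sample: survey 200 customers chosen at random, without replacement.
sample = random.sample(population, k=200)

# Step 2, statistic: summarize the sample with its mean score.
sample_mean = statistics.mean(sample)

# Step 3, inference: use the sample mean as our guess for the population mean.
# In practice the population mean is unknown; we can compute it here only
# because the population is simulated.
population_mean = statistics.mean(population)
print(f"sample mean: {sample_mean:.2f}, population mean: {population_mean:.2f}")
```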

5. Point estimates

The core of statistical inference is the point estimate. A point estimate is a single value that serves as a best guess at an unknown population parameter. For example, we may be interested in how much the price of Bitcoin swings on any given day. In other words, we are interested in a population parameter: Bitcoin's average daily price swing. However, we may not have access to the entire trading history of Bitcoin, and have to work with a sample. Here we compute the difference between the daily high and low price of Bitcoin, and then take the average of these swings. This sample gives us a point estimate of 1158.95, measured in the currency of the quoted prices rather than in Bitcoins. We can use this point estimate as our best guess of the population parameter, and conclude that an average daily swing of a little over eleven hundred is typical.
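A minimal sketch of this point estimate follows. The DataFrame name btc_sample, the column names "high" and "low", and the prices themselves are assumptions made for illustration; the course's actual data set is not shown in the transcript.

```python
import pandas as pd

# Hypothetical sample of daily prices; names and values are invented.
btc_sample = pd.DataFrame(
    {
        "high": [30500.0, 31200.0, 29800.0, 30100.0],
        "low": [29400.0, 30150.0, 28900.0, 29000.0],
    }
)

# Daily swing: difference between each day's high and low price.
daily_swing = btc_sample["high"] - btc_sample["low"]

# The sample mean of the swings is the point estimate of the
# population's average daily swing.
point_estimate = daily_swing.mean()
print(f"Point estimate of average daily swing: {point_estimate:.2f}")
```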

6. Sampling

However, our point estimate depends on our sample. Different samples will yield different point estimates. Consider two different samples of 100 trading days. We can take a sample of the first 100 days using iloc and slicing up to day 100. In that case the average daily swing is 659.60. We could also pick a random starting day using numpy.random.choice, and then select the next 100 days. We need to be careful not to select past the end of the DataFrame, so we pick our starting day from the rows that exclude the last 100. This gives an average daily swing of 943.83. These are very different best guesses. Therefore, when making inferences, the first step is always to consider our sample carefully.
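The two sampling approaches might look something like the sketch below. The price data is randomly generated for illustration, and the names btc, "high", and "low" are assumptions, so the estimates will not match the figures quoted above.

```python
import numpy as np
import pandas as pd

np.random.seed(0)  # fixed seed so the sketch is reproducible

# Hypothetical price history of 500 days, randomly generated.
n_days = 500
low = np.random.uniform(20000, 40000, size=n_days)
swing = np.random.uniform(200, 2000, size=n_days)
btc = pd.DataFrame({"low": low, "high": low + swing})

# Sample 1: the first 100 days, selected by slicing with iloc.
first_100 = btc.iloc[:100]
estimate_first = (first_100["high"] - first_100["low"]).mean()

# Sample 2: 100 consecutive days from a random starting row.
# Drawing the start from the rows before the last 100 keeps the
# slice from running past the end of the DataFrame.
start = np.random.choice(len(btc) - 100)
random_100 = btc.iloc[start:start + 100]
estimate_random = (random_100["high"] - random_100["low"]).mean()

print(f"First 100 days:  {estimate_first:.2f}")
print(f"Random 100 days: {estimate_random:.2f}")
```

Both samples here are blocks of consecutive days, as in the description above. If prices trend over time, either block may be unrepresentative of the full history, which is one reason the choice of sample matters.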

7. Let's practice!

In the following exercises we'll explore the effect of choosing different samples on statistical inference.
