
Defining error and uncertainty

1. Defining error and uncertainty

Hi, everyone. My name is Evan Kramer, and I'm excited to discuss error and uncertainty with you. In this video, we will define basic terminology and key concepts and calculate some basic measures of variation. Let's get started.

2. We make errors

Imagine you're driving home and someone cuts you off. You honk and yell, cursing the driver and his rude behavior. But what if the person who cut you off is rushing a pregnant friend to the hospital? Would that change your perception? Our brains take shortcuts and make assumptions every day, which can lead to misperception. This course will distinguish errors in judgment from errors in statistics. Errors in judgment include our assumptions about how or why things happened. These reflect our experiences, biases, and psychological tendencies. Errors in statistics refer to measures of variation or uncertainty in the data. They also include instances in which we draw incorrect conclusions from data.

3. Why care about error?

But why do we care about error? How about an example? Imagine you work for a fragrance company marketing colognes. You want to know what kind of advertising will be effective for your new product. You recruit a bunch of your friends, show them a couple of ads, and find that they like one more than the other. You post that ad on various media and wait for the sales to roll in. But when your new cologne hits the market, no one buys it. Viral posts mock your ad campaign. What happened? Well, a couple of things could be behind this flop. First, you hadn't yet taken this course to determine whether your friends REALLY liked one ad better than the other. Perhaps their apparent preference was just random noise in the data: if you asked a broader group of people (or asked the same people on a different day), you might find a different result. In this course, we'll talk a lot about statistical significance. This concept refers to a difference between groups that is unlikely to be due to random fluctuations alone. The second reason is sampling. You asked only your friends which ad they preferred, but they may not represent the entire population of people you want to buy your products. In data collection and analysis, we often rely on samples of data because collecting data from every person or situation is practically impossible.

4. Types of statistical errors

We'll cover two major types of statistical error: type I and type II error. Type I error occurs when a difference appears to be significant but actually isn't. We call these false positives. This could result from improper sampling, low data quality, or poor experimental design. Type II error occurs when a difference appears not to be significant but actually is. We call these false negatives. This could occur because of small sample sizes, large variation in the data, or poor data collection or measurement. Both of these errors can have important impacts on business decisions.
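
To make the two error types concrete, here is a minimal simulation sketch in Python. It is an illustration only, not part of the course material: the ad-preference setup, the sample size of 20, the 65% "true" preference, and the choice of an exact binomial test are all assumptions made for this example.

```python
import math
import random


def binomial_p_value(successes, n):
    """Two-sided exact p-value for the null hypothesis that both ads are equally liked."""
    observed = math.comb(n, successes)
    # Under the null, each outcome k has probability comb(n, k) / 2**n, so we add up
    # every outcome that is at least as unlikely as the one we observed.
    extreme = sum(math.comb(n, k) for k in range(n + 1) if math.comb(n, k) <= observed)
    return extreme / 2 ** n


def rejection_rate(true_pref, n, trials, alpha=0.05, seed=0):
    """Share of simulated surveys in which the difference is declared significant."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    rejections = 0
    for _ in range(trials):
        # Each of n hypothetical respondents prefers ad A with probability true_pref.
        successes = sum(rng.random() < true_pref for _ in range(n))
        if binomial_p_value(successes, n) < alpha:
            rejections += 1
    return rejections / trials


# Type I error (false positive): there is no real preference, yet the test says there is.
type_1_rate = rejection_rate(true_pref=0.5, n=20, trials=2000)

# Type II error (false negative): a real preference exists, yet the test misses it.
type_2_rate = 1 - rejection_rate(true_pref=0.65, n=20, trials=2000)

print(f"Type I rate:  {type_1_rate:.3f}")
print(f"Type II rate: {type_2_rate:.3f}")
```

With only 20 respondents, the simulated test should rarely raise a false alarm but should often miss the real preference, which matches the point above about small sample sizes leading to type II error.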

5. Measures of central tendency

We measure variation in the data to assess the amount of error. You should be familiar with the functions for calculating the mean, median, and mode of a dataset, which are AVERAGE(), MEDIAN(), and MODE(), respectively. We will also use the function for calculating the standard deviation, which is STDEV(). The standard deviation represents, roughly, the typical distance of the values in a dataset from their mean. This gives us an idea of how much the data vary: higher standard deviations indicate more variation. These measures of central tendency and variation are building blocks for us to dive deeper into other ways of assessing error later in the course.
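
For readers who want to check these calculations outside a spreadsheet, the sketch below mirrors the four functions using Python's standard `statistics` module. The list of hours is made-up illustration data, not taken from the course dataset, and the mapping to STDEV() assumes the common spreadsheet behavior of computing the sample standard deviation (dividing by n - 1).

```python
import statistics

# Hypothetical data: hour of day (24-hour format) for seven reported incidents.
hours = [4, 10, 14, 14, 16, 18, 22]

mean_hour = statistics.mean(hours)      # spreadsheet counterpart: AVERAGE()
median_hour = statistics.median(hours)  # spreadsheet counterpart: MEDIAN()
mode_hour = statistics.mode(hours)      # spreadsheet counterpart: MODE()
stdev_hour = statistics.stdev(hours)    # sample standard deviation, like STDEV()

print("mean:", mean_hour)
print("median:", median_hour)
print("mode:", mode_hour)
print("standard deviation:", round(stdev_hour, 2))
```

Here the mean, median, and mode all land on the same hour, while the standard deviation shows that individual values still spread several hours to either side of it.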

6. Data overview

In this chapter, we will use a subset of data from the Seattle Police Department on crimes committed in the area. You can download the full dataset from Seattle's open data website. We have condensed the full dataset to include only the date and time of the crime, the precinct in which it occurred, and the type of offense. One note: the times in this dataset are stored in 24-hour format, which makes it easier to apply numerical functions.
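
As a rough illustration of why 24-hour times are convenient, the Python sketch below converts times to minutes after midnight so that ordinary numerical functions apply. The `HH:MM` text format and the sample values are assumptions for this example; the actual column layout of the dataset may differ.

```python
import statistics


def to_minutes(time_24h):
    """Convert an 'HH:MM' string in 24-hour format to minutes after midnight."""
    hours, minutes = time_24h.split(":")
    return int(hours) * 60 + int(minutes)


# Hypothetical times, not taken from the Seattle dataset.
times = ["00:30", "13:45", "23:15"]
minutes = [to_minutes(t) for t in times]

# With plain numbers, the usual summary functions work directly.
mean_minutes = statistics.mean(minutes)
mean_time = f"{int(mean_minutes) // 60:02d}:{int(mean_minutes) % 60:02d}"

print("mean time of day:", mean_time)
```

With a 12-hour clock, "1:45" would be ambiguous between morning and afternoon, so a conversion like this would first need the AM/PM marker.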

7. Let's practice!

Now it's your turn to practice calculating measures of central tendency for our Seattle crime dataset.