Outliers

1. Outliers

So you've made some predictions and estimated roughly how accurate they are. Ultimately, however, all predictions will have a certain amount of error. So how can you be sure that the errors in your predictions don't create serious risks? This chapter will explore both the statistical and psychological aspects of risk. Considering risk from these perspectives can help us make better decisions by weighing the potential outcomes of our choices.

2. Defining risk

Traditionally, we define risk as exposure to danger. When we think of risk in a statistical and psychological sense, we define two dimensions: likelihood and consequences. The likelihood of an event refers to how frequently we expect it to occur, while the consequences refer to how severe the event is when it does occur. Take the example of an earthquake in the abstract. Earthquakes are fairly infrequent and are virtually unheard of in certain parts of the world. However, earthquakes can have disastrous consequences. Thinking of risk in this way can help us

3. Defining outliers

One risk in making predictions is that the data we use to predict don't resemble future circumstances. One reason for this could be the presence of outliers. Outliers are data points that fall outside the normal expected range of values and skew the data. They are inherently unlikely, but they can have a significant impact, depending on the analysis. Consider home prices, for example. A couple of outlying millionaire homeowners in an area could pull the average home price sharply upward (which is why you'll often see the median home price reported instead of the mean). Website hits are another example; sometimes traffic spikes on a given page if it starts trending. Including these data points when making predictions can skew results higher or lower and make the predictions less accurate.
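To make the home price example concrete, here is a small sketch with made-up numbers. Assume, purely for illustration, that cells A2:A6 hold five home prices: 200000, 250000, 300000, 350000, and 5000000. The cell range and the prices are hypothetical and are not part of the course data.

```
=AVERAGE(A2:A6)
=MEDIAN(A2:A6)
```

With those values, AVERAGE returns 1220000 while MEDIAN returns 300000. The single five-million-dollar home pulls the mean far above what a typical home in the list costs, but it leaves the median untouched.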

4. SORTing outliers

The SORT() function can help order our data to see how much variation there is. Once sorted, we can easily see which values fall outside our expected range. The SORT function takes three arguments: the entire range of cells to sort, the column by which you want to sort, and a true or false value that indicates whether you want to sort low-to-high (TRUE) or high-to-low (FALSE). We can also sort by multiple columns by repeating the last two arguments, the sort column and the true or false value, as many times as needed.
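As a rough sketch of how this looks in a sheet, assume the crash data sit in A2:C100 with the precinct in column A, the number of vehicles in column B, and the number of injuries in column C. That layout is an assumption for illustration only. The first formula sorts by the third column, high-to-low; the second sorts by precinct low-to-high and then, within each precinct, by injuries high-to-low.

```
=SORT(A2:C100, 3, FALSE)
=SORT(A2:C100, 1, TRUE, 3, FALSE)
```

Sorting high-to-low puts the largest injury counts at the top, so any unusually large values are the first rows you see.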

5. FILTERing outliers

The FILTER() function can be useful for removing outliers. It takes a range of cells to filter and one or more conditions, and it returns only the rows of that range that meet every condition. For example, we can filter cells using greater than or less than operators, or we can search for events that occur only in the West precinct using the equals sign.
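The formulas below sketch these cases under the same assumed layout: data in A2:C100, precinct in column A, and injuries in column C. The layout and the cutoff of 5 injuries are hypothetical choices, not values taken from the course. The first formula keeps rows with fewer than 5 injuries, the second keeps only West precinct rows, and the third applies both conditions at once.

```
=FILTER(A2:C100, C2:C100 < 5)
=FILTER(A2:C100, A2:A100 = "West")
=FILTER(A2:C100, A2:A100 = "West", C2:C100 < 5)
```

Each condition range needs to have the same number of rows as the range being filtered, which is why both ranges run from row 2 to row 100.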

6. Data overview

The data we use in this chapter are a sample of car crashes recorded by the Nashville Police Department and are reported through Nashville's open data portal. We will explore the region, number of vehicles, and number of injuries for certain crashes, but the full dataset includes more specific locations, weather and lighting conditions, and more factors that contribute to car crashes.

7. Let's practice!

Now that you've explored the statistical and psychological elements of assessing risk, let's dig into the exercises.