
Putting it All Together with KittyCatch: Part 2 - Use Graphs to Understand the Outcome

As we saw in the previous exercise, the maximum value of our outcome of interest, the distance users walked, is much larger than the mean, median, and 3rd quartile of the dataset. Outliers like this might bias our t-test results by violating the test's assumption that the data are normally distributed, so we need to deal with them. Let's chart the current distribution of DistanceWalked, then apply a method for handling the outliers and chart the variable again to see whether it looks more like a normal distribution.

This exercise is part of the course

Causal Inference with R - Experiments


Exercise instructions

  • 1) Chart the values of our outcome of interest.
  • 2) Examine the highest distances that our sample users walked.
  • 3) Use top coding to help us handle outlier values of DistanceWalked.
  • 4) Create a new chart to see if Step 3 helps make our values look more normal.

Hands-on interactive exercise

Have a go at this exercise by completing this sample code.

# 1) Run a density plot on DistanceWalked to observe the long right tail in the distribution of our outcome of interest.

    plot(density())

# 2) Since the dataset is relatively small, let's use the `head` and `order` functions to examine the highest DistanceWalked observations. The following syntax sorts the rows by the outcome of interest in decreasing order and shows the top results. Use it as a guide, replacing the dataframe and variable names with the ones we care about:

    head(Dataframe[order(Dataframe$Variable, decreasing = TRUE), ])
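
# Example: a minimal sketch of the `head`/`order` pattern on made-up data. The data frame `toy`, its `Group` column, and all values are hypothetical and are not the exercise dataset.

```r
# Hypothetical toy data, invented purely for illustration.
toy <- data.frame(
  Group = c("control", "treatment", "control", "treatment", "control", "treatment"),
  DistanceWalked = c(0.4, 1.1, 9.5, 0.8, 1.6, 1.3)
)

# order() returns the row indices sorted by DistanceWalked, largest first.
# Indexing with them reorders the rows; head() keeps the first six by default.
top_rows <- head(toy[order(toy$DistanceWalked, decreasing = TRUE), ])
print(top_rows)
```

# The trailing comma inside the square brackets matters: it tells R to reorder rows while keeping every column.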

# Note: The four largest values of DistanceWalked are more than three times as large as the 5th largest value. These are clearly outliers. However, all of these values are in the control group, so removing them from our dataset, as we did with previous outliers, might also bias our estimates by lowering that group's mean. Instead, let's set the maximum distance we will allow to 2. This is known as top-coding. The top-coded values remain the largest values in the dataset, but they are no longer extreme outliers that strain our assumption of a normal distribution.

# 3) Top coding can be arbitrary (and there are better methods for handling outliers), but for this assignment, use the following syntax as a model to create an ifelse statement that changes any value of DistanceWalked over 2 to equal 2: dataframe$variable <- ifelse(dataframe$variable > x, y, dataframe$variable)
    
# Note: x is the cutoff above which a value gets replaced, and y is the value that replaces it. Here both are 2.
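
# Example: a minimal sketch of top-coding with `ifelse` on made-up data. The data frame `toy`, the helper variable `cap`, and the values are hypothetical; only the cutoff of 2 comes from the exercise.

```r
# Hypothetical toy data, invented purely for illustration.
toy <- data.frame(DistanceWalked = c(0.4, 1.1, 9.5, 0.8, 1.6, 7.2))

cap <- 2  # the top-coding cutoff used in this exercise

# Values above the cap are replaced by the cap; all other values are kept as they are.
toy$DistanceWalked <- ifelse(toy$DistanceWalked > cap, cap, toy$DistanceWalked)
print(toy$DistanceWalked)
```

# For a simple cap like this, base R's `pmin(toy$DistanceWalked, cap)` should give the same result; `ifelse` is used here because it matches the model syntax above.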
    
    

# 4) If you completed step 3 correctly, the distribution of DistanceWalked should now look much closer to normal. Create a new density plot of DistanceWalked to check whether that is true.
    
    plot(density())
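
# Example: a minimal sketch comparing density plots before and after top-coding, using simulated values rather than the exercise data. The vectors `distance` and `capped` and the simulation settings are assumptions made for illustration.

```r
set.seed(1)  # make the simulated values reproducible

# Hypothetical data: 46 ordinary distances plus four extreme ones.
distance <- c(rnorm(46, mean = 1, sd = 0.3), 8, 9, 10, 12)

# Top-code at 2: any value above 2 becomes 2.
capped <- ifelse(distance > 2, 2, distance)

old_par <- par(mfrow = c(1, 2))  # draw the two plots next to each other
plot(density(distance), main = "Before top-coding")
plot(density(capped), main = "After top-coding")
par(old_par)  # restore the previous plotting layout
```

# The first plot should show a long right tail. The second should be far more compact, though it may still show a small bump at the cap where the top-coded values pile up.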
    