
Aggregating partial duplicates

Another way of handling partial duplicates is to compute a summary statistic of the values that differ between them, such as the mean, median, maximum, or minimum. This comes in handy when you're not sure how your data was collected and want an average, or when domain knowledge tells you that you'd rather have an estimate that is too high than one that is too low (or vice versa).
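To make the choice of statistic concrete, here is a minimal sketch on a made-up data frame. The rides table and the output column names (duration_mean, duration_high, duration_low) are hypothetical and only serve the illustration; they are not part of bike_share_rides.

```r
library(dplyr)

# Hypothetical toy data: ride 1 was recorded twice with different durations
rides <- tibble(
  ride_id      = c(1, 1, 2),
  date         = as.Date(c("2017-04-01", "2017-04-01", "2017-04-02")),
  duration_min = c(10, 14, 7)
)

# Collapse each ride_id/date pair to one row and compare candidate statistics
rides_summary <- rides %>%
  group_by(ride_id, date) %>%
  summarize(
    duration_mean = mean(duration_min),  # a middle-of-the-road estimate
    duration_high = max(duration_min),   # prefer overestimating
    duration_low  = min(duration_min),   # prefer underestimating
    .groups = "drop"
  )
```

For ride 1, the mean is 12, the maximum is 14, and the minimum is 10, so which statistic you keep depends on which kind of error you would rather risk.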

dplyr is loaded and bike_share_rides is available.

This exercise is part of the course

Cleaning Data in R


Exercise instructions

  • Group bike_share_rides by ride_id and date.
  • Add a column called duration_min_avg that contains the mean ride duration for the row's ride_id and date.
  • Remove duplicates based on ride_id and date, keeping all columns of the data frame.
  • Remove the duration_min column.

Hands-on interactive exercise

Have a go at this exercise by completing this sample code.

bike_share_rides %>%
  # Group by ride_id and date
  ___ %>%
  # Add duration_min_avg column
  mutate(duration_min_avg = ___ ) %>%
  # Remove duplicates based on ride_id and date, keep all cols
  ___ %>%
  # Remove duration_min column
  ___(-___)
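
As a rough illustration of how the four steps fit together, the sketch below runs the same kind of pipeline on a small made-up data frame. The toy_rides table and its station column are hypothetical stand-ins for bike_share_rides, and this is one plausible way to fill in the blanks rather than the course's official solution.

```r
library(dplyr)

# Hypothetical stand-in for bike_share_rides
toy_rides <- tibble(
  ride_id      = c(1, 1, 2),
  date         = as.Date(c("2017-04-01", "2017-04-01", "2017-04-02")),
  duration_min = c(10, 14, 7),
  station      = c("A", "A", "B")
)

toy_clean <- toy_rides %>%
  # Group by ride_id and date
  group_by(ride_id, date) %>%
  # Add the mean duration for each ride_id/date pair to every row
  mutate(duration_min_avg = mean(duration_min)) %>%
  # Keep the first row per ride_id/date pair, with all columns
  distinct(ride_id, date, .keep_all = TRUE) %>%
  # Drop the original, conflicting duration column
  select(-duration_min)
```

The result keeps one row per ride, with duration_min_avg equal to 12 for ride 1 and 7 for ride 2. The data frame is still grouped at this point, so adding ungroup() at the end may be worthwhile if later steps should not operate per group.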