Aggregating partial duplicates
Another way to handle partial duplicates is to compute a summary statistic, such as the mean, median, maximum, or minimum, of the values that differ between them. This is useful when you're not sure how your data was collected and want an average, or when domain knowledge tells you that an overestimate is preferable to an underestimate (or vice versa).
dplyr is loaded and bike_share_rides is available.
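To make the choice of summary statistic concrete, here is a minimal sketch on a hypothetical toy data frame (the `rides` object and its values are assumptions for illustration, not the course's `bike_share_rides` data). It collapses each group of partial duplicates to one row and computes several candidate statistics side by side.

```r
library(dplyr)

# Hypothetical toy data: ride 1 was recorded twice with different durations
rides <- tibble(
  ride_id      = c(1, 1, 2),
  date         = as.Date(c("2017-04-15", "2017-04-15", "2017-04-16")),
  duration_min = c(10, 14, 7)
)

# One row per ride_id/date pair, with several candidate summary statistics
ride_summaries <- rides %>%
  group_by(ride_id, date) %>%
  summarize(
    duration_min_avg = mean(duration_min),  # neutral choice
    duration_min_max = max(duration_min),   # errs on the high side
    duration_min_min = min(duration_min),   # errs on the low side
    .groups = "drop"
  )
```

Note that `summarize()` keeps only the grouping columns and the new statistics, so it suits cases where the other columns are not needed afterwards.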
This exercise is part of the course
Cleaning Data in R
Exercise instructions
- Group bike_share_rides by ride_id and date.
- Add a column called duration_min_avg that contains the mean ride duration for the row's ride_id and date.
- Remove duplicates based on ride_id and date, keeping all columns of the data frame.
- Remove the duration_min column.
Hands-on interactive exercise
Try solving this exercise by completing the sample code.
bike_share_rides %>%
  # Group by ride_id and date
  ___ %>%
  # Add duration_min_avg column
  mutate(duration_min_avg = ___ ) %>%
  # Remove duplicates based on ride_id and date, keep all cols
  ___ %>%
  # Remove duration_min column
  ___(-___)
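One possible way to fill in the blanks is sketched below. It is not necessarily the course's reference solution. The small `bike_share_rides` data frame built here is a hypothetical stand-in with made-up values (including the `station_name` column), so the snippet runs on its own; in the exercise environment the real data frame is already available.

```r
library(dplyr)

# Hypothetical stand-in for bike_share_rides (values are made up)
bike_share_rides <- tibble(
  ride_id      = c(1, 1, 2),
  date         = as.Date(c("2017-04-15", "2017-04-15", "2017-04-16")),
  duration_min = c(10, 14, 7),
  station_name = c("Market St", "Market St", "Harbor Ave")
)

bike_share_rides_avg <- bike_share_rides %>%
  # Group by ride_id and date
  group_by(ride_id, date) %>%
  # Add duration_min_avg column: mean duration within each group
  mutate(duration_min_avg = mean(duration_min)) %>%
  # Remove duplicates based on ride_id and date, keep all cols
  distinct(ride_id, date, .keep_all = TRUE) %>%
  # Remove duration_min column
  select(-duration_min)
```

Using `mutate()` rather than `summarize()` keeps every other column in place, and `distinct()` with `.keep_all = TRUE` then retains the first row of each `ride_id`/`date` group. The result is still grouped, so adding `ungroup()` at the end may be helpful before further processing.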