
Full duplicates

You've been notified that the bike sharing data pipeline has been updated to make it more efficient, but that the update also makes duplicates more likely. To keep using the same scripts for your weekly analyses of ride statistics, you'll first need to remove any duplicates from the dataset.

When multiple rows of a data frame share the same values in every column, they're full duplicates of each other. Removing them is important, since repeated rows can distort summary statistics like the mean and median. Each ride, including its ride_id, should be unique.
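To see why full duplicates matter, here is a minimal sketch on a made-up data frame. The name rides and its values are assumptions for illustration only, not the course's bike_share_rides data. It uses base R's duplicated(), which flags every row that repeats an earlier row across all columns.

```r
# Hypothetical toy data: the ride with ride_id 2 appears twice
rides <- data.frame(
  ride_id      = c(1, 2, 2, 3),
  duration_min = c(10, 40, 40, 10)
)

# TRUE marks a row that exactly repeats an earlier row in every column
is_dupe <- duplicated(rides)
is_dupe
# FALSE FALSE  TRUE FALSE

# Mean duration with the duplicate row counted twice
mean_with_dupes <- mean(rides$duration_min)
mean_with_dupes
# 25

# Mean duration after dropping the repeated row
mean_without_dupes <- mean(rides$duration_min[!is_dupe])
mean_without_dupes
# 20
```

A single repeated row is enough to shift the mean here from 20 to 25, which is the kind of distortion that removing duplicates prevents.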

dplyr is loaded and bike_share_rides is available.

This exercise is part of the course

Cleaning Data in R


Exercise instructions

  • Get the total number of full duplicates in bike_share_rides.
  • Remove all full duplicates from bike_share_rides and save the new data frame as bike_share_rides_unique.
  • Get the total number of full duplicates in the new bike_share_rides_unique data frame.
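The three steps above can be sketched on a small stand-in data frame. This is one possible approach rather than the exercise's official solution: toy_rides and its values are hypothetical, and the sketch assumes that sum(duplicated(...)) for counting and dplyr's distinct() for removal are acceptable choices.

```r
library(dplyr)

# Hypothetical stand-in for bike_share_rides: ride_id 2 is fully duplicated
toy_rides <- data.frame(
  ride_id      = c(1, 2, 2, 3),
  duration_min = c(10, 40, 40, 10)
)

# Step 1: count full duplicates (rows that repeat an earlier row exactly)
n_dupes_before <- sum(duplicated(toy_rides))

# Step 2: keep one copy of each distinct row;
# with no columns named, distinct() compares all columns
toy_rides_unique <- distinct(toy_rides)

# Step 3: recount to confirm no full duplicates remain
n_dupes_after <- sum(duplicated(toy_rides_unique))
```

On this toy data the first count is 1 and the second is 0, with three rows left in toy_rides_unique.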

Hands-on interactive exercise

Have a go at this exercise by completing this sample code.

# Count the number of full duplicates
___

# Remove duplicates
bike_share_rides_unique <- ___

# Count the full duplicates in bike_share_rides_unique
___