Full duplicates

You've been notified that the bike sharing data pipeline has been updated to make it more efficient, but that the update also makes duplicates more likely. To keep using the same scripts for your weekly ride statistics analyses, you'll need to remove any duplicates from the dataset first.

When multiple rows of a data frame share the same values in every column, they are full duplicates of each other. Removing them is important because repeated rows can distort summary statistics such as the mean and median. Each ride, including its ride_id, should be unique.
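As a minimal sketch of this idea, the example below builds a small, hypothetical data frame (toy_rides and its values are invented for illustration and are not the course's bike_share_rides data). It counts full duplicates with base R's duplicated(), removes them with dplyr's distinct(), and compares the mean before and after.

```r
library(dplyr)

# Hypothetical toy data: the last two rows are identical in every column
toy_rides <- data.frame(
  ride_id  = c(1, 2, 3, 3),
  duration = c(10, 20, 60, 60)
)

# duplicated() flags each row that fully repeats an earlier row
n_dupes <- sum(duplicated(toy_rides))

# distinct() with no column arguments keeps one copy of each identical row
toy_rides_unique <- distinct(toy_rides)

# Re-count to confirm that no full duplicates remain
n_dupes_after <- sum(duplicated(toy_rides_unique))

# The repeated 60-minute ride pulls the mean upward until it is removed
mean_with_dupes    <- mean(toy_rides$duration)
mean_without_dupes <- mean(toy_rides_unique$duration)
```

In this toy case the mean duration drops from 37.5 to 30 once the repeated row is removed, which is the kind of distortion the paragraph above describes.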

dplyr is loaded and bike_share_rides is available.

This exercise is part of the course

Cleaning Data in R

Instructions

Get the total number of full duplicates in bike_share_rides.
Remove all full duplicates from bike_share_rides and save the new data frame as bike_share_rides_unique.
Get the total number of full duplicates in the new bike_share_rides_unique data frame.

Hands-on interactive exercise

Try this exercise by completing the sample code below.

# Count the number of full duplicates
___

# Remove duplicates
bike_share_rides_unique <- ___

# Count the full duplicates in bike_share_rides_unique
___

In this chapter, you'll learn how to overcome some of the most common dirty data problems. You'll convert data types, apply range constraints to remove future data points, and remove duplicated data points to avoid double-counting.

Exercise 1: Data type constraints
Exercise 2: Common data types
Exercise 3: Converting data types
Exercise 4: Trimming strings
Exercise 5: Range constraints
Exercise 6: Ride duration constraints
Exercise 7: Back to the future
Exercise 8: Uniqueness constraints
Exercise 9: Full duplicates (current exercise)
Exercise 10: Removing partial duplicates
Exercise 11: Aggregating partial duplicates

Categorical and text data can often be some of the messiest parts of a dataset due to their unstructured nature. In this chapter, you’ll learn how to fix whitespace and capitalization inconsistencies in category labels, collapse multiple categories into one, and reformat strings for consistency.

Exercise 1: Checking membership
Exercise 2: Members only
Exercise 3: Not a member
Exercise 4: Categorical data problems
Exercise 5: Identifying inconsistency
Exercise 6: Correcting inconsistency
Exercise 7: Collapsing categories
Exercise 8: Cleaning text data
Exercise 9: Detecting inconsistent text data
Exercise 10: Replacing and removing
Exercise 11: Invalid phone numbers

In this chapter, you’ll dive into more advanced data cleaning problems, such as ensuring that weights are all written in kilograms instead of pounds. You’ll also gain invaluable skills that will help you verify that values have been added correctly and that missing values don’t negatively impact your analyses.

Exercise 1: Uniformity
Exercise 2: Date uniformity
Exercise 3: Currency uniformity
Exercise 4: Cross field validation
Exercise 5: Validating totals
Exercise 6: Validating age
Exercise 7: Completeness
Exercise 8: Types of missingness
Exercise 9: Visualizing missing data
Exercise 10: Treating missing data

Record linkage is a powerful technique for merging datasets whose values contain typos or different spellings. In this chapter, you'll learn how to link records by calculating the similarity between strings. You'll then use your new skills to join two restaurant review datasets into one clean master dataset.

Exercise 1: Comparing strings
Exercise 2: Calculating distance
Exercise 3: Small distance, small difference
Exercise 4: Fixing typos with string distance
Exercise 5: Generating and comparing pairs
Exercise 6: Link or join?
Exercise 7: Pair blocking
Exercise 8: Comparing pairs
Exercise 9: Scoring and linking
Exercise 10: Score then select or select then score?
Exercise 11: Putting it together
Exercise 12: Congratulations!