Finding duplicates

A new update to the data pipeline feeding into ride_sharing has added the ride_id column, which represents a unique identifier for each ride.

The update however coincided with radically shorter average ride duration times and irregular user birth dates set in the future. Most importantly, the number of rides taken has increased by 20% overnight, leading you to think there might be both complete and incomplete duplicates in the ride_sharing DataFrame.

In this exercise, you will confirm this suspicion by finding those duplicates. A sample of ride_sharing is in your environment, as well as all the packages you've been working with thus far.

Bu egzersiz

Cleaning Data in Python

kursunun bir parçasıdır

Kursu Görüntüle

Egzersiz talimatları

Find duplicated rows of ride_id in the ride_sharing DataFrame while setting keep to False.
Subset ride_sharing on duplicates and sort by ride_id and assign the results to duplicated_rides.
Print the ride_id, duration and user_birth_year columns of duplicated_rides in that order.

Uygulamalı interaktif egzersiz

Bu örnek kodu tamamlayarak bu egzersizi bitirin.

# Find duplicates
duplicates = ____.____(____, ____)

# Sort your duplicated rides
duplicated_rides = ride_sharing[____].____('____')

# Print relevant columns of duplicated_rides
print(duplicated_rides[['____','____','____']])

Kodu Düzenle ve Çalıştır

Bu egzersiz

Cleaning Data in Python

kursunun bir parçasıdır

IntermediárioNível de habilidade

4.8+

Kursa Ücretsiz Başlayın

In this chapter, you'll learn how to overcome some of the most common dirty data problems. You'll convert data types, apply range constraints to remove future data points, and remove duplicated data points to avoid double-counting.

Exercise 1: Data type constraints Exercise 2: Common data types Exercise 3: Numeric data or ... ?Exercise 4: Summing strings and concatenating numbers Exercise 5: Data range constraints Exercise 6: Tire size constraints Exercise 7: Back to the future Exercise 8: Uniqueness constraints Exercise 9: How big is your subset?Exercise 10: Finding duplicates

Geçerli Egzersiz

Exercise 11: Treating duplicates

Categorical and text data can often be some of the messiest parts of a dataset due to their unstructured nature. In this chapter, you’ll learn how to fix whitespace and capitalization inconsistencies in category labels, collapse multiple categories into one, and reformat strings for consistency.

Exercise 1: Membership constraints Exercise 2: Members only Exercise 3: Finding consistency Exercise 4: Categorical variables Exercise 5: Categories of errors Exercise 6: Inconsistent categories Exercise 7: Remapping categories Exercise 8: Cleaning text data Exercise 9: Removing titles and taking names Exercise 10: Keeping it descriptive

In this chapter, you'll dive into more advanced data cleaning problems, such as ensuring that weights are all written in kilograms instead of pounds. You'll also gain invaluable skills that will help you verify that values have been added correctly, and that missing values don't negatively impact your analyses.

Exercise 1: Uniformity Exercise 2: Ambiguous dates Exercise 3: Uniform currencies Exercise 4: Uniform dates Exercise 5: Cross field validation Exercise 6: Cross field or no cross field?Exercise 7: How's our data integrity?Exercise 8: Completeness Exercise 9: Is this missing at random?Exercise 10: Missing investors Exercise 11: Follow the money

Record linkage is a powerful technique used to merge multiple datasets together, used when values have typos or different spellings. In this chapter, you'll learn how to link records by calculating the similarity between strings—you'll then use your new skills to join two restaurant review datasets into one clean master dataset.

Exercise 1: Comparing strings Exercise 2: Minimum edit distance Exercise 3: The cutoff point Exercise 4: Remapping categories II Exercise 5: Generating pairs Exercise 6: To link or not to link?Exercise 7: Pairs of restaurants Exercise 8: Similar restaurants Exercise 9: Linking DataFrames Exercise 10: Getting the right index Exercise 11: Linking them together!Exercise 12: Congratulations!