Treating duplicates
In the last exercise, you were able to verify that the new update feeding into ride_sharing contains a bug generating both complete and incomplete duplicated rows for some values of the ride_id column, with occasional discrepant values for the user_birth_year and duration columns.
In this exercise, you will treat those duplicated rows by first dropping complete duplicates, and then merging the incomplete duplicate rows into one while keeping the average duration and the minimum user_birth_year for each set of incomplete duplicate rows.
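To make the distinction between complete and incomplete duplicates concrete, here is a minimal sketch on a small made-up DataFrame. The data and the variable names rides, complete, by_id and incomplete are assumptions for illustration and are not part of the course dataset. The flagging logic assumes each ride_id is affected by only one kind of duplication.

```python
import pandas as pd

# Hypothetical toy data standing in for ride_sharing
rides = pd.DataFrame({
    "ride_id": [1, 1, 2, 2, 3],
    "duration": [10, 10, 12, 14, 8],
    "user_birth_year": [1990, 1990, 1985, 1986, 1979],
})

# Complete duplicates: every column matches another row
complete = rides.duplicated(keep=False)

# Rows whose ride_id appears more than once, whatever the other columns hold
by_id = rides.duplicated(subset="ride_id", keep=False)

# Incomplete duplicates: repeated ride_id, but at least one other column differs
incomplete = by_id & ~complete

print(rides[complete])    # the two identical rows for ride_id 1
print(rides[incomplete])  # the two conflicting rows for ride_id 2
```

Complete duplicates can simply be dropped, while incomplete ones need a rule (here, mean and minimum) to decide which values survive.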
This exercise is part of the course
Cleaning Data in Python
Exercise instructions
- Drop complete duplicates in ride_sharing and store the results in ride_dup.
- Create the statistics dictionary which holds minimum aggregation for user_birth_year and mean aggregation for duration.
- Drop incomplete duplicates by grouping by ride_id and applying the aggregation in statistics.
- Find duplicates again and run the assert statement to verify de-duplication.
Hands-on interactive exercise
Have a go at this exercise by completing this sample code.
# Drop complete duplicates from ride_sharing
ride_dup = ____.____()
# Create statistics dictionary for aggregation function
statistics = {'user_birth_year': ____, 'duration': ____}
# Group by ride_id and compute new statistics
ride_unique = ride_dup.____('____').____(____).reset_index()
# Find duplicated values again
duplicates = ride_unique.____(subset = 'ride_id', keep = False)
duplicated_rides = ride_unique[duplicates]
# Assert duplicates are processed
assert duplicated_rides.shape[0] == 0
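
The steps above can be sketched end to end on a small made-up DataFrame. This is one possible way to complete the template, not necessarily the course's reference solution, and the toy values in ride_sharing are assumptions chosen so that the result is easy to check by hand.

```python
import pandas as pd

# Hypothetical toy data standing in for the real ride_sharing DataFrame
ride_sharing = pd.DataFrame({
    "ride_id": [1, 1, 2, 2, 3],
    "duration": [10, 10, 12, 14, 8],
    "user_birth_year": [1990, 1990, 1985, 1986, 1979],
})

# Drop complete duplicates (rows identical in every column)
ride_dup = ride_sharing.drop_duplicates()

# Aggregation rules: minimum birth year, mean duration
statistics = {"user_birth_year": "min", "duration": "mean"}

# Collapse the remaining rows so that each ride_id appears once
ride_unique = ride_dup.groupby("ride_id").agg(statistics).reset_index()

# Check that no ride_id is duplicated any more
duplicates = ride_unique.duplicated(subset="ride_id", keep=False)
duplicated_rides = ride_unique[duplicates]
assert duplicated_rides.shape[0] == 0

print(ride_unique)
```

Note that agg with a dictionary keeps only the columns named in that dictionary (plus the grouping key), so any other column in the real dataset would need its own entry in statistics to survive the merge.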