Treating duplicates
In the last exercise, you were able to verify that the new update feeding into ride_sharing contains a bug generating both complete and incomplete duplicated rows for some values of the ride_id column, with occasional discrepant values for the user_birth_year and duration columns.
In this exercise, you will be treating those duplicated rows by first dropping complete duplicates, and then merging the incomplete duplicate rows into one while keeping the average duration, and the minimum user_birth_year for each set of incomplete duplicate rows.
Deze oefening maakt deel uit van de cursus
Cleaning Data in Python
Oefeninstructies
- Drop complete duplicates in
ride_sharingand store the results inride_dup. - Create the
statisticsdictionary which holds minimum aggregation foruser_birth_yearand mean aggregation forduration. - Drop incomplete duplicates by grouping by
ride_idand applying the aggregation instatistics. - Find duplicates again and run the
assertstatement to verify de-duplication.
Praktische interactieve oefening
Probeer deze oefening eens door deze voorbeeldcode in te vullen.
# Drop complete duplicates from ride_sharing
ride_dup = ____.____()
# Create statistics dictionary for aggregation function
statistics = {'user_birth_year': ____, 'duration': ____}
# Group by ride_id and compute new statistics
ride_unique = ride_dup.____('____').____(____).reset_index()
# Find duplicated values again
duplicates = ride_unique.____(subset = 'ride_id', keep = False)
duplicated_rides = ride_unique[duplicates == True]
# Assert duplicates are processed
assert duplicated_rides.shape[0] == 0