1. Learn
  2. /
  3. Courses
  4. /
  5. Cleaning Data in Python

Connected

Exercise

Treating duplicates

In the last exercise, you were able to verify that the new update feeding into ride_sharing contains a bug generating both complete and incomplete duplicated rows for some values of the ride_id column, with occasional discrepant values for the user_birth_year and duration columns.

In this exercise, you will be treating those duplicated rows by first dropping complete duplicates, and then merging the incomplete duplicate rows into one while keeping the average duration, and the minimum user_birth_year for each set of incomplete duplicate rows.

Instructions

100 XP
  • Drop complete duplicates in ride_sharing and store the results in ride_dup.
  • Create the statistics dictionary which holds minimum aggregation for user_birth_year and mean aggregation for duration.
  • Drop incomplete duplicates by grouping by ride_id and applying the aggregation in statistics.
  • Find duplicates again and run the assert statement to verify de-duplication.