Finding duplicates
A new update to the data pipeline feeding into ride_sharing has added the ride_id column, which represents a unique identifier for each ride.
The update, however, coincided with radically shorter average ride durations and irregular user birth dates set in the future. Most importantly, the number of rides taken has increased by 20% overnight, leading you to suspect there might be both complete and incomplete duplicates in the ride_sharing DataFrame.
In this exercise, you will confirm this suspicion by finding those duplicates. A sample of ride_sharing is in your environment, as well as all the packages you've been working with thus far.
This exercise is part of the course
Cleaning Data in Python
Exercise instructions
- Find duplicated rows of ride_id in the ride_sharing DataFrame while setting keep to False.
- Subset ride_sharing on duplicates and sort by ride_id, and assign the results to duplicated_rides.
- Print the ride_id, duration and user_birth_year columns of duplicated_rides in that order.
Hands-on interactive exercise
Try this exercise by completing the sample code below.
# Find duplicates
duplicates = ____.____(____, ____)
# Sort your duplicated rides
duplicated_rides = ride_sharing[____].____('____')
# Print relevant columns of duplicated_rides
print(duplicated_rides[['____','____','____']])
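The same technique can be sketched on a small, made-up DataFrame (the values below are illustrative, not the actual course dataset): `duplicated(subset=..., keep=False)` flags every row whose ride_id occurs more than once, and the boolean mask then subsets and sorts the frame so duplicate pairs sit next to each other.

```python
import pandas as pd

# Toy stand-in for the ride_sharing sample (hypothetical values)
ride_sharing = pd.DataFrame({
    "ride_id": [33, 33, 55, 71, 71, 89],
    "duration": [10, 10, 9, 11, 600, 9],
    "user_birth_year": [1979, 1979, 1985, 1997, 2040, 1990],
})

# keep=False marks ALL occurrences of a duplicated ride_id,
# not just the second and later ones
duplicates = ride_sharing.duplicated(subset="ride_id", keep=False)

# Subset on the boolean mask and sort so duplicates appear together
duplicated_rides = ride_sharing[duplicates].sort_values("ride_id")

print(duplicated_rides[["ride_id", "duration", "user_birth_year"]])
```

Note that with the default `keep="first"` only the repeated rows would be flagged; `keep=False` is what lets you inspect both members of each duplicate pair side by side.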