Finding duplicates
A new update to the data pipeline feeding into ride_sharing has added the ride_id column, which represents a unique identifier for each ride.

The update, however, coincided with radically shorter average ride durations and irregular user birth dates set in the future. Most importantly, the number of rides taken has increased by 20% overnight, leading you to think there might be both complete and incomplete duplicates in the ride_sharing DataFrame.
In this exercise, you will confirm this suspicion by finding those duplicates. A sample of ride_sharing is in your environment, as well as all the packages you've been working with thus far.
This exercise is part of the course Cleaning Data in Python.
Exercise instructions
- Find duplicated rows of ride_id in the ride_sharing DataFrame while setting keep to False.
- Subset ride_sharing on duplicates, sort by ride_id, and assign the results to duplicated_rides.
- Print the ride_id, duration, and user_birth_year columns of duplicated_rides, in that order.
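As a quick reminder of why keep matters here, a minimal sketch on a toy DataFrame (not the actual ride_sharing data): with keep=False, every member of a duplicated group is flagged, rather than sparing the first or last occurrence.

```python
import pandas as pd

# Toy data: ride 'a' appears twice, ride 'b' once
df = pd.DataFrame({'ride_id': ['a', 'a', 'b'], 'duration': [10, 12, 7]})

# keep=False marks ALL copies of a repeated ride_id as True,
# so both rows for ride 'a' are flagged
mask = df.duplicated(subset='ride_id', keep=False)
print(mask.tolist())  # [True, True, False]
```

This is what makes keep=False the right choice for inspecting duplicates: you see every copy side by side, not just the extras.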
Hands-on interactive exercise
Have a go at this exercise by completing this sample code.
# Find duplicates
duplicates = ____.____(____, ____)
# Sort your duplicated rides
duplicated_rides = ride_sharing[____].____('____')
# Print relevant columns of duplicated_rides
print(duplicated_rides[['____','____','____']])
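One way the blanks above could be filled in, as a self-contained sketch: the ride_sharing DataFrame below is a small stand-in with made-up rows, since the real sample only exists in the exercise environment.

```python
import pandas as pd

# Stand-in for the ride_sharing sample provided in the exercise environment
ride_sharing = pd.DataFrame({
    'ride_id': [33, 55, 33, 71],
    'duration': [10, 9, 2, 4],
    'user_birth_year': [1979, 1985, 1979, 2060],
})

# Find duplicates: keep=False flags every copy of a repeated ride_id
duplicates = ride_sharing.duplicated(subset='ride_id', keep=False)

# Subset on the boolean mask, then sort so copies appear together
duplicated_rides = ride_sharing[duplicates].sort_values('ride_id')

# Print relevant columns of duplicated_rides
print(duplicated_rides[['ride_id', 'duration', 'user_birth_year']])
```

With the toy data above, only the two rows for ride_id 33 are printed; sorting by ride_id groups each set of duplicates together for easy inspection.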