EDA plots I
After generating a couple of basic statistics, it's time to come up with and validate some ideas about the dependencies in the data. Again, the train DataFrame from the taxi competition is already available in your workspace.
To begin with, let's make a scatterplot of the relationship between the fare amount and the distance of the ride. Intuitively, the longer the ride, the higher its price.
To get the distance in kilometers between two geo-coordinates, you will use the Haversine distance. Its calculation is available in the haversine_distance() function defined for you. The function expects the train DataFrame as input.
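The course supplies haversine_distance() without showing its body, so the following is only a rough sketch of what such a function might look like, not the course's actual implementation. The column names (pickup_latitude, pickup_longitude, dropoff_latitude, dropoff_longitude), the Earth radius constant, and the small sample DataFrame are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, an assumed constant


def haversine_distance(df: pd.DataFrame) -> pd.Series:
    """Great-circle distance in km between pickup and dropoff points.

    The column names used here are assumed, not taken from the exercise.
    """
    # Convert degrees to radians before applying trigonometric functions
    lat1 = np.radians(df["pickup_latitude"])
    lon1 = np.radians(df["pickup_longitude"])
    lat2 = np.radians(df["dropoff_latitude"])
    lon2 = np.radians(df["dropoff_longitude"])

    dlat = lat2 - lat1
    dlon = lon2 - lon1

    # Haversine formula
    a = np.sin(dlat / 2) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2) ** 2
    return 2 * EARTH_RADIUS_KM * np.arcsin(np.sqrt(a))


# Made-up sample: one degree of longitude along the equator, and a zero-length ride
sample = pd.DataFrame({
    "pickup_latitude": [0.0, 40.75],
    "pickup_longitude": [0.0, -73.99],
    "dropoff_latitude": [0.0, 40.75],
    "dropoff_longitude": [1.0, -73.99],
})
sample["distance_km"] = haversine_distance(sample)
```

In this sketch the first sample row spans one degree of longitude on the equator, which is a little over 111 km, and the second row has identical pickup and dropoff points, so its distance is zero.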
This exercise is part of the course
Winning a Kaggle Competition in Python
Exercise instructions
- Create a new variable "distance_km" as the Haversine distance between pickup and dropoff points.
- Plot a scatterplot with "fare_amount" on the x axis and "distance_km" on the y axis. To draw a scatterplot, use matplotlib's scatter() method.
- Limit the ride distance to between 0 and 50 kilometers to avoid plotting outliers.
Hands-on interactive exercise
Have a go at this exercise by completing this sample code.
# Calculate the ride distance
train['distance_km'] = ____(train)
# Draw a scatterplot
plt.____(x=____[____], y=____[____], alpha=0.5)
plt.xlabel('Fare amount')
plt.ylabel('Distance, km')
plt.title('Fare amount based on the distance')
# Limit on the distance
plt.ylim(0, ____)
plt.show()
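For reference, here is one way the plotting steps could look once filled in. It is a sketch on a made-up stand-in for train, with "distance_km" already computed, rather than the course's data or its official solution. The sample values, the output file name, and the choice of the non-interactive Agg backend are assumptions for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd

# Made-up stand-in for the train DataFrame, with the distance already computed
train = pd.DataFrame({
    "fare_amount": [6.5, 12.0, 30.5, 52.0, 9.0],
    "distance_km": [1.2, 3.8, 11.4, 21.0, 180.0],  # last row is an outlier
})

# Draw a scatterplot: fare on the x axis, distance on the y axis
plt.scatter(x=train["fare_amount"], y=train["distance_km"], alpha=0.5)
plt.xlabel("Fare amount")
plt.ylabel("Distance, km")
plt.title("Fare amount based on the distance")

# Limit the distance axis to 0-50 km so the outlier does not squash the plot
plt.ylim(0, 50)

# Save to a file (hypothetical name); use plt.show() in an interactive session
plt.savefig("fare_vs_distance.png")
```

The 180 km ride is still in the data. plt.ylim() only restricts the visible range, so nothing is filtered out of the DataFrame.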