Get startedGet started for free

EDA plots I

After generating a couple of basic statistics, it's time to come up with and validate some ideas about the data dependencies. Again, the train DataFrame from the taxi competition is already available in your workspace.

To begin with, let's make a scatterplot plotting the relationship between the fare amount and the distance of the ride. Intuitively, the longer the ride, the higher its price.

To get the distance in kilometers between two geo-coordinates, you will use Haversine distance. Its calculation is available with the haversine_distance() function defined for you. The function expects train DataFrame as input.

This exercise is part of the course

Winning a Kaggle Competition in Python

View Course

Exercise instructions

  • Create a new variable "distance_km" as Haversine distance between pickup and dropoff points.
  • Plot a scatterplot with "fare_amount" on the x axis and "distance_km" on the y axis. To draw a scatterplot use matplotlib scatter() method.
  • Set a limit on a ride distance to be between 0 and 50 kilometers to avoid plotting outliers.

Hands-on interactive exercise

Have a go at this exercise by completing this sample code.

# Calculate the ride distance
train['distance_km'] = ____(train)

# Draw a scatterplot
plt.____(x=____[____], y=____[____], alpha=0.5)
plt.xlabel('Fare amount')
plt.ylabel('Distance, km')
plt.title('Fare amount based on the distance')

# Limit on the distance
plt.ylim(0, ____)
plt.show()
Edit and Run Code