
EDA plots I

After generating a couple of basic statistics, it's time to form and validate some hypotheses about dependencies in the data. Again, the train DataFrame from the taxi competition is already available in your workspace.

To begin with, let's draw a scatterplot of the relationship between the fare amount and the distance of the ride. Intuitively, the longer the ride, the higher its price.

To get the distance in kilometers between two geo-coordinates, you will use the Haversine distance. Its calculation is available through the haversine_distance() function, which is already defined for you. The function expects the train DataFrame as input.
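The body of haversine_distance() is not shown in this exercise, so the following is only a minimal sketch of how such a function might look. The function name haversine_distance_sketch, the four column names, and the Earth radius of 6371 km are assumptions made for illustration, not taken from the course.

```python
import numpy as np
import pandas as pd

EARTH_RADIUS_KM = 6371.0  # assumption: mean Earth radius, an approximation


def haversine_distance_sketch(df):
    """Great-circle distance in km between pickup and dropoff points.

    The column names below are assumptions; adjust them to your data.
    """
    # Trigonometric functions work in radians, so convert from degrees first
    lat1 = np.radians(df["pickup_latitude"])
    lon1 = np.radians(df["pickup_longitude"])
    lat2 = np.radians(df["dropoff_latitude"])
    lon2 = np.radians(df["dropoff_longitude"])

    dlat = lat2 - lat1
    dlon = lon2 - lon1

    # Haversine formula
    a = np.sin(dlat / 2) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2) ** 2
    return 2 * EARTH_RADIUS_KM * np.arcsin(np.sqrt(a))


# Toy data: one ride that starts and ends at the same point, and one
# quarter-circle trip along the equator
toy = pd.DataFrame({
    "pickup_latitude": [40.0, 0.0],
    "pickup_longitude": [-74.0, 0.0],
    "dropoff_latitude": [40.0, 0.0],
    "dropoff_longitude": [-74.0, 90.0],
})
toy["distance_km"] = haversine_distance_sketch(toy)
print(toy["distance_km"])
```

The second toy row is a convenient sanity check: a quarter of the way around the equator should come out at roughly 10,000 km.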

This exercise is part of the course

Winning a Kaggle Competition in Python

Exercise instructions

  • Create a new variable "distance_km" as the Haversine distance between the pickup and dropoff points.
  • Draw a scatterplot with "fare_amount" on the x axis and "distance_km" on the y axis. To draw a scatterplot, use matplotlib's scatter() function.
  • Limit the ride distance shown on the y axis to between 0 and 50 kilometers to avoid plotting outliers.
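To see why the axis limit matters, here is a small sketch on synthetic data. The fare and distance values are made up for illustration and do not come from the taxi competition, and the Agg backend is selected only so that the sketch also runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend; drop this line in a notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# 200 synthetic rides with a roughly linear fare, plus one implausible
# 8,000 km ride that mimics a corrupted GPS record
distance = np.append(rng.uniform(0.5, 20.0, size=200), 8000.0)
fare = np.append(2.5 + 1.5 * distance[:-1] + rng.normal(0, 2, size=200), 10.0)
toy = pd.DataFrame({"fare_amount": fare, "distance_km": distance})

fig, ax = plt.subplots()
ax.scatter(x=toy["fare_amount"], y=toy["distance_km"], alpha=0.5)
ax.set_xlabel("Fare amount")
ax.set_ylabel("Distance, km")

# Without this limit, the single outlier would stretch the y axis to 8,000 km
# and squash all regular rides into a thin band at the bottom
ax.set_ylim(0, 50)

# The limit only changes the view; the outlier is still in the DataFrame
hidden = int((~toy["distance_km"].between(0, 50)).sum())
print(f"{hidden} point(s) fall outside the visible range")
```

Note that an axis limit hides points rather than removing them, so any statistics computed on the DataFrame still include the outliers.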

Hands-on interactive exercise

Try this exercise by completing the sample code.

# Calculate the ride distance
train['distance_km'] = ____(train)

# Draw a scatterplot
plt.____(x=____[____], y=____[____], alpha=0.5)
plt.xlabel('Fare amount')
plt.ylabel('Distance, km')
plt.title('Fare amount based on the distance')

# Limit on the distance
plt.ylim(0, ____)
plt.show()