Aan de slagGa gratis aan de slag

EDA plots I

After generating a couple of basic statistics, it's time to come up with and validate some ideas about the data dependencies. Again, the train DataFrame from the taxi competition is already available in your workspace.

To begin with, let's make a scatterplot plotting the relationship between the fare amount and the distance of the ride. Intuitively, the longer the ride, the higher its price.

To get the distance in kilometers between two geo-coordinates, you will use Haversine distance. Its calculation is available with the haversine_distance() function defined for you. The function expects train DataFrame as input.

Deze oefening maakt deel uit van de cursus

Winning a Kaggle Competition in Python

Cursus bekijken

Oefeninstructies

  • Create a new variable "distance_km" as Haversine distance between pickup and dropoff points.
  • Plot a scatterplot with "fare_amount" on the x axis and "distance_km" on the y axis. To draw a scatterplot use matplotlib scatter() method.
  • Set a limit on a ride distance to be between 0 and 50 kilometers to avoid plotting outliers.

Praktische interactieve oefening

Probeer deze oefening eens door deze voorbeeldcode in te vullen.

# Calculate the ride distance
train['distance_km'] = ____(train)

# Draw a scatterplot
plt.____(x=____[____], y=____[____], alpha=0.5)
plt.xlabel('Fare amount')
plt.ylabel('Distance, km')
plt.title('Fare amount based on the distance')

# Limit on the distance
plt.ylim(0, ____)
plt.show()
Code bewerken en uitvoeren