
Top 100 routes dataset

1. Top 100 routes dataset

Our goal is to predict the supply and demand of bikes at different stations. To do that, we need to go beyond ride counts aggregated across all stations and get a better feel for how ridership looks for individual routes and stations, checking for route-specific behaviors that we should be aware of when building a predictive model.

2. Studying routes

There were 546 active BIXI bike stations in 2017, and we observe in the data nearly 190,000 unique station-to-station routes that were taken. While we could technically visualize all of these routes with Trelliscope (or at least ones with a non-trivial number of counts), for the sake of this course, we will just look at the top 100. As with many visualizations, some effort is required to construct the appropriate dataset.

3. Route frequency

To get to the top 100 routes, let's start by tabulating the number of rides for each unique route in the data, where a unique route is defined as a unique pair of start and end stations. For now, we will ignore "routes" that start and end at the same station. We see that more than 190 thousand routes were observed, and that the top routes were ridden more than 2 thousand times over the course of the year. Note that we are now using the full 4 million ride dataset to construct this top 100 routes dataset.
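The tabulation described here can be sketched with dplyr roughly as follows. This is a minimal illustration on a toy data frame, not the course's exact code: the object names `bike` and `route_tab` and the column names are assumptions, and the eight toy rides stand in for the full ride dataset.

```r
library(dplyr)

# Toy stand-in for the full ride data (object and column names are assumed)
bike <- tibble(
  start_station_code = c(1, 1, 1, 1, 2, 2, 3, 3),
  end_station_code   = c(2, 2, 2, 3, 1, 2, 1, 1)
)

route_tab <- bike %>%
  # ignore "routes" that start and end at the same station
  filter(start_station_code != end_station_code) %>%
  # one row per unique start/end pair, with the ride count in column n,
  # sorted so the most frequently ridden routes come first
  count(start_station_code, end_station_code, sort = TRUE)

route_tab
```

Using `count()` with `sort = TRUE` is a shorthand for grouping by the two station codes, summarising with `n()`, and arranging by descending count.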

4. Data for the top 100 routes

We can use our tabulated routes dataset to get the station codes for the top 100 routes and paste each start and end code together as a string to use in filtering the full bike data. We then filter the bike data to include only the start and end station code combinations that match those of the top 100 routes.
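One way to express this paste-and-filter step is sketched below. The names `bike`, `route_tab`, `top_routes`, `top_bike`, and `top_n_routes` are assumptions made for the illustration, and the toy example keeps the top 2 routes rather than the top 100 so that the result is easy to inspect.

```r
library(dplyr)

# Toy stand-in for the full ride data (object and column names are assumed)
bike <- tibble(
  start_station_code = c(1, 1, 1, 1, 2, 3, 3, 2),
  end_station_code   = c(2, 2, 2, 3, 1, 1, 1, 3),
  duration_sec       = c(300, 320, 310, 600, 330, 450, 470, 700)
)

top_n_routes <- 2  # hypothetical setting; the course keeps the top 100

# Tabulate rides per route, most frequent first
route_tab <- bike %>%
  filter(start_station_code != end_station_code) %>%
  count(start_station_code, end_station_code, sort = TRUE)

# Paste start and end codes together into one string per top route
top <- head(route_tab, top_n_routes)
top_routes <- paste(top$start_station_code, top$end_station_code)

# Keep only rides whose start/end combination is one of the top routes
top_bike <- bike %>%
  filter(paste(start_station_code, end_station_code) %in% top_routes)
```

Pasting the two codes into a single key keeps the filter to one `%in%` test; a `semi_join()` against the top rows of `route_tab` would be an alternative that avoids building strings.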

5. Getting ready for visualization

There are many route visualizations we can and should experiment with, and the "top100" dataset can serve as the basis for many derivative datasets for different visualization purposes. We have already seen that it is quite useful to look at counts by hour-of-day, so let's continue that theme and construct a dataset of hour-of-day counts for each of the top routes, split by workweek / weekend. To do this, we group by route (start and end station), the hour of day, and whether the day is a weekday or not. After tabulating the data, we bring in additional datasets that contain station metadata, such as the station name and location, which will be useful for visualization. As a final step, we merge these datasets using left_join(), which by default joins on the variable names shared by the two datasets being joined, here start_station_code and end_station_code.
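A rough sketch of this grouping and joining is shown below. It is an illustration under assumptions rather than the course's code: the object names (`top_bike`, `route_hod`, `stations`, `start_stations`, `end_stations`), the `start_date` column, and the toy station metadata are all hypothetical.

```r
library(dplyr)

# Toy rides on the top routes (object and column names are assumed)
top_bike <- tibble(
  start_station_code = c(1, 1, 1, 3),
  end_station_code   = c(2, 2, 2, 1),
  start_date = as.POSIXct(c(
    "2017-06-05 08:10:00",  # Monday
    "2017-06-06 08:45:00",  # Tuesday
    "2017-06-10 14:20:00",  # Saturday
    "2017-06-11 14:05:00"   # Sunday
  ), tz = "UTC")
)

# Count rides by route, hour of day, and workweek / weekend
route_hod <- top_bike %>%
  mutate(
    hour = as.POSIXlt(start_date)$hour,
    # POSIXlt wday runs 0-6 with Sunday = 0, so 0 and 6 are weekend days
    weekday = ifelse(as.POSIXlt(start_date)$wday %in% c(0, 6),
                     "weekend", "workweek")
  ) %>%
  count(start_station_code, end_station_code, hour, weekday)

# Hypothetical station metadata, prepared once for each end of the route
stations <- tibble(
  code = c(1, 2, 3),
  name = c("Station A", "Station B", "Station C")
)
start_stations <- stations %>%
  rename(start_station_code = code, start_station_name = name)
end_stations <- stations %>%
  rename(end_station_code = code, end_station_name = name)

# Attach start and end station metadata to each row of counts
route_hod <- route_hod %>%
  left_join(start_stations, by = "start_station_code") %>%
  left_join(end_stations, by = "end_station_code")
```

The `by` argument is spelled out here for clarity. Leaving it out lets `left_join()` pick the shared column names on its own, which is the automatic behavior the narration refers to.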

6. Let's visualize!

Now that we've constructed a dataset for the specific task of visualizing routes, you'll visualize it in the next exercise!