Exploring categorical variables with AirBnB data

1. Exploring categorical variables with AirBnB data

In this video, we’ll be exploring categorical variables using the AirBnB dataset. Specifically, I am interested in finding the city with the best accommodation and booking rates for a last-minute trip.

First, let’s look at the number of listings per city. I’ll create a new table, add city to “Values”, then add listing_id summarized as a distinct count. This table can easily be visualized as a bar chart showing the frequency of each city within the dataset. Click on “Stacked column chart” to create a new chart. Similarly, add city to “Axis” and listing_id as a distinct count to “Values”. Both Sydney and New York have just over 2,000 listings, which is more than Paris and Rome at around 1,500.

How many people do these listings typically accommodate? Does this vary by city? We can answer this question by adding the accommodates variable to the table, then changing its summarization to median. Rome has a median of 3 while the other cities have a median of 2. This can also be visualized using a bar chart, so I’ll create a new stacked column chart, adding city to “Axis” and accommodates as a median to “Values”. Now we can see the difference between cities visually.

Say my partner and I are looking for an AirBnB for a possible last-minute weekend trip. Which city has the most listings that are instantly bookable? To explore this, I’ll add another variable to the chart showing the number of listings by city. Specifically, I’ll add the instant bookable variable to “Legend”. True, meaning the instantly bookable listings, is represented by the dark blue. To see the proportion within each city, I’ll change the chart to a 100% stacked column chart. Not only do the listings in Rome accommodate more people, but a higher proportion of them can be booked instantly. Good to know.

If we did need to book instantly, which cities have better acceptance rates from the hosts? One way to answer this is by looking at the distribution of acceptance rates by city.
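For readers who want to check these summaries outside Power BI, the three aggregations used so far (distinct count of listings, median of accommodates, and the share of instantly bookable listings per city) can be sketched in plain Python. This is only an illustration: the sample rows are invented, and the key name `instant_bookable` and the function names are assumptions made here, not names taken from the course dataset or from Power BI.

```python
from collections import defaultdict
from statistics import median

# Hypothetical sample rows. The values are invented for illustration
# and are NOT taken from the AirBnB dataset used in the video.
listings = [
    {"listing_id": 1, "city": "Rome", "accommodates": 2, "instant_bookable": True},
    {"listing_id": 2, "city": "Rome", "accommodates": 3, "instant_bookable": True},
    {"listing_id": 3, "city": "Rome", "accommodates": 4, "instant_bookable": False},
    {"listing_id": 4, "city": "Paris", "accommodates": 2, "instant_bookable": False},
    {"listing_id": 5, "city": "Paris", "accommodates": 2, "instant_bookable": True},
]


def distinct_listings_per_city(rows):
    """Distinct count of listing_id per city (the first table and chart)."""
    ids = defaultdict(set)
    for row in rows:
        ids[row["city"]].add(row["listing_id"])
    return {city: len(id_set) for city, id_set in ids.items()}


def median_accommodates_per_city(rows):
    """Median of accommodates per city (the second chart)."""
    values = defaultdict(list)
    for row in rows:
        values[row["city"]].append(row["accommodates"])
    return {city: median(vals) for city, vals in values.items()}


def instant_bookable_share_per_city(rows):
    """Fraction of listings per city that are instantly bookable,
    which is what the 100% stacked column chart displays."""
    totals = defaultdict(int)
    instant = defaultdict(int)
    for row in rows:
        totals[row["city"]] += 1
        if row["instant_bookable"]:
            instant[row["city"]] += 1
    return {city: instant[city] / totals[city] for city in totals}


if __name__ == "__main__":
    print(distinct_listings_per_city(listings))
    print(median_accommodates_per_city(listings))
    print(instant_bookable_share_per_city(listings))
```

The share is computed within each city rather than across the whole dataset, which mirrors the switch from a regular stacked column chart to a 100% stacked one.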
Box plots are a great way to visualize a distribution like this. To create one, I’ll first add a new page, then click on the “Box and Whisker” visualization by MAQ Software. Note: this visualization is not available in Power BI by default. It needs to be imported from the Power BI marketplace, which can be accessed by clicking the three dots “...” in the Visualization pane, then choosing “Get more visuals”. For the exercises, it will be pre-loaded into the Power BI report.

I’ll then add host_acceptance_rate to “Values” and city to “Axis category I”. Adding listing_id to “Axis” helps the visualization display the information at the level of individual listings. The y-axis isn’t formatted correctly. We can change that by going into the “Format” tab, then “Y-axis”. Here, we want to change the number of decimal places to two. Then let’s make sure the y-axis starts at 0 and ends at 1, since host_acceptance_rate only takes values from 0 to 1, that is, acceptance rates from 0% to 100%.

Let’s revisit the components of box plots. In this Box and Whisker visualization, the dark area represents the distribution between the 1st and 2nd quartiles; the light area represents the distribution between the 2nd and 3rd quartiles. We can see the median, or the 2nd quartile, as the line between the light and dark shaded areas. Hovering over each box provides further metrics in the tooltip, including IQR, minimum, and maximum. It looks like even though hosts in Rome are more likely to allow instant booking, they do not accept as often as hosts in New York or Sydney.

Now it’s your turn to analyze categorical variables and build box plots using Glassdoor reviews data.
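To make the box plot components concrete, here is a minimal sketch that computes the numbers a box plot summarizes (minimum, 1st quartile, median, 3rd quartile, maximum, and IQR) for acceptance rates grouped by city. The sample rates are invented, `box_plot_stats` is a hypothetical helper name, and the quartiles use Python's `statistics.quantiles` with the "inclusive" method. The Box and Whisker visual may use a different quartile convention, so its tooltip values could differ slightly.

```python
from collections import defaultdict
from statistics import quantiles

# Hypothetical (city, host_acceptance_rate) pairs, with rates as fractions
# from 0 to 1. Invented for illustration, NOT taken from the course dataset.
sample_rates = [
    ("Rome", 0.2), ("Rome", 0.4), ("Rome", 0.6), ("Rome", 0.8), ("Rome", 1.0),
    ("Sydney", 0.9), ("Sydney", 0.7), ("Sydney", 1.0), ("Sydney", 0.8), ("Sydney", 0.6),
]


def box_plot_stats(rows):
    """Return the per-city numbers that a box plot summarizes."""
    by_city = defaultdict(list)
    for city, rate in rows:
        by_city[city].append(rate)

    stats = {}
    for city, rates in by_city.items():
        # n=4 yields the three quartile cut points: Q1, median (Q2), Q3.
        q1, q2, q3 = quantiles(rates, n=4, method="inclusive")
        stats[city] = {
            "min": min(rates),
            "q1": q1,
            "median": q2,
            "q3": q3,
            "max": max(rates),
            "iqr": q3 - q1,  # interquartile range: the full height of the box
        }
    return stats


if __name__ == "__main__":
    for city, s in box_plot_stats(sample_rates).items():
        print(city, s)
```

In terms of the visual described above, the dark area spans `q1` to `median`, the light area spans `median` to `q3`, and `iqr` is the combined height of both.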

2. Let's practice!