Initial EDA of AirBnB listings

1. Initial EDA of AirBnB listings

Let’s dive into the AirBnB dataset to understand more about the first three steps of the EDA process. First, I’ll import the CSV file into Power BI. No transformations are needed at this point, so I’ll just click “Load”. Now, let’s become familiar with the variables and their data types by looking at the table in the Data view. We see “host_acceptance_rate” and “host_total_listings_count”, two numeric variables, as well as “neighbourhood” and “city”, two categorical variables.
I’m curious how many properties a host has, so let’s analyze “host_total_listings_count” a bit further. Before doing so, let’s identify any missing values. In the Report view, I can create a card visualization showing a distinct count of “listing_id”. Then I add “host_total_listings_count” as a page-level filter and select “Blank” under Basic filtering. The number on the card is now the total count of missing values: there are 8 missing values for this variable.
We can investigate whether this is specific to any aspect of our dataset by exploring missing values across variables. Create a new table on the page, again adding a distinct count of “listing_id”. I suspect the missing values are related to the city, so I’ll add “city” as well. And I was correct! All 8 listings with missing values are in Paris.
Before we address these missing values, let’s add a table with the descriptive statistics on a new page. Create a new table and add the distinct count of “listing_id”, the median of “host_total_listings_count”, and the average of “host_total_listings_count”. In the table, we can see the median is 1 and the average is 12. Typically, when the median is much lower than the average, the distribution of the variable is likely right-skewed. If you remember from the previous video, visually that is represented by a long, thin tail towards the right of a histogram. You will learn how to represent this graphically with histograms and box plots in the next lesson.
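To make the two checks above concrete outside Power BI, here is a minimal Python sketch of the same logic: count missing values by city, then compare the median and the average. The rows are hypothetical toy values, not the real AirBnB data, and the variable names are only illustrative.

```python
from collections import Counter
from statistics import mean, median

# Hypothetical toy rows, NOT the real AirBnB data:
# (listing_id, city, host_total_listings_count); None marks a missing value.
listings = [
    (1, "Paris", 1),
    (2, "Paris", None),
    (3, "Paris", 2),
    (4, "Rome", 1),
    (5, "Rome", 1),
    (6, "Rome", 60),
    (7, "Paris", None),
]

# Count the listings with a missing value, grouped by city.
missing_by_city = Counter(city for _, city, count in listings if count is None)

# Descriptive statistics over the non-missing values only.
known = [count for _, _, count in listings if count is not None]
med = median(known)
avg = mean(known)

print(dict(missing_by_city))
print(med, avg)
# An average well above the median hints at a right-skewed distribution.
print("right-skew hint:", avg > med)
```

In this toy data one large value (60) pulls the average far above the median, which is the same pattern as the median of 1 and average of 12 seen in the table.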
What about Paris specifically? I can add another page filter to show just the listings in Paris. The median is still 1, and the average is 6.33, so the distribution is still right-skewed, although the average is lower. Now what happens if we use imputation to change the missing values to the median value for Paris? To do this, create a new calculated column named “updated host total listing count”. Then write a DAX function using IF: if the value of host_total_listings_count is blank, set the value to 1 (the median, pulled from the table); otherwise, keep the current value of host_total_listings_count. If we add the median and average of the new updated host listings count variable to the table, we can see that the median stays the same but the average drops slightly. This makes sense: median imputation adds more rows at the median, which sits below the average, so it pulls the average down while leaving the median unchanged. If there were more missing values to impute, the effect on the average would be even more noticeable. Now it’s your turn to explore the price of these listings.
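The median imputation described above can be sketched in a few lines of Python. This is only an illustration of the IF logic, not the DAX formula itself, and the values are hypothetical toy numbers rather than the real Paris listings.

```python
from statistics import mean, median

# Hypothetical toy values, NOT the real Paris data; None marks a missing count.
paris_counts = [1, 1, 1, 2, 25, None, None]

# Median of the observed (non-missing) values, used as the fill value.
known = [c for c in paris_counts if c is not None]
fill_value = median(known)

# Same idea as the IF logic: use the median when blank, otherwise keep the value.
updated = [fill_value if c is None else c for c in paris_counts]

print("before:", median(known), mean(known))
print("after: ", median(updated), mean(updated))
```

Because every imputed row lands on the median, the median does not move, while the average shifts towards it.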

2. Let's practice!
