Histograms and outliers in AirBnB listings
1. Histograms and outliers in AirBnB listings
With missing values addressed, and some basic understanding of variables with descriptive statistics, we can now move on to the next step within the EDA process - visualizing distributions and identifying outliers. We saw most hosts have just one listing. How many listings have 1 bedroom? 2 bedrooms? To get an answer, I’ll create a new histogram, on a new page, using a clustered column chart and first add a distinct count of all listing IDs and then the number of bedrooms to the axis. Most listings only have 1 bedroom; 1 has 11 bedrooms! Ultimately, we are also interested in understanding the ratings of these listings and what characteristics influence this value. Let’s look at the distribution of the variable, review_scores_rating. I’ll again add a clustered column chart, with listing_id in the “Values” as a distinct count, and the reviews_scores_rating on the axis. As you can see, doing it in this way creates a histogram but is a bit too granular for our purposes. So, we can use binning, or grouping, within Power BI to make it a little bit more higher level. To do so, click on “More options” in the review_scores_rating column then “new group”. In the pop-up, in “Bin Type”, select “Number of Bins”. Power BI provides a default, 39. I like to start with 25 then modify from there based on the results in the histogram and level of granularity. I’ll update the existing histogram, by selecting it, then adding the new binned review_scores_rating and removing the old one. There is a more clear pattern of ratings being between 75 to 100. With a long tail down to the left all the way to zero. You also see a potential outlier of -50 and about 150 on the right. We can use the IQR method discussed in the previous video to add quantitative bounds to the histogram. To do so, I’ll create new calculated columns. The first will be called “25_percentile”. The DAX function will use “PERCENTILE.INC()” which means it will include all the values in the column. Add the airbnb review_scores_rating column and set “k”, the percentile number to 0.25. The next column will be called “75_percentile”. We use the same DAX formula and column review_scores_rating. But this time, “k” will be set to 0.75. Next, we will create the interquartile range, or again the difference between the 75th percentile and 25th. This formula will be the “75_percentile” column minus the “25_percentile” column. Before moving on, let’s take a look at these columns in the Data view. You’ll notice, they are the same number for the entire dataset, which is intended. Finally, with the IQR calculated, we can now create columns for the lower and upper boundaries for identifying outliers. Again, I’ll create new columns. The first will be called “lower_IQR_threshold” and will be equal to “25_percentile” minus 1.5 time the IQR value. We’ll do the same thing for the upper IQR threshold. This time, using the 75_percentile and adding 1.5 times the IQR value. We can add these thresholds to the histogram chart as lines to be indicators. To do so, select the histogram, go to the “Analytics” pane and under “X-Axis Constant Line” click “+ Add”. We will add the first first the lower threshold value, using “Average” as the summarization. We could select minimum and maximum and get the same intended value. Going back to “add”, we can do the same for the upper threshold value. Selecting “average”. With these indicator lines on the histogram, we can now see this value is a clear outlier as well as this one on the right. To address these, I’ll use winsorizing to replace these lower outliers to the 5th percentile and these upper outliers to the 95th percentile. I’ll create a new modified review_scores_rating column using nested IF statements. I’ll call it modified_ratings. This is equal to “IF reviews_scores_rating value is less than the 5th percentile, then set the value equal to the 5th percentile.” If not, I’ll use a second “IF” statment. If the value is greater than the 95th percentile, set the value to the 95th percentile. Otherwise, keep the same value. Let’s see what the resulting distribution looks like. I’ll create a new group for the modified_ratings variable. Going to “More options”, new group, changing the “Bin Type” to “Number of bins” and again setting to 25. I’ll add this new variable to the axis. Great! Now the outliers at -50 and 150 are addressed. It’s your turn to build histograms, identify outliers, and address any which may be present in the data.2. Let's practice!
Create Your Free Account
or
By continuing, you accept our Terms of Use, our Privacy Policy and that your data is stored in the USA.