Visualizing distributions

1. Visualizing distributions

Welcome back! In this video, we will explore the visualization of distributions using histograms. Let's dive right in and get started!

2. Onion and wheat prices in Kerala, India

Throughout this chapter, we'll investigate the prices of different commodities in India. In this video, we analyze a dataset containing the prices per kilogram of onions and wheat in centers in Kerala, a state in south-west India. A natural first step is to calculate the mean price by commodity, which shows that onions are more expensive on average. A single summary number hides the shape of the data, though, so we turn to visual representations to gain a deeper understanding.
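As a minimal sketch of the mean-by-commodity calculation, assume a DataFrame named `df` with columns `commodity` and `price`. Both the names and the toy values are hypothetical stand-ins, since the real dataset is not shown in the transcript.

```julia
using DataFrames, Statistics

# Hypothetical toy values in rupees per kilogram, not the real Kerala data.
df = DataFrame(
    commodity = ["Onion", "Onion", "Onion", "Wheat", "Wheat", "Wheat"],
    price     = [20.0, 25.0, 60.0, 22.0, 24.0, 26.0],
)

# Group the rows by commodity and take the mean price within each group.
mean_prices = combine(groupby(df, :commodity), :price => mean => :mean_price)
```

In this toy data, one high onion price is enough to pull the onion mean above the wheat mean.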

3. Visualizing distributions with histograms

A histogram is a visual representation of the distribution of a variable. It uses bars to display the frequency of values within different ranges, called bins. The x-axis represents the variable of interest, while the y-axis represents the frequency within each bin. Histograms help us understand patterns, outliers, and the overall shape of the data distribution. Here, we have a histogram of the combined onion and wheat prices in rupees. The distribution is right-skewed: most prices cluster at the low end, with a tail extending toward higher prices.

4. Distribution of onion and wheat prices

To plot a histogram using Plots-dot-jl, we use the histogram function. The first argument is the numerical column whose distribution we want to visualize, which, in this case, is the price column. Next, we can add a label and set the color of the bars using keyword arguments. Here, we use darkseagreen1. This code creates the histogram of prices shown on the previous slide.
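The slide's code is not part of the transcript, so the following is only a sketch of what such a call might look like. The DataFrame `df`, its `price` column, and the synthetic right-skewed values are assumptions made for illustration.

```julia
using DataFrames, Plots, Random

# Synthetic, right-skewed stand-in for the price column (rupees per kg).
Random.seed!(1)
df = DataFrame(price = 20 .+ 15 .* randexp(500))

# First argument: the numerical column. Label and color are keyword arguments.
p = histogram(df.price, label="Price", color=:darkseagreen1)
```

In a script, `display(p)` shows the figure and `savefig(p, "prices.png")` writes it to a file; the file name here is just an example.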

5. Number of bins

We assign an integer value to the bins parameter to choose the number of bins in the histogram. In this case, we use 20 bins. Plots-dot-jl will interpret this as an approximate number to aim for, so the final count may differ slightly. Using fewer bins gives a smoother representation of the distribution, as each bar covers a wider range of values and therefore collects more counts.
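A short sketch of the integer form of `bins`, again on synthetic data with a hypothetical `price` column. The second plot uses a deliberately small bin count only to contrast the level of smoothing.

```julia
using DataFrames, Plots, Random

# Synthetic stand-in for the price column (rupees per kg).
Random.seed!(1)
df = DataFrame(price = 20 .+ 15 .* randexp(500))

# An integer is treated as an approximate target for the number of bins.
p_20 = histogram(df.price, bins=20, label="About 20 bins")

# Fewer bins: wider bars, more counts per bar, smoother outline.
p_5 = histogram(df.price, bins=5, label="About 5 bins")
```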

6. Number of bins

For finer control over the bins, we can set bins to a vector of values, which are used as the bin edges. For example, a range vector with 75 elements defines 74 bins. Using more bins results in a less smooth but more detailed representation of the distribution. Finding the right balance between detail and smoothness is essential to accurately represent the data trends without overlooking critical information.
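The vector form can be sketched as follows. The data and the `price` column are synthetic assumptions, and the choice to extend the last edge slightly past the maximum is mine, made so that the largest value cannot land exactly on the final edge.

```julia
using DataFrames, Plots, Random

# Synthetic stand-in for the price column (rupees per kg).
Random.seed!(1)
df = DataFrame(price = 20 .+ 15 .* randexp(500))

# 75 evenly spaced values used as bin edges, which gives 74 bins.
edges = range(minimum(df.price), maximum(df.price) + 1, length=75)

p = histogram(df.price, bins=edges, label="Price")
```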

7. Normalized histogram

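The data in a histogram can be normalized so that the y-axis no longer shows raw frequency. Setting the normalize argument to true rescales the bars to a probability density, where the total area of the bars equals one. If we instead want each bar's height to be the probability of its bin, so that the heights sum to one, we can set normalize to :probability.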
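A sketch of two normalization modes, based on my reading of the Plots.jl `normalize` attribute. The data is synthetic and the `price` column name is an assumption.

```julia
using DataFrames, Plots, Random

# Synthetic stand-in for the price column (rupees per kg).
Random.seed!(1)
df = DataFrame(price = 20 .+ 15 .* randexp(500))

# true (same as :pdf): bar areas sum to one, y-axis is a density.
p_pdf = histogram(df.price, normalize=true, label="Density")

# :probability: bar heights sum to one, y-axis is a probability per bin.
p_prob = histogram(df.price, normalize=:probability, label="Probability")
```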

8. Probability distribution

We can also include a density plot. For that, we need the StatsPlots-dot-jl package, which provides a high-level interface for statistical plotting. We call the density function on the price column of our DataFrame to add a smooth curve that estimates the probability density of prices.
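One way this might look is sketched below, overlaying the curve on a normalized histogram so both share a density scale. The data is synthetic, and using `density!` to draw on top of the existing plot is an assumption about how the slide builds the figure.

```julia
using DataFrames, Random, StatsPlots  # StatsPlots re-exports Plots

# Synthetic stand-in for the price column (rupees per kg).
Random.seed!(1)
df = DataFrame(price = 20 .+ 15 .* randexp(500))

# Normalized histogram, so bars and curve are on the same scale.
p = histogram(df.price, normalize=true, label="Histogram", color=:darkseagreen1)

# density! adds a kernel density estimate to the current plot.
density!(df.price, label="Density", linewidth=2)
```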

9. Prices per commodity

To visualize the distributions of onion and wheat prices separately, we use the groupedhist function from StatsPlots-dot-jl, passing the price column as the first argument. The group argument contains the column of categories we want to visualize separately, in this case, the commodities. Following these steps, we create a grouped histogram where each bin displays side-by-side bars for each group.
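A sketch of a grouped histogram on synthetic two-commodity data. The column names `commodity` and `price` are hypothetical, and `bar_position=:dodge` is spelled out here to make the side-by-side layout explicit.

```julia
using DataFrames, Random, StatsPlots

# Synthetic data: right-skewed "Onion" prices, roughly symmetric "Wheat" prices.
Random.seed!(1)
df = DataFrame(
    commodity = repeat(["Onion", "Wheat"], inner=250),
    price     = vcat(20 .+ 15 .* randexp(250), 25 .+ 3 .* randn(250)),
)

# One set of bars per commodity, placed side by side within each bin.
p = groupedhist(df.price, group=df.commodity, bar_position=:dodge)
```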

10. Stacked histogram

We can also set the bar_position argument to stack to have the bars representing onion prices stacked on top of those representing wheat prices.
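The stacked variant differs only in the `bar_position` value. As before, this is a sketch on synthetic data with hypothetical column names.

```julia
using DataFrames, Random, StatsPlots

# Synthetic data: right-skewed "Onion" prices, roughly symmetric "Wheat" prices.
Random.seed!(1)
df = DataFrame(
    commodity = repeat(["Onion", "Wheat"], inner=250),
    price     = vcat(20 .+ 15 .* randexp(250), 25 .+ 3 .* randn(250)),
)

# :stack places the bars of the groups on top of each other in every bin.
p = groupedhist(df.price, group=df.commodity, bar_position=:stack)
```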

11. A subtle difference

Look at the histogram we generated. Notice that the two distributions peak at similar prices, even though onions are more expensive on average. The higher mean comes from the long tail of high onion prices. To confirm this, we can examine the median prices, which are less sensitive to extreme values and are very similar for both onions and wheat. Hence, the tail of the distribution accounts for the difference in means. Our histogram has provided valuable insights that were not evident from basic statistics alone.
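The mean-versus-median check can be sketched on a small toy table. The values are hypothetical and chosen only so that one high onion price creates a tail; they are not the Kerala figures.

```julia
using DataFrames, Statistics

# Hypothetical toy values in rupees per kilogram, not the real Kerala data.
df = DataFrame(
    commodity = ["Onion", "Onion", "Onion", "Wheat", "Wheat", "Wheat"],
    price     = [20.0, 25.0, 60.0, 22.0, 24.0, 26.0],
)

# Mean and median price per commodity, side by side.
summary_stats = combine(
    groupby(df, :commodity),
    :price => mean   => :mean_price,
    :price => median => :median_price,
)
```

In this toy table the medians are close while the means differ, which mirrors the pattern described above.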

12. Let's practice!

That concludes this video! Enjoy applying your skills and creating insightful histograms in the upcoming exercises.