1. Multiple plots from DataFrames
In this video, we will learn how to incorporate multiple plots within a figure using data from DataFrames. Let's get started!
2. Multiple variables in a plot
We already know how to add two variables to a single plot without the DataFrame recipe, but let's look at an example with the recipe.
First, we define a violin plot of insurance charges categorized by the sex of the policyholder, using the DataFrame plot recipe and customizing the plot.
Next, we can incorporate a box plot into the same figure using the same recipe notation, but this time appending an exclamation mark to the plotting function's name. The exclamation mark tells Plots-dot-jl to add to the existing plot rather than create a new one, so the result is a single plot showing both the violin and box plots.
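As a rough sketch of these two steps, the snippet below uses a small made-up DataFrame in place of the insurance data. The column names Sex and Charges, the toy values, and the styling choices are assumptions for illustration, not the course's exact code.

```julia
using DataFrames, StatsPlots

# Hypothetical stand-in for the insurance data; names and values are assumptions.
insurance = DataFrame(
    Sex = repeat(["female", "male"], 10),
    Charges = 1000.0 .* (1:20),
)

# Violin plot of charges per sex, using the @df recipe.
@df insurance violin(:Sex, :Charges, label=false, fillcolor=:lightblue,
    xlabel="Sex", ylabel="Charges")

# The trailing ! adds the box plot to the current plot instead of starting a new one.
p = @df insurance boxplot!(:Sex, :Charges, label=false, fillalpha=0.5, bar_width=0.2)
```

Inside the @df call, symbols that match column names are replaced by the column data, while other symbols, such as the color name, are passed through unchanged.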
3. Categorical data and layouts
Now, let's revisit our insurance dataset and explore its categorical columns, best visualized using side-by-side plots in a grid format.
To achieve this, we can create multiple subplots by utilizing the layout argument in Plots-dot-jl. Importantly, this approach is compatible with the recipe using the @df notation.
4. Layouts with DataFrames
Now, let's delve into an example.
To begin, we create a violin plot of insurance charges, grouping the data by sex and further distinguishing geographical regions. We add customizations to enhance the visualization.
We pass the layout argument to the plotting function to establish the desired layout. It can be set as an integer or a tuple representing the desired grid structure.
In this case, we opt for a tuple that generates a two-by-two grid of violin plots.
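A minimal sketch of such a call is shown here, again on made-up data. The column names Sex, Region, and Charges and the toy values are assumptions, and the exact arguments used in the course may differ.

```julia
using DataFrames, StatsPlots

# Hypothetical stand-in for the insurance data; names and values are assumptions.
insurance = DataFrame(
    Sex = repeat(["female", "male"], 20),
    Region = repeat(["northeast", "northwest", "southeast", "southwest"], inner=10),
    Charges = 1000.0 .* (1:40),
)

# group=:Region creates one series per region; layout=(2, 2) requests a
# two-by-two grid, so each region's series is drawn in its own subplot.
# An integer such as layout=4 would let Plots.jl pick the grid shape itself.
p = @df insurance violin(:Sex, :Charges, group=:Region, layout=(2, 2),
    legend=false, xlabel="Sex", ylabel="Charges")
```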
5. Adding chains to the mix
Chains offer a convenient way to manipulate data before plotting.
In this example, we define a chain for the insurance DataFrame, transforming the smoker column's "yes" and "no" values to the numerical representation of 100 or 0.
We then group the data by sex and children, calculating the mean of the smoker column to obtain the percentage of smokers in each group.
Finally, we add a bar chart that plots the percentage of smokers in the Smoker_mean column against the number of children, grouped by sex.
With a single chain, we seamlessly handle data manipulation and visualization, generating a grid of subplots for easy comparison.
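The chain described above might look roughly like the following sketch. It assumes the Chain.jl package's @chain macro, columns named Sex, Children, and Smoker, and a tiny made-up dataset; the course's own code may be written differently.

```julia
using Chain, DataFrames, Statistics, StatsPlots

# Hypothetical stand-in for the insurance data; names and values are assumptions.
insurance = DataFrame(
    Sex = repeat(["female", "male"], inner=6),
    Children = repeat([0, 1, 2], 4),
    Smoker = ["yes", "no", "no", "no", "yes", "no",
              "no", "no", "yes", "yes", "yes", "no"],
)

p = @chain insurance begin
    # Recode "yes"/"no" as 100/0 so that a group mean reads as a percentage.
    transform(:Smoker => ByRow(s -> s == "yes" ? 100 : 0) => :Smoker)
    groupby([:Sex, :Children])
    # combine names the result column Smoker_mean automatically.
    combine(:Smoker => mean)
    # @chain passes the summarized table to @df as its first argument.
    @df bar(:Children, :Smoker_mean, group=:Sex, layout=2, legend=false,
        title=["female" "male"], xlabel="Children", ylabel="Smokers (%)")
end
```

Because the plotting call is the last step of the chain, no intermediate table needs a name; the value of the chain is the finished figure.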
6. Correlation matrix plots
A common way of visualizing numerical data in DataFrames is correlation matrix plots. Here, we show a correlation matrix plot of the age and BMI columns in the insurance DataFrame.
7. Correlation matrix plots
The correlation matrix plot combines three kinds of visualization in one figure.
On the diagonal, histograms illustrate the distribution of each variable individually.
Above the diagonal, two-dimensional histograms showcase the relationship between each pair of variables.
Below the diagonal, scatter plots present the relationship between each pair of variables, accompanied by regression lines.
8. Correlation matrix plots in StatsPlots.jl
To generate a correlation matrix plot in StatsPlots-dot-jl, we invoke the corrplot function.
The first argument passed is a row vector comprising the names of the numerical columns as symbols.
To customize the correlation matrix plot further, we can pass color schemes as symbols; these differ from the named colors we have been using. In this case, we set markercolor to thermal and fillcolor to acton. For additional information on color schemes, please refer to the Plots-dot-jl documentation provided below.
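Put together, the call could be sketched as follows. The column names Age and BMI and the generated values are assumptions standing in for the real insurance data.

```julia
using DataFrames, StatsPlots

# Hypothetical stand-in for the insurance data; names and values are assumptions.
insurance = DataFrame(
    Age = Float64.(18 .+ mod.(7 .* (1:50), 47)),
    BMI = Float64.(18 .+ mod.(11 .* (1:50), 23)),
)

# The row vector [:Age :BMI] (spaces, not commas) selects the columns to compare.
# :thermal and :acton are color scheme names rather than single named colors.
p = @df insurance corrplot([:Age :BMI], markercolor=:thermal, fillcolor=:acton,
    grid=false)
```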
9. Correlation matrix plots in StatsPlots.jl
A correlation matrix plot can accommodate any number of numerical columns. We select the age, BMI, children, and charges columns for this example.
When specifying n columns, a plot grid of size n by n will be generated. In this case, we have a four-by-four grid to visualize the correlations between these variables.
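To illustrate the n-by-n rule, the sketch below passes four columns and counts the resulting subplots. The column names and generated values are assumptions, not the actual insurance data.

```julia
using DataFrames, StatsPlots

# Hypothetical stand-in for the insurance data; names and values are assumptions.
insurance = DataFrame(
    Age = Float64.(18 .+ mod.(7 .* (1:60), 47)),
    BMI = Float64.(18 .+ mod.(11 .* (1:60), 23)),
    Children = Float64.(mod.(1:60, 4)),
    Charges = 1000.0 .* (1 .+ mod.(13 .* (1:60), 31)),
)

# Four columns in the row vector give a four-by-four grid of subplots.
p = @df insurance corrplot([:Age :BMI :Children :Charges], grid=false,
    size=(800, 800))

# Number of subplots in the figure (n * n for n columns).
n_panels = length(p.subplots)
```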
10. Let's practice!
In this video, we learned how to construct intricate visualizations using data from DataFrames. Mastering this skill takes practice, so let's put your knowledge to work in the exercises!