1. Plotting data in DataFrames
Welcome back! It is time to learn how to visualize data contained in one of Julia's most valuable data structures, namely DataFrames. Let's get started!
2. Insurance dataset
This chapter analyzes a DataFrame of insurance policyholders. It contains personal information, including age, sex, BMI, and number of children, along with each policyholder's yearly premium charges.
Throughout this course, we have utilized DataFrames, which offer flexibility and efficiency for working with data.
One essential tool we have yet to use is a plotting recipe from StatsPlots.jl specifically designed to work with DataFrames.
This recipe introduces the @df notation, which we now explore to understand its benefits.
3. Extracting arrays from DataFrame
Before introducing the DataFrame recipe, let's pause to review how we have been plotting data from DataFrames thus far.
Suppose we want to plot the average yearly charges, grouped by region and smoker status.
First, we group the insurance DataFrame by these two columns.
Next, we employ the combine function to compute the mean charges.
Extracting data from a DataFrame column provides an array of the respective data; for example, the region column yields an array of regions as strings.
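As a rough sketch, the grouping and extraction steps might look like the following. The small insurance DataFrame here is a made-up stand-in for the real dataset, and the variable names (grouped, mean_charges, regions, smokers, charges) are our own choices for illustration.

```julia
using DataFrames
using Statistics  # provides mean

# Hypothetical stand-in for the insurance data; the values are made up
insurance = DataFrame(
    region  = ["northeast", "northeast", "southwest", "southwest", "northeast", "southwest"],
    smoker  = ["yes", "no", "yes", "no", "no", "yes"],
    charges = [30000.0, 8000.0, 32000.0, 6000.0, 10000.0, 28000.0],
)

# Group the rows by region and smoker status
grouped = groupby(insurance, [:region, :smoker])

# Compute the mean charges within each group
mean_charges = combine(grouped, :charges => mean => :mean_charges)

# Extract plain arrays from the columns of the result
regions = mean_charges.region        # Vector of Strings
smokers = mean_charges.smoker        # Vector of Strings
charges = mean_charges.mean_charges  # Vector of Float64
```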
4. Plotting data in arrays
We can use the groupedbar function to visualize the data by providing the arrays we created, showing the regions on the x-axis, the mean charges on the y-axis, and grouping by smoker status.
We then customize the plot however we like.
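A minimal sketch of the array-based plot could look like this. The three arrays are written out by hand with made-up values so the example runs on its own; in practice they would come from the DataFrame columns. The axis labels and title are just one possible customization.

```julia
using StatsPlots

# Illustrative arrays with made-up values, one entry per (region, smoker) group
regions = ["northeast", "northeast", "southwest", "southwest"]
smokers = ["yes", "no", "yes", "no"]
charges = [30000.0, 9000.0, 30000.0, 6000.0]

# Regions on the x-axis, mean charges on the y-axis, bars grouped by smoker status
p = groupedbar(regions, charges, group = smokers,
    xlabel = "Region",
    ylabel = "Mean charges",
    title = "Mean yearly charges by region")
```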
It's worth noting that we only used arrays to create the plot. The source of the data, whether it originated from a DataFrame or not, didn't matter. We had to extract arrays from the DataFrame columns before plotting. There must be a more efficient approach!
5. Plotting from DataFrames directly
Fortunately, the StatsPlots.jl recipe plots data directly from DataFrames.
To use it, we first write the @df macro followed by the DataFrame's name and a call to our plotting function, in this case groupedbar.
Next, we specify the column names to be plotted on the x and y-axes.
We can also pass a column name to the group argument.
The rest of the code is unchanged. This code produces the same plot without extracting arrays from the DataFrame.
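Here is a hedged sketch of the @df version. The DataFrame charges_by_group and its values are hypothetical and stand in for the grouped result computed earlier; only the column names :region, :smoker, and :mean_charges matter for the call.

```julia
using DataFrames
using StatsPlots

# Hypothetical summary table with made-up values
charges_by_group = DataFrame(
    region       = ["northeast", "northeast", "southwest", "southwest"],
    smoker       = ["yes", "no", "yes", "no"],
    mean_charges = [30000.0, 9000.0, 30000.0, 6000.0],
)

# @df replaces the column symbols with the matching columns of charges_by_group
p = @df charges_by_group groupedbar(:region, :mean_charges, group = :smoker,
    xlabel = "Region",
    ylabel = "Mean charges",
    title = "Mean yearly charges by region")
```

Inside the call, symbols such as :region refer to columns, while ordinary keyword values like the axis labels are passed through as usual.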
6. Side-by-side comparison
Let's compare these two approaches side-by-side.
The DataFrame recipe requires the @df notation preceding the plotting function call.
Without the recipe, we need to pass the arguments as arrays. In contrast, we can directly pass the column names with the recipe.
The rest of the code is the same in both cases. Notice how the recipe makes the code cleaner and avoids repeating the DataFrame's name for every column.
7. Chaining DataFrame commands
Another benefit of the DataFrame recipe is its compatibility with chaining. Let's quickly review how to chain DataFrame operations.
Previously, we grouped the insurance DataFrame by region and smoker status and calculated the mean charges based on these groups.
With chaining, these operations can be combined into a single expression. For that, we use the @chain macro followed by the name of the DataFrame. Inside a begin-end block, we list the operations to be performed in order, in this case groupby and combine. Each operation receives the result of the previous one as its first argument.
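A small sketch of such a chain follows. It assumes the @chain macro is loaded from the Chain.jl package, and it reuses a made-up stand-in for the insurance DataFrame.

```julia
using DataFrames
using Statistics  # provides mean
using Chain       # assumption: @chain comes from Chain.jl

# Hypothetical stand-in for the insurance data; the values are made up
insurance = DataFrame(
    region  = ["northeast", "northeast", "southwest", "southwest", "northeast", "southwest"],
    smoker  = ["yes", "no", "yes", "no", "no", "yes"],
    charges = [30000.0, 8000.0, 32000.0, 6000.0, 10000.0, 28000.0],
)

# Each step receives the previous result as its first argument
mean_charges = @chain insurance begin
    groupby([:region, :smoker])
    combine(:charges => mean => :mean_charges)
end
```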
8. Plotting chain
It is also possible to include a plotting function within a chain.
Starting from the previous chain, we add the plotting call to the begin-end block so that it is applied to the result of the data manipulation steps. Inside the chain, the @df macro is followed directly by the plotting function, without naming the DataFrame explicitly. This works because the chain supplies the result of the previous step to @df as its DataFrame.
This generates the same plot in a single operation; how efficient!
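Putting it together, a plotting chain might be sketched as follows. As before, the insurance DataFrame is a made-up stand-in, @chain is assumed to come from Chain.jl, and the labels are only one possible customization.

```julia
using DataFrames
using Statistics  # provides mean
using Chain       # assumption: @chain comes from Chain.jl
using StatsPlots

# Hypothetical stand-in for the insurance data; the values are made up
insurance = DataFrame(
    region  = ["northeast", "northeast", "southwest", "southwest", "northeast", "southwest"],
    smoker  = ["yes", "no", "yes", "no", "no", "yes"],
    charges = [30000.0, 8000.0, 32000.0, 6000.0, 10000.0, 28000.0],
)

# Group, summarize, and plot in one expression.
# The chain inserts the summarized DataFrame as the first argument of @df.
p = @chain insurance begin
    groupby([:region, :smoker])
    combine(:charges => mean => :mean_charges)
    @df groupedbar(:region, :mean_charges, group = :smoker,
        xlabel = "Region",
        ylabel = "Mean charges")
end
```

The value of the chain is the value of its last step, so p holds the plot rather than the summarized DataFrame.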
9. Let's practice!
In this video, we learned how to plot directly from DataFrames with the @df recipe and how to combine it with chaining. Let's get plotting!