1. Introduction
Hi, and welcome to the first course in DataCamp's data visualization with ggplot2 series!
2. Your instructor - Rick Scavetta
My name is Rick Scavetta and I'll be the instructor for this series.
I've been training scientists on how to better understand and visualize their data since 2012. I'm very excited to bring my experience to DataCamp.
So what is data viz?
3. Data visualization & data science
Data visualization is an essential skill for data scientists. It combines statistics and design in meaningful and appropriate ways.
On the one hand, data viz is a form of graphical data analysis, emphasizing accurate representation and interpretation of data.
On the other hand, data viz relies on good design choices, not only to make our plots attractive, but also to aid both the understanding and communication of results.
On top of that, there is an element of creativity, since at its heart, data viz is a form of visual communication.
4. Exploratory versus explanatory
It's important to understand the distinction between exploratory and explanatory visualizations.
Exploratory visualizations are easily generated, data-heavy and intended for a small specialist audience, for example yourself and your colleagues. Their primary purpose is graphical data analysis.
Explanatory visualizations are labor-intensive, data-specific and intended for a broader audience, for example in publications or presentations. They are part of the communication process.
As a data scientist, it's essential that you can quickly explore data, but you'll also be tasked with explaining your results to stakeholders.
Good design begins with thinking about the audience - and sometimes that just means ourselves.
5. MASS::mammals
This data set contains the average brain and body weights of 62 land mammals. To understand the relationship here, the most obvious first step is to make a scatter plot, like this one.
6. A scatter plot
Two mammals, the African and Asian elephants, both have very large brain and body weights, leading to a positive skew on both axes.
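A scatter plot like the one described here can be sketched as follows. This is a minimal example, not necessarily the code behind the slide: it assumes the ggplot2 and MASS packages are installed, and the object name p_scatter is just an illustrative choice.

```r
library(ggplot2)

# mammals comes from the MASS package: one row per species, with
# body (body weight in kg) and brain (brain weight in g).
data(mammals, package = "MASS")

# Map body weight to x and brain weight to y, then draw one point per species.
p_scatter <- ggplot(mammals, aes(x = body, y = brain)) +
  geom_point()

p_scatter
```

On the untransformed axes, most species crowd into the lower-left corner while the two elephants sit far away from everything else.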
7. Explore with a linear model
Here, applying a linear model is a poor choice since a few extreme values have a large influence.
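One way to see this for yourself is to overlay a linear fit and check which observations have the most leverage. The sketch below assumes ggplot2 and MASS are installed; the names p_lm, fit_raw and top_leverage are illustrative, and leverage (via hatvalues()) is only one of several possible diagnostics.

```r
library(ggplot2)
data(mammals, package = "MASS")

# Scatter plot with an ordinary least-squares line on the raw scale.
p_lm <- ggplot(mammals, aes(x = body, y = brain)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE)

p_lm

# Fit the same model directly and list the two species with the highest leverage.
fit_raw <- lm(brain ~ body, data = mammals)
top_leverage <- names(sort(hatvalues(fit_raw), decreasing = TRUE))[1:2]
top_leverage
```

Leverage depends only on how far a species' body weight lies from the mean, so the heaviest animals dominate where the fitted line ends up.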
8. Explore: fine-tuning
A log transformation of both variables allows for a better fit.
So, although we began with a rough exploratory plot, it informed us about our data and led us to a meaningful result.
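A log transformation of both axes can be sketched as below. This assumes ggplot2 and MASS are installed and uses base-10 logs via scale_x_log10() and scale_y_log10(); the slide may have used a different base or approach, and the names p_log and fit_log are illustrative.

```r
library(ggplot2)
data(mammals, package = "MASS")

# Log-transform both axes. Because the scales are transformed before the
# statistics are computed, geom_smooth() fits the line on the log10 values.
p_log <- ggplot(mammals, aes(x = body, y = brain)) +
  geom_point() +
  scale_x_log10() +
  scale_y_log10() +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE)

p_log

# The equivalent model fitted directly on the transformed variables.
fit_log <- lm(log10(brain) ~ log10(body), data = mammals)
coef(fit_log)
```

On the log scale the points spread out along the line instead of clustering in one corner, so no single species dominates the fit.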
9. Publication-ready plot
In the end, we'd probably want a cleaned-up explanatory plot.
10. Anscombe's plots
Here's a classic example from Francis Anscombe, first published in 1973.
When we imagine a linear model, as presented on this anonymous plot, we imagine that we are describing data that looks
11. Anscombe's plots
something like this. But this same model could be describing a very different set of data
12. Anscombe's plots
such as a parabolic relationship,
13. Anscombe's plots
which calls for a different model.
14. Anscombe's plots
Or it could be describing data in which an extreme value has a large effect,
15. Anscombe's plots
which becomes clear when the outlier is removed. And sometimes
16. Anscombe's plots
the model may be describing a relationship where in fact there is none at all
17. Anscombe's plots
because some extreme values may be incorrect.
18. Anscombe's plots
If we relied solely on the numerical output without plotting our data, we'd have missed distinct and interesting underlying trends.
We can see that data viz is rooted in statistics and graphical data analysis, but it's also a creative process that involves some amount of trial and error.
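The point about numerical output can be checked with the anscombe data set that ships with base R. The sketch below fits the same simple linear model to each of the four x/y pairs and then plots them side by side. It assumes ggplot2 is installed, and the helper names fits, coefs and long are illustrative.

```r
library(ggplot2)

# anscombe has columns x1..x4 and y1..y4, one pair per data set.
fits <- lapply(1:4, function(i) {
  lm(reformulate(paste0("x", i), response = paste0("y", i)), data = anscombe)
})

# Intercept and slope of each fit, one row per data set.
coefs <- t(sapply(fits, coef))
colnames(coefs) <- c("intercept", "slope")
round(coefs, 2)

# Reshape to long format (set, x, y) so the four sets can be faceted.
long <- do.call(rbind, lapply(1:4, function(i) {
  data.frame(set = i,
             x = anscombe[[paste0("x", i)]],
             y = anscombe[[paste0("y", i)]])
}))

p_anscombe <- ggplot(long, aes(x = x, y = y)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE) +
  facet_wrap(~ set)

p_anscombe
```

The four coefficient rows are practically indistinguishable, yet the faceted plot shows four clearly different patterns.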
19. Let's practice!
All right, enough examples. Let's get our fingers moving with some exercises.