Introduction to the data

1. Introduction to the data

Hello and welcome to this course on communicating with data in the tidyverse. My name is Timo Grossenbacher and I'll be your instructor.

2. This is me

Now, you can't see me, which is why I included a picture of myself. I work as a data journalist, which basically means I try to find stories in data sets. If you want to see concrete examples of my work and the R code I write, go to srfdata.github.io. You see, I use a lot of R code at work, and naturally, communicating results is an important part of it.

3. The last step in the Tidyverse process

And communicating is also part of the Tidyverse paradigm, which you have seen in previous courses covering the Tidyverse. Actually, it's the last step in the data science workflow. Sometimes, this step is a bit underestimated, but think about it: Your efforts might go completely unnoticed if you don't accurately and attractively communicate them. So we're not talking about exploratory visualization, we're talking about the communication of the most important points of your data science analysis.

4. What you are going to create

In this course, you are going to create the plot you see on the left side. Guess what? You are going to use ggplot2 for that, and only ggplot2. You already know this package from previous courses, but you probably didn't know that it can be tweaked and customized like this. Actually, this is what the plot looked like before applying a custom look to it. The end result is not only easier to read and understand, its aesthetics are also different and, as I would argue, more appealing.

5. Reporting in the Tidyverse

Apart from that, you are also going to create a report where you show the findings of your analysis, and embed your graphics. This is another cool thing about R – it can be used to automatically create custom and professional-looking reports like this one. Here you are going to learn how to do this.

6. The data you are going to work with

Throughout this course, you are going to work with two different data sets from the International Labour Organization, the ILO. Both contain indicators that are concerned with the international labour market. The first one gives working hours per week since 1980 for different countries. Each row in this data set represents the amount of weekly working hours per year per country.

7. The data you are going to work with

The other data set you are going to use contains another indicator, the so-called hourly compensation. Basically, it's the amount of compensation (that is, wages but also other benefits) that employees get for each working hour. In this data set, too, each row represents the amount of hourly compensation in US dollars per year per country. So for example, in 1980 people in Australia were given 8.44 USD of compensation for each working hour.

8. The inner_join() verb / function

As a first exercise, you are going to combine both data sets into one, so that each row shows both indicators for one country and one year. For this, you are going to use the inner_join verb (verbs are simply functions) from dplyr. With the inner_join function, rows from two data sets are matched based on a common key. Rows whose key does not appear in both data sets are dropped in this operation, so only matching rows are retained. Look at the example here: Only rows one and two are retained, because their keys exist in both data frames.
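
To make the matching behaviour concrete, here is a minimal sketch with two tiny toy data frames. The data frame names, column names, country labels, and values are all made up for illustration and are not the actual ILO data; the only thing taken as given is dplyr's inner_join function with its by argument.

```r
library(dplyr)

# Hypothetical toy data (NOT the real ILO data sets):
# weekly working hours per country and year
hours <- data.frame(
  country = c("A", "A", "B"),
  year = c(1980, 1981, 1980),
  working_hours = c(35, 34, 38)
)

# Hypothetical toy data: hourly compensation per country and year
compensation <- data.frame(
  country = c("A", "A", "C"),
  year = c(1980, 1981, 1980),
  hourly_compensation = c(8.4, 9.1, 5.0)
)

# Match rows on the combined key (country, year).
# Country "B" only exists in `hours` and country "C" only in `compensation`,
# so neither of them appears in the result.
joined <- inner_join(hours, compensation, by = c("country", "year"))

print(joined)
```

The result keeps only the two rows for country "A", each carrying both indicators. Naming the key columns explicitly in by is a choice made here for readability; it spells out which columns are used for matching.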

9. Let's do this!

Now, let's try this out and use the inner_join function to match both working hours and hourly compensation for each country and year.
