Stack Overflow questions

1. Stack Overflow questions

So far in this course, you've learned to use six dplyr joining verbs.

2. The joining verbs

You've also seen how they can be applied to combine data across a number of tables describing LEGO toys. For this last chapter, you're going to apply everything you've learned to a different dataset, to see how these joining verbs are useful in a variety of circumstances. Specifically, you'll be looking at Stack Overflow questions and answers about the programming language R.

3. Stack Overflow questions

If you've been programming for a while, you may have come across Stack Overflow questions online. Each question comes with a score based on people voting up or voting down, and can have several tags, including one for R. As you'll see in the next lesson, each question can also have one or more answers.

4. The questions table

The questions table contains each of the almost 300,000 Stack Overflow questions that are tagged with R, along with the date they were asked and their score. A positive score means people upvoted the question; a negative score means they downvoted it. Some of the most interesting information we could get is what other tags, besides R, are on each of these questions. There are tags like dplyr, ggplot2, tidyr, and others that you might have run into in this or other courses. But to get that information, you'll have to do some joining.

5. The question_tags and tags tables

The question tags table matches each question, based on an ID, to a tag, which also has an ID. Notice this is an intermediate table: it doesn't have any data on the question or the tag themselves. We saw this in tables like inventory sets in the LEGO dataset as well, and it's typical for data from web databases. The tags table, in turn, links tag IDs to tag names. You'll need to join both the questions table and the tags table to question tags to draw useful insights from the data. This can be done with a sequence of inner joins.
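To make the three-table layout concrete, here is a minimal sketch with toy data. The column names (id, creation_date, score, question_id, tag_id, tag_name) and all values are assumptions for illustration, not the actual course data.

```r
library(dplyr)

# Toy stand-ins for the three tables; column names and values are assumed.
questions <- tibble(
  id = c(1L, 2L, 3L),
  creation_date = as.Date(c("2019-01-01", "2019-01-02", "2019-01-03")),
  score = c(5L, -1L, 0L)
)

# Intermediate table: only IDs, no details about the question or the tag.
question_tags <- tibble(
  question_id = c(1L, 1L, 2L),
  tag_id = c(10L, 20L, 10L)
)

# Lookup table from tag ID to tag name.
tags <- tibble(
  id = c(10L, 20L),
  tag_name = c("dplyr", "ggplot2")
)

names(question_tags)
```

In this sketch, question_tags carries nothing but the two ID columns, which is why it has to be joined to the other two tables before it tells you anything.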

6. Joining question_tags with questions

First, you'd join questions with the question tags table, based on their question id.
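That first step might look like the following sketch. The toy tables and the column names id, question_id, and tag_id are assumptions; the real tables may be named differently.

```r
library(dplyr)

# Toy data with assumed column names.
questions <- tibble(
  id = c(1L, 2L, 3L),
  score = c(5L, -1L, 0L)
)
question_tags <- tibble(
  question_id = c(1L, 1L, 2L),
  tag_id = c(10L, 20L, 10L)
)

# Match the question's id to question_id in the intermediate table.
joined <- questions %>%
  inner_join(question_tags, by = c("id" = "question_id"))

joined
```

Because this is an inner join, the toy question with id 3, which has no rows in question_tags, does not appear in the result.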

7. Joining tags

Second, you'd join the tags table into the result, matching tag id to id. We save this as questions with tags. Notice that since questions can have multiple tags, there are 500,000 question-tag pairs in this table, even though there were only 300,000 questions to start.
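Putting both joins together, a sketch of the full sequence could look like this. The data is made up, and the column names are assumed as before; only the name questions_with_tags comes from the lesson.

```r
library(dplyr)

# Toy data with assumed column names; each question has two tags.
questions <- tibble(
  id = c(1L, 2L),
  score = c(5L, -1L)
)
question_tags <- tibble(
  question_id = c(1L, 1L, 2L, 2L),
  tag_id = c(10L, 20L, 10L, 30L)
)
tags <- tibble(
  id = c(10L, 20L, 30L),
  tag_name = c("dplyr", "ggplot2", "shiny")
)

# First join on the question ID, then match tag_id to the tags table's id.
questions_with_tags <- questions %>%
  inner_join(question_tags, by = c("id" = "question_id")) %>%
  inner_join(tags, by = c("tag_id" = "id"))

# More rows than questions, because each row is a question-tag pair.
nrow(questions)
nrow(questions_with_tags)
```

Here two toy questions turn into four rows, the same effect that makes the joined table larger than the original questions table.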

8. Most common tags

There's a lot you can learn from this joined data. For instance, you could find the most common tags on R questions using dplyr's count verb with sort = TRUE. Some of the most common tags on R questions include ggplot2, dataframe, Shiny, and of course, the package you're learning about in this course: dplyr.
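As a small sketch of that counting step, assuming the joined table has a tag_name column (the column name and the toy rows below are assumptions):

```r
library(dplyr)

# Toy version of the joined table; one row per question-tag pair.
questions_with_tags <- tibble(
  id = c(1L, 1L, 2L, 2L, 3L),
  tag_name = c("dplyr", "ggplot2", "dplyr", "shiny", "dplyr")
)

# Count rows per tag and put the most frequent tags first.
tag_counts <- questions_with_tags %>%
  count(tag_name, sort = TRUE)

tag_counts
```

count() adds a column named n holding the number of rows per tag, and sort = TRUE orders the result by n in descending order, so the most common tag in the toy data lands in the first row.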

9. Let's practice!

In the exercises, you'll do more joining and draw more insights from this new dataset. Let's get started!