Sorting and filtering summarized data

1. Sorting and filtering summarized data

2. by_country dataset

In your last exercise, you created a dataset called by_country, containing one row for each country with the total number of votes and the percentage of votes that were yes. Now you might be interested in knowing which country voted yes the most or least often.

3. dplyr verb: arrange()

To discover this, we’ll introduce one more dplyr verb: arrange. Arrange sorts a dataset by one or more of its variables, in ascending order by default or in descending order when you wrap the variable in desc. This is useful for pulling out the most extreme rows, such as the highest or lowest values, from your data.
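As a minimal sketch of both sort directions, the example below builds a small stand-in for by_country. The country names and numbers are invented purely for illustration and are not the real voting data; only the column names total and percent_yes follow the dataset described in this lesson.

```r
library(dplyr)

# Hypothetical stand-in for by_country; every value here is invented
by_country <- data.frame(
  country = c("Country A", "Country B", "Country C"),
  total = c(2400, 2, 1900),
  percent_yes = c(0.62, 0, 0.27)
)

# Ascending is the default: the lowest percent_yes comes first
lowest_first <- by_country %>%
  arrange(percent_yes)

# Wrap the column in desc() to put the highest percent_yes first
highest_first <- by_country %>%
  arrange(desc(percent_yes))

print(lowest_first)
print(highest_first)
```

Note that arrange only reorders rows. It never drops any, so a country with very few votes still shows up in the sorted result.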

4. arrange()

Here, we could pipe by_country to arrange, telling it to sort by the percent_yes column. We’d see that Zanzibar is the country that voted yes the least often in our dataset, followed by the United States. But we might also notice that Zanzibar had only two votes in our entire dataset, which means that its 0% is basically meaningless! This is a very common way that summarized data can trip you up, and it’s why you have to be careful not to interpret your results too quickly. To fix this, in your exercises you’ll filter the dataset to remove countries with a low total number of votes, just as you used filter earlier to remove vote rows we didn’t care about.
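One way this filter-then-sort step might look is sketched below. The Zanzibar row (two votes, 0% yes) comes from the narration above; the other rows and the cutoff of 100 votes are assumptions made up for illustration, so pick a threshold that suits your own data.

```r
library(dplyr)

# Stand-in for by_country: the Zanzibar row follows the narration,
# the remaining rows are hypothetical placeholders
by_country <- data.frame(
  country = c("Zanzibar", "Country A", "Country B", "Country C"),
  total = c(2, 2500, 1800, 2100),
  percent_yes = c(0, 0.27, 0.85, 0.55)
)

# Assumed cutoff for "enough votes to trust the percentage"
min_votes <- 100

# Drop low-total countries first, then sort what remains
reliable <- by_country %>%
  filter(total >= min_votes) %>%
  arrange(percent_yes)

print(reliable)
```

Filtering before sorting means the top of the sorted result reflects countries with enough votes for percent_yes to be informative.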

5. Transforming tidy data

Notice that filter isn’t just useful for cleaning your raw data; it’s also useful for manipulating your summarized data. That’s why it’s important to get comfortable using each of these dplyr verbs at every stage of an analysis.

6. Let's practice!