Text as data

1. Text as data

Hello, I'm Maham. Welcome to the course! For this course, I'll assume you are already familiar with the essential verbs or functions from dplyr and ggplot2. Together, we'll use these to cover text wrangling, visualization, sentiment analysis, and topic modeling.

2. Using the tidyverse

dplyr and ggplot2 are part of the tidyverse, a collection of packages that follow the same principles and are designed to work well together. While there are other ways to approach data analysis in R, the tidyverse is incredibly powerful and approachable and is largely the reason why it's such a great time to learn R. In this course, we'll be analyzing text using the tidyverse and related packages.

5. Loading packages

To start using the tidyverse, you first need to load its packages using the library() function. Instead of loading each package separately, you can load the core collection of tidyverse packages using library(tidyverse). A list of the packages that are loaded and ready to use is printed for us. Here you can see dplyr and ggplot2, along with a number of other packages.
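As a minimal sketch of this step, assuming the tidyverse is already installed, the single library() call below attaches the core packages. The check with search() is only an illustration added here to confirm that dplyr and ggplot2 are available.

```r
# Attach the core tidyverse packages in one call
# (assumes install.packages("tidyverse") has been run beforehand).
library(tidyverse)

# Confirm that dplyr and ggplot2 are now on the search path.
core_loaded <- c("package:dplyr", "package:ggplot2") %in% search()
core_loaded
```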

6. Importing review data

One of the packages we loaded was readr. You can use its read_csv() function to import data into R. Here we read in a comma-separated values (or CSV) file and assign it to review_data. If you print out review_data, you can see in the first line that the data is stored as a tibble, which is a type of data frame used by the tidyverse and related packages. You can see that there are 1,833 rows. You can also see that there are four columns: the date the review was written, the product being reviewed, the star rating each reviewer gave the product, and the review itself.
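Here is a rough sketch of the import step. The course file itself is not reproduced here, so the code writes a tiny stand-in CSV to a temporary path first. The column names (date, product, stars, review), the product names, and the two rows are assumptions made up for illustration.

```r
library(readr)

# Hypothetical stand-in for the course data: a two-row CSV in a temp file.
csv_path <- tempfile(fileext = ".csv")
writeLines(c(
  "date,product,stars,review",
  "2015-02-28,Roomba 650,5,Picks up pet hair well.",
  "2015-01-12,Roomba 880,3,Works but gets stuck under the sofa."
), csv_path)

# read_csv() parses the file and returns a tibble.
review_data <- read_csv(csv_path)

# Printing shows the tibble dimensions and the type of each column.
review_data
```

With the real file, you would pass its path to read_csv() instead of csv_path.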

7. Using filter() and summarize()

Let's compute the average star rating for one of the products. To do this, we first pipe review_data into the filter() verb. We use the double equals to tell R we want this to be a comparison and put quotes around the name of the product that we want to keep reviews for. In this case we only want the rows where the product column is equal to the 650 Roomba model. We then pipe this filtered data frame into the summarize() verb and call the mean() function on the stars column and assign it to a new column called stars_mean using a single equals sign. This creates a new data frame composed of a single row and column with the average star rating for the 650 Roomba model.
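The pipeline just described could look roughly like this. The small review_data tibble, including the product label "Roomba 650", is a made-up stand-in so the example runs on its own. Only the column names product and stars and the new column stars_mean come from the description above.

```r
library(dplyr)

# Hypothetical stand-in data with the columns used in this step.
review_data <- tibble(
  product = c("Roomba 650", "Roomba 650", "Roomba 880", "Roomba 880"),
  stars   = c(5, 4, 5, 3),
  review  = c("Great on carpet.", "Loud but effective.",
              "Love the scheduling.", "Gets stuck often.")
)

# Keep only the rows for one product (== is a comparison),
# then collapse them to a single average (= names the new column).
roomba_650_mean <- review_data %>%
  filter(product == "Roomba 650") %>%
  summarize(stars_mean = mean(stars))

roomba_650_mean
```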

8. Using group_by() and summarize()

We could repeat this process to compute the average star rating for other products, or we can just use the group_by() function in place of filter(). Here we use group_by() to specify which column defines the groups and pipe this into summarize(). The average star rating for the two products is nearly identical.
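A sketch of the grouped version, again on made-up stand-in data (the product labels and ratings are assumptions, so the averages here will not match the course output):

```r
library(dplyr)

# Hypothetical stand-in data with the columns used in this step.
review_data <- tibble(
  product = c("Roomba 650", "Roomba 650", "Roomba 880", "Roomba 880"),
  stars   = c(5, 4, 5, 3)
)

# group_by() marks the grouping column; summarize() then returns
# one row per product instead of one row overall.
stars_by_product <- review_data %>%
  group_by(product) %>%
  summarize(stars_mean = mean(stars))

stars_by_product
```

Swapping filter() for group_by() is the only change from the single-product pipeline, which is why this scales to any number of products without repeating code.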

9. Unstructured data

We might naively try to summarize the review column with mean() in the same way, but R cannot average text: it returns NA along with a warning that the argument is not numeric. Text is data, just like the star rating, but it's currently unstructured. We'll need to add structure before we can analyze it.
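To see what happens when mean() meets text, here is a small sketch on made-up stand-in data. Calling mean() on a character column does not yield a usable number: R returns NA and warns that the argument is not numeric or logical. The warning is suppressed below only to keep the output short.

```r
library(dplyr)

# Hypothetical stand-in data: the review column is plain text.
review_data <- tibble(
  stars  = c(5, 3),
  review = c("Picks up pet hair well.", "Gets stuck under the sofa.")
)

# mean() has no meaning for character data, so the result is NA.
text_mean <- suppressWarnings(
  review_data %>% summarize(review_mean = mean(review))
)

text_mean
```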

10. Let's practice!

These dplyr functions will be essential moving forward. Let's solidify this review with some practice!