1. Splitting strings
When you use str_detect, str_subset and str_count, the pattern is of central interest: you want to know where it occurs or how often it occurs. In your next stringr function, str_split, the pattern is not directly of interest; it just provides a useful way to split the string into pieces.
2. str_split()
Take the string "Tom & Jerry". We aren't really interested in
3. str_split()
the ampersand, but it provides a useful way to split the string into the two characters. We pass in the string and the pattern, and the output is a list containing a vector of strings, one element for each piece: in this case, a vector with two elements, the string "Tom" and the string "Jerry". str_split will split a string into as many pieces as it can, based on how many times the pattern occurs, so a string in which the pattern occurs twice ends up as three parts. If you want to limit the number of pieces, you can specify the n argument, which sets the maximum number of pieces returned. If we specify n = 2, such a string is only split at the first occurrence of the pattern and we get two parts back. You might be wondering why
4. str_split() returns a list
str_split returns a list. Well, if we pass in a vector of strings, there is no guarantee that all the strings will be split into the same number of pieces, and a list won't complain if each element has a different length. Of course, if you specify n, or you know all the strings have the same number of pieces, you might want a simpler kind of output, like a matrix. You can ask for this kind of output
5. str_split() can return a matrix
by specifying simplify = TRUE. You'll get a character matrix back where each row corresponds to an input string and each column to a piece of the split. There will be as many columns as the largest number of pieces, and any strings with fewer pieces will be padded with empty strings. Specifying both n and simplify guarantees you'll get no surprises in the dimensions of the output: you'll always get n columns. Sometimes, though, you'll want the variable-length output.
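To make the three output shapes concrete, here is a minimal sketch of the default list output, the n argument, and simplify = TRUE. The example strings and the " & " pattern (with surrounding spaces, so the pieces come back without stray whitespace) are assumptions chosen for illustration, not necessarily what the slides show.

```r
library(stringr)

# Hypothetical example strings, chosen only for illustration
chars <- c("Tom & Jerry", "Alvin & Simon & Theodore")

# Default output: a list with one character vector per input string
pieces <- str_split(chars, pattern = " & ")
pieces[[1]]  # "Tom" "Jerry"
pieces[[2]]  # "Alvin" "Simon" "Theodore"

# n caps the number of pieces; the last piece keeps the unsplit remainder
capped <- str_split(chars, pattern = " & ", n = 2)
capped[[2]]  # "Alvin" "Simon & Theodore"

# simplify = TRUE returns a character matrix, padded with empty strings
mat <- str_split(chars, pattern = " & ", simplify = TRUE)
dim(mat)     # 2 3
mat[1, 3]    # ""

# n together with simplify fixes the number of columns at n
fixed_mat <- str_split(chars, pattern = " & ", n = 2, simplify = TRUE)
dim(fixed_mat)  # 2 2
```

Notice that with n = 2 the second string is only split at the first " & ", so "Simon & Theodore" stays together as the final piece.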
6. Combining with lapply()
To process the un-simplified output further, you'll generally use lapply or sapply or an equivalent. Remember that lapply takes a list as its first argument and a function as its second. The function is applied to each element of the list, and a list is returned. For example, we might use lapply to find the length of each split, which tells us how many pieces each string was split into. I'd highly recommend learning about the purrr map functions, which work in a similar way; you can learn about them in the "Writing functions in R" course I teach with my brother Hadley. You'll try str_split with
7. Let's practice!
both the simplified and unsimplified output in the following exercises.
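As a warm-up for the exercises, the lapply step described earlier might look something like the sketch below. The example strings are again hypothetical, and sapply is shown only as the simplifying counterpart that returns a vector instead of a list.

```r
library(stringr)

# Hypothetical example strings, chosen only for illustration
chars <- c("Tom & Jerry", "Alvin & Simon & Theodore")

# Un-simplified output: a list of character vectors of varying length
pieces <- str_split(chars, pattern = " & ")

# lapply applies length() to each list element and returns a list
n_pieces <- lapply(pieces, length)
n_pieces[[1]]  # 2
n_pieces[[2]]  # 3

# sapply does the same but simplifies the result to an integer vector
n_pieces_vec <- sapply(pieces, length)
n_pieces_vec   # 2 3
```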