1. readxl (2)
Next to the path and sheet arguments that we discussed in the previous video,
2. read_excel()
there are also other arguments you can specify in the read_excel function. Have a look at this documentation; can you tell the meaning of these arguments?
path and sheets have no more secrets for you.
3. read_excel() - col_names
Then, there's col_names, which can be three things: TRUE, the default, which means what the first row of the Excel sheet contains the column names. It can also be FALSE, in which case R comes up with its own names. Finally, you can also set it to a character vector specifying the column names.
Excel can contain different data types, such as text, numerics and dates. How this converts to R classes is controlled in the col_types argument:
4. read_excel() - col_types
By default it is NULL, which has R guess the data types of the different columns. But you can also manually specify the col_types. Suppose you want to import all the columns of the first sheet as text. This command will do the trick:
The column classes we can see in the printout reveal that indeed, the pop_data variable contains two columns that are both character now, perfect.
Other keywords to enforce a type are "numeric", "date" and "blank". "numeric" and "date" are straightforward, but what about "blank"?
5. read_excel() - col_types
Well, if you use blank, you're simply ignoring that column. If we adapt the previous call, like this,
you'll see that the population column is missing. This "blank" keyword can come in handy when you have an Excel file with a lot of columns and you only need some of them in R.
6. read_excel() - skip
Next, there's the skip argument. It specifies the number of rows in your excel sheet R has to skip before actually importing the data. Let's say the first two rows in the first sheet of cities-dot-xlsx are not necessary for our analysis. To ignore them, we can simply set the skip argument to 2 and read_excel will not import these lines. we'll also have to set the col_names argument, because the first row with the column names is skipped as well.
If we try out this code, indeed 2 rows were skipped; the row containing the column names and the first observations. That leaves us with the last three observations.
7. Wrap-up
Actually, the excel_sheets and read_excel are the only functions that are available in the readxl package. These two functions and the customization possibilities are all you need to get started with your excel data in R. On top of all that, readxl is extremely fast.
You might have recognized many of the arguments of the readr package of the previous chapter. Hadley Wickham made reading data from different data sources pretty consistent. This can only make your job of importing data easier, right?
8. Let's practice!
Get some practice with it in the exercises!