haven

1. haven

One thing we haven't discussed yet is data from other

2. Statistical Software Packages

statistical software packages. The most common

3. Statistical Software Packages

ones are SAS, short for Statistical Analysis System, STATA, whose name combines statistics and data, and SPSS, or the Statistical Package for the Social Sciences. Which package people use depends on their field of study or personal preference.

4. Statistical Software Packages

SAS, for example, is one of the most widespread business analytics tools and is also commonly used in biostatistics and the medical sciences. STATA, on the other hand, is a typical tool for economists. SPSS is often used in the social sciences, hence the name.

5. Statistical Software Packages

In the end, each package uses and produces its own file types. The most common extensions are listed here.

6. R packages to import data

No matter which package your data comes from, R is prepared for every file that comes along! In the rest of this chapter, you will learn how to use two R packages that can import data from these software environments: haven and foreign. The first was written by Hadley Wickham, the second by the R core team. The foreign package has been around longer, while haven is still under active development at the time of recording, late 2015. Wickham aims to provide a more consistent, easier to use and faster alternative to foreign. On the other hand, foreign supports more data formats. But let's not jump to conclusions here. In this video, I'll talk about haven some more, and in the next video I'll talk about foreign. After that, you can choose for yourself which package you prefer.

7. haven

So, the haven package. This package can deal with SAS, STATA and SPSS data files. It does this by wrapping the ReadStat C library by Evan Miller. Just like readr and readxl, the package is extremely simple to use. You pass the path to the data file and an R data frame results. After you've installed haven with install.packages(), you can load it with the library() function.
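As a minimal sketch of the installation and loading step just described (the install line is commented out on the assumption that haven may already be installed):

```r
# One-time installation from CRAN; uncomment if haven is not installed yet
# install.packages("haven")

# Load the package for the current R session
library(haven)
```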

8. SAS data

Let's start with loading a SAS data file. Suppose you have a file, ontime.sas7bdat, in your current working directory. It contains data on the percentage of flights that arrived on time for several airlines in the US. To import this data as a data frame called ontime, you simply use the function read_sas() and pass the path to the data file.
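The call could look roughly like this. The file name comes from the example above; the file.exists() guard is an addition for illustration, so that the snippet also runs when the file is not present.

```r
library(haven)

# File name from the example; assumed to sit in the current working directory
path <- "ontime.sas7bdat"

if (file.exists(path)) {
  # read_sas() returns a data frame (a tibble)
  ontime <- read_sas(path)
  print(ontime)
} else {
  message("File not found: ", path)
}
```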

9. SAS data

If you print its structure, you'll see that each variable in the data frame also has a label attribute. If you're familiar with SAS, you know that you can label variables in SAS datasets. Well, it's these same labels that are also available inside R now.
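To see what such a label attribute looks like without the SAS file at hand, here is a small stand-in. The column name and label text are made up for illustration; only the 79 percent figure echoes the example from this video.

```r
# Hypothetical stand-in for one imported column, not the real ontime data
ontime <- data.frame(Delta = 79)
attr(ontime$Delta, "label") <- "Delta Air Lines"

# str() lists the label as an attribute of the column
str(ontime)

# Read a single label, or collect the labels of all columns in a list
attr(ontime$Delta, "label")
lapply(ontime, attr, "label")
```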

10. SAS data

When you simply print ontime, you don't see any difference from a normal data frame without labels.

11. SAS data

If you use RStudio's View function to explore a dataset, though,

12. SAS data

you'll see the labels. Can you read the data here? In March of 1999, for example,

13. SAS data

it appears that around 79 percent of all Delta Air Lines flights were on time.

14. STATA data

Next up is STATA. Haven is able to import both Stata 13 and 14 files with the read_stata function. You can also use read_dta, which does exactly the same.

15. STATA data

Just like before, simply passing the path to the .dta file will do the trick. Suppose that the same statistics on the US airlines are now available as a .dta file, ontime.dta, in your current working directory. You can try either one of these calls. The printout looks pretty familiar again, but there's something different here. The names of the airlines have been converted to numbers; they aren't character strings anymore. How did that come about?
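Here is a sketch you can run without ontime.dta: it writes a tiny labelled data set to a temporary .dta file and reads it back. The airline names, codes and percentages are made up for illustration. Note that recent haven versions report the class as haven_labelled rather than labelled.

```r
library(haven)

# Made-up stand-in data: numeric codes with value labels for the airlines
demo <- data.frame(OnTime = c(79, 75))
demo$Airline <- labelled(c(1, 2), labels = c(Delta = 1, United = 2))

# Write to a temporary Stata file so there is something to import
path <- tempfile(fileext = ".dta")
write_dta(demo, path)

# read_dta() and read_stata() are interchangeable
ontime <- read_dta(path)
same   <- read_stata(path)

print(ontime)          # Airline is shown as numeric codes with labels
class(ontime$Airline)  # a labelled vector class, not character
```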

16. STATA data

If you have a look at the class of the Airline column of ontime, it appears to be of class labelled. This is the R version of the labelled vector, a common data structure in other statistical packages. If you simply print this Airline column, you can see the airline names from before. Each airline name is paired with a number, assigned in alphabetical order. As you want to continue your analysis in R, it's a good idea to convert this vector to a standard R class, such as a factor.

17. as_factor()

Instead of base R's as.factor() function, you'll need haven's as_factor() for this. The result is a factor, the type of categorical variable we're used to. In this case, it might be even better to have simple character strings for the airline names, as these are not really categories. The base R as.character() function can do this for you. Let's just wrap it around the previous call.

18. as_factor()

If you assign this result to the Airline column of ontime again, you've made the ontime data frame ready for some more analysis, with the names as simple character strings.
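The conversion steps can be sketched on a small made-up data frame. The airline names, codes and percentages below are illustrative assumptions, not the real ontime data.

```r
library(haven)

# Made-up stand-in with a labelled Airline column
ontime <- data.frame(OnTime = c(79, 75, 80))
ontime$Airline <- labelled(c(1, 2, 1), labels = c(Delta = 1, United = 2))

# haven's as_factor() uses the value labels as factor levels
airline_factor <- as_factor(ontime$Airline)

# Wrapping the call in as.character() gives plain character strings
airline_names <- as.character(as_factor(ontime$Airline))

# Assign the result back to the Airline column
ontime$Airline <- airline_names
str(ontime)
```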

19. SPSS data

Last but not least, there is SPSS data. Here, you'll want to use read_spss(). Based on the extension, haven will decide for you which function to call: read_por() for .por files, or read_sav() for .sav files. Let's once more load the airline data, which is stored as a .sav file in the datasets folder of our personal directory. Again, a data frame results. The Airline column is a so-called labelled vector again. The column names here are slightly different from before.
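As with STATA, a round trip through a temporary file gives a runnable sketch. The data below is made up for illustration, and the temporary path stands in for the .sav file mentioned above.

```r
library(haven)

# Made-up stand-in data with a labelled Airline column
demo <- data.frame(OnTime = c(79, 75))
demo$Airline <- labelled(c(1, 2), labels = c(Delta = 1, United = 2))

# Write a temporary .sav file to import
path <- tempfile(fileext = ".sav")
write_sav(demo, path)

# read_spss() looks at the extension and hands .sav files to read_sav()
ontime_spss <- read_spss(path)

print(ontime_spss)
class(ontime_spss$Airline)  # a labelled vector again
```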

20. Statistical Software Packages

It should be clear by now: haven is incredibly easy to use and simply does what it's supposed to. Have a quick look at the summary here, now with the corresponding functions.

21. Let's practice!

Now go ahead and get some practice!
