Exercise

Using rxDataStep() to Transform Data

In this exercise, we will use rxDataStep() to compute some simple transforms on the mortgage dataset that I created in the prior video. The path to this dataset is stored in the variable mortData.

Instructions

Before we start using rxDataStep(), go ahead and get some information on the object mortData by using rxGetInfo().

Use rxDataStep() to compute a few simple transforms.

The syntax is: rxDataStep(inData, outFile, transforms, …)

  • inData - The input data set that you would like to use.
  • outFile - The output dataset that you would like to create. If this is the same as inData, the overwrite argument must be set to TRUE; if you also want to add a new variable to it, the append argument should be set to "cols".
  • transforms - A list specification of the (simple) transforms you would like to compute.
  • … - Additional arguments.
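
As a rough sketch of how these arguments fit together, the calls below first write to a new file and then write back to the same file. Several things here are assumptions rather than part of the exercise: the sample file mortDefaultSmall.xdf stands in for mortData, the variables ccDebtK and ccDebtK2 are made-up names, and the output goes to a temporary file. The RevoScaleR calls are skipped when the package is not installed.

```r
# Sketch only: needs the RevoScaleR package; skipped when it is unavailable.
outPath <- file.path(tempdir(), "sketchMD.xdf")  # hypothetical output file

if (requireNamespace("RevoScaleR", quietly = TRUE)) {
  # Assumption: the bundled sample file stands in for mortData.
  inPath <- file.path(RevoScaleR::rxGetOption("sampleDataDir"),
                      "mortDefaultSmall.xdf")

  # New output file: append is not needed.
  # overwrite = TRUE is set only so the sketch can be rerun.
  RevoScaleR::rxDataStep(inData = inPath, outFile = outPath,
                         transforms = list(ccDebtK = ccDebt / 1000),
                         overwrite = TRUE)

  # Same file for input and output: append = "cols" and overwrite = TRUE.
  RevoScaleR::rxDataStep(inData = outPath, outFile = outPath,
                         transforms = list(ccDebtK2 = ccDebtK^2),
                         append = "cols", overwrite = TRUE)

  # List the variables now stored in the output file.
  print(RevoScaleR::rxGetVarInfo(outPath))
}
```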

The key to using rxDataStep() to compute simple transformations is to understand how to use the transforms argument. Generally, transforms will be a list in which each element is named after the new variable you want to create, and its value is the expression that computes that variable. For example, if I wanted to create a new variable called highDebtRow in which I tag individuals with more than $8,000 in credit card debt, I could specify transforms as:

transforms = list(highDebtRow = ccDebt > 8000)
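
To see what such an expression produces, here is a small base R stand-in that does not use rxDataStep() at all. The toy data frame and its values are invented for illustration; only the column name ccDebt and the threshold come from the example above.

```r
# Hypothetical toy values; the real ccDebt column lives in the mortgage data.
toy <- data.frame(ccDebt = c(2500, 8000, 9100, 12000))

# Base R stand-in for transforms = list(highDebtRow = ccDebt > 8000):
# evaluate the expression with the column names in scope.
toy$highDebtRow <- with(toy, ccDebt > 8000)

print(toy)
```

Note that a row with exactly 8000 in ccDebt is tagged FALSE, because the comparison is strictly greater than.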

Go ahead and use rxDataStep() to create a new variable highDebtRow that is TRUE for rows with ccDebt greater than 8000, and FALSE otherwise. Note: Usually, you can specify outFile as the same file as inData. In this case, however, write to a new file: create a new variable called myMortData, and store the file in the current working directory as "myMD.xdf".

Once you are done, use rxGetVarInfo() to make sure your new variable exists in the new file, and that it has an appropriate range.

Next, use rxSummary() to find out the proportion of observations that have high debt by this criterion.
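
The reason a summary can answer this question is that the mean of a TRUE/FALSE variable equals the proportion of TRUE values. The base R sketch below shows this on invented values; it assumes that rxSummary() treats the logical column as 0/1 when reporting its mean.

```r
# Hypothetical tags for five rows.
highDebtRow <- c(FALSE, FALSE, TRUE, TRUE, FALSE)

# TRUE counts as 1 and FALSE as 0, so the mean is the share of TRUE rows.
propHigh <- mean(highDebtRow)
print(propHigh)
```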

Because the transforms argument is a list, we can have multiple, comma-separated transforms listed in the same transforms step.

Go ahead and run another call to create two more variables in myMortData, in which the transforms argument is used to compute two transformations:

  • newHouse = houseAge < 10
  • ccsXhd = creditScore * highDebtRow

The first simply creates a variable that tags new houses as new, and the second constructs the interaction term between creditScore and our new high debt variable. In this case, we can go ahead and use the same values for both inData and outFile. Just remember that you need to set the append argument to "cols", and that you must set overwrite to TRUE.
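
To make the two transforms concrete, the base R sketch below applies the same expressions to an invented three-row data frame; it illustrates the arithmetic only and is not a substitute for the rxDataStep() call. Multiplying by a logical variable coerces TRUE to 1 and FALSE to 0, so the interaction term equals creditScore for high-debt rows and 0 otherwise.

```r
# Hypothetical toy rows using the column names from the exercise.
toy <- data.frame(creditScore = c(700, 640, 710),
                  houseAge    = c(3, 25, 10),
                  highDebtRow = c(TRUE, FALSE, TRUE))

# Both expressions are evaluated with the column names in scope.
toy$newHouse <- with(toy, houseAge < 10)
toy$ccsXhd   <- with(toy, creditScore * highDebtRow)

print(toy)
```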