More Complex Transforms Using transformFuncs

Within rxDataStep() we can compute more complex transformations using the transformFunc argument.

Use rxDataStep() to compute a standardized (mean = 0, standard deviation = 1) credit score variable.

The syntax is: rxDataStep(inData, outFile, transformFunc, transformObjects, …)

inData - The input data to read from (for example, an .xdf file)
outFile - The output file to write
transformFunc - A user-defined function that computes the transformations. This function is evaluated in a sterile environment, so any external information (like the global mean of a variable) must be passed in via transformObjects.
transformObjects - A named list of the objects that transformFunc needs.

One of the most important principles when dealing with RevoScaleR functions is that data are only read in one chunk at a time. Because of this, when any given chunk is being read in, the algorithms only know about that particular subset of the data. If we want to standardize a variable based on the mean and standard deviation of the variable across the entire dataset, then the first thing we need to do is to compute the mean and standard deviation of the variable across the dataset. Fortunately, this is easy to do with rxSummary().
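To see why the chunk-by-chunk view matters, the following base R sketch standardizes a small made-up vector two ways: once with per-chunk statistics and once with statistics from the whole vector. The values and the two-chunk split are illustrative assumptions, not taken from mortData, and no RevoScaleR functions are used.

```r
# Made-up credit scores, split into two "chunks" of three values each.
scores <- c(580, 600, 620, 780, 800, 820)
chunks <- split(scores, rep(1:2, each = 3))

# Standardizing with per-chunk statistics: each chunk only sees itself.
perChunkScaled <- unlist(
  lapply(chunks, function(x) (x - mean(x)) / sd(x)),
  use.names = FALSE
)

# Standardizing with statistics computed across all of the data.
globalMean <- mean(scores)
globalSd <- sd(scores)
globalScaled <- (scores - globalMean) / globalSd

# Per-chunk scaling gives 580 and 780 the same value, hiding the
# difference between chunks; global scaling keeps them apart.
print(perChunkScaled)
print(globalScaled)
```

This is why the global mean and standard deviation have to be computed first and then handed to the transform.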

Use rxSummary() to compute the summary statistics for the creditScore variable in mortData, and assign the output of the call to rxSummary() to the object csSummary. Next print out those values. The object csSummary contains a number of elements that you can see with either names(csSummary) or with str(csSummary). The name of the element containing the actual summary statistics is sDataFrame.

Create a new variable meanCS and assign it the value of the first row of the Mean column in csSummary$sDataFrame.
Create a new variable sdCS and assign it the value of the first row of the StdDev column in csSummary$sDataFrame.
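A minimal sketch of these steps might look like the following. It assumes RevoScaleR is installed and that mortData already exists in the session, as in the exercise. When either is missing, the code falls back to a stand-in list with placeholder numbers (not real results) so that the extraction lines can still be tried.

```r
if (requireNamespace("RevoScaleR", quietly = TRUE) && exists("mortData")) {
  # Summary statistics for creditScore across the whole dataset.
  csSummary <- RevoScaleR::rxSummary(~ creditScore, data = mortData)
} else {
  # Stand-in with the same shape as the element we need.
  # The numbers are placeholders for illustration only.
  csSummary <- list(
    sDataFrame = data.frame(Name = "creditScore", Mean = 700, StdDev = 50)
  )
}

# Inspect what the summary object contains.
print(names(csSummary))
print(csSummary$sDataFrame)

# First row of the Mean and StdDev columns.
meanCS <- csSummary$sDataFrame$Mean[1]
sdCS <- csSummary$sDataFrame$StdDev[1]
```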

Now that we have extracted the mean and standard deviation, we are ready to write the function to use as transformFunc. This function should take a single argument, a list whose elements are the variables in the dataset, and it should return a list. In this case, we want to add a new element to that list (a new variable named scaledCreditScore) and assign it the original credit score minus the mean credit score, divided by the standard deviation.
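One way to write such a function is sketched below. The argument name dataList is an arbitrary choice, and myCenter and myScale are the names the exercise uses for the mean and standard deviation inside the function. To try the function outside rxDataStep(), the sketch defines those two names and a toy chunk by hand; the toy values are made up.

```r
# Transform function: receives the data as a list, adds one element,
# and returns the list.
scaleCS <- function(dataList) {
  dataList$scaledCreditScore <- (dataList$creditScore - myCenter) / myScale
  dataList
}

# Quick check outside rxDataStep(), using made-up values.
# Inside rxDataStep() these two names would come from transformObjects.
myCenter <- 700
myScale <- 100
toyChunk <- list(creditScore = c(600, 700, 800))

out <- scaleCS(toyChunk)
print(out$scaledCreditScore)
```

Note that the function keeps the original creditScore element and only adds the new one.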

Once we have defined this new function scaleCS, we only have to specify two things in rxDataStep(). We have to specify the value of the transformFunc argument (scaleCS), and we have to specify the values of the myCenter and myScale variables used within it. The transformFunc function is evaluated in a sterile environment, so it needs to be explicitly told what those values are via the transformObjects argument. transformObjects takes the form of a list in which the names of the elements correspond to the variables that will be accessible to the transformFunc function, and the values are those that are available in the global environment. In this case, we should assign an element named myCenter the value of meanCS, and we should assign an element named myScale the value of sdCS.
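The call described above could be sketched as follows. The output path outPath is a hypothetical name chosen for this example, and the fallback values for meanCS and sdCS are placeholders used only when those objects do not already exist. The rxDataStep() call itself runs only if RevoScaleR is installed and mortData is available.

```r
# Transform function, repeated here so the sketch is self-contained.
scaleCS <- function(dataList) {
  dataList$scaledCreditScore <- (dataList$creditScore - myCenter) / myScale
  dataList
}

# Placeholders, used only if the real values were not computed earlier.
if (!exists("meanCS")) meanCS <- 700
if (!exists("sdCS")) sdCS <- 50

# Element names are what scaleCS sees; values come from this session.
transformObjs <- list(myCenter = meanCS, myScale = sdCS)

outPath <- "myMortData.xdf"  # hypothetical output file name

if (requireNamespace("RevoScaleR", quietly = TRUE) && exists("mortData")) {
  RevoScaleR::rxDataStep(
    inData = mortData,
    outFile = outPath,
    transformFunc = scaleCS,
    transformObjects = transformObjs,
    overwrite = TRUE
  )
}
```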

Once you have run rxDataStep(), run rxGetVarInfo() to get information about the new variable, and then run rxSummary() to make sure that the variable is in fact standardized (i.e., it has a mean of 0 and a standard deviation of 1).
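A possible way to run this check is sketched below. The helper looksStandardized and the file name in outPath are hypothetical names introduced for this example. The helper compares against 0 and 1 with a small tolerance, since the computed values will usually differ from the exact targets by floating-point error. The RevoScaleR calls run only if the package is installed and the output file exists.

```r
# Hypothetical helper: TRUE when the mean is close to 0 and the
# standard deviation is close to 1, within a tolerance.
looksStandardized <- function(m, s, tol = 1e-6) {
  abs(m) < tol && abs(s - 1) < tol
}

outPath <- "myMortData.xdf"  # hypothetical output file name

if (requireNamespace("RevoScaleR", quietly = TRUE) && file.exists(outPath)) {
  # Variable information, including the new scaledCreditScore.
  print(RevoScaleR::rxGetVarInfo(outPath))

  # Summary statistics for the new variable.
  chk <- RevoScaleR::rxSummary(~ scaledCreditScore, data = outPath)
  print(chk$sDataFrame)
  print(looksStandardized(chk$sDataFrame$Mean[1], chk$sDataFrame$StdDev[1]))
}
```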
