Exercise

Create some Decision Trees

Create a regression tree predicting default status using rxDTree().

Instructions

100 XP

Use rxDTree() to create a regression tree for the mortData dataset.

rxDTree() is the RevoScaleR function that computes both regression and classification trees. It decides which kind of tree to compute based on the type of the dependent variable used in its formula argument. If that variable is numeric, then it produces a regression tree; if that variable is a factor, then it produces a classification tree.

The syntax is: rxDTree(formula, data, maxdepth, …)

  • formula - The model specification
  • data - The data set in which to search for variables used in formula
  • maxdepth - The maximum depth of the regression tree to estimate
  • … - Additional arguments

Go ahead and use rxDTree() to produce a decision tree predicting default status from credit score, credit card debt, years employed, and house age, using a maximum depth of 5. Use rowSelection to estimate this tree on only the data from the year 2000. Assign the output to an object named regTreeOut. This can take a few seconds, so be patient.
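A sketch of that call might look like the following. The column names used here (default, creditScore, ccDebt, yearsEmploy, houseAge, year) are assumptions; adjust them to match the variables in your copy of mortData.

```r
library(RevoScaleR)

# Fit a regression tree on year-2000 records only.
# Variable names below are illustrative -- check names(mortData).
regTreeOut <- rxDTree(default ~ creditScore + ccDebt + yearsEmploy + houseAge,
                      data = mortData,
                      rowSelection = (year == 2000),
                      maxdepth = 5)
```

Because default is numeric here, rxDTree() fits a regression tree; the rowSelection expression is evaluated within the dataset, so no quoting is needed.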

Once you have created this object, you can print it in order to view a summary and textual description of the tree. Go ahead and print regTreeOut and spend a couple of minutes looking at the output.
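For example, printing the fitted object produces the textual summary described above:

```r
# Display a summary and text description of each node of the tree.
print(regTreeOut)
```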

Although the text output can be useful, it is usually more intuitive to visualize such an analysis with a dendrogram. You can produce this type of visualization a few ways. First, you can make the output of rxDTree() behave like an object produced by the rpart() function by using rxAddInheritance(). Once that is done, you can use all of the methods associated with rpart: calling plot() on the object after adding inheritance produces a dendrogram, and calling text() on it afterward adds appropriate labels in the correct locations. Go ahead and practice producing this dendrogram.
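The plotting steps above can be sketched as:

```r
# Make the rxDTree output inherit from rpart, then use rpart's plot methods.
plot(rxAddInheritance(regTreeOut))   # draw the dendrogram
text(rxAddInheritance(regTreeOut))   # add split labels at the nodes
```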

Second, in most cases you can use the RevoTreeView library by loading that library and running createTreeView() on the regTreeOut object. This does not work on the DataCamp platform, but it typically opens an interactive web page where you can expand nodes of the decision tree by clicking on them and extract information about each node by mousing over it.
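On a local machine with RevoTreeView installed, the interactive view can be launched roughly like this (it will not render on platforms without browser access):

```r
library(RevoTreeView)

# Build the interactive tree object and open it in a browser window.
plot(createTreeView(regTreeOut))
```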

Similar to the other modeling approaches we've seen, we can use rxPredict() to generate predictions based on the model object. To generate predictions, we need to specify the same arguments as before: modelObject, data, and outData. Since we will create a new dataset, we also need to make sure to write the model variables. And since we may generate more predictions later, let's give the new variable a more specific name: use the predVarNames argument to name the predicted values default_RegPred. Go ahead and try this. Once you have created the variables, be sure to get information on your new variables by running rxGetInfo() on the new dataset.
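A sketch of the scoring step, assuming the predictions are written to a new XDF file (the output file name here is illustrative):

```r
# Score the data with the fitted tree, writing the model variables alongside
# the predictions so the new dataset is self-contained.
mortPred <- rxPredict(modelObject = regTreeOut,
                      data = mortData,
                      outData = "mortDataPred.xdf",
                      writeModelVars = TRUE,
                      predVarNames = "default_RegPred")

# Inspect the new dataset, including the default_RegPred variable.
rxGetInfo(mortPred, getVarInfo = TRUE)
```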

Another useful visualization for this kind of model is the Receiver Operating Characteristic (ROC) curve. This curve plots the "hit" (or true positive) rate as a function of the false positive rate across different classification thresholds. A good model will have a curve that deviates strongly from the identity line y = x. One summary measure based on this curve is the area under the curve (AUC), which ranges between 0 and 1, with 0.5 corresponding to chance. We can compute and display an ROC curve for our tree using rxRocCurve().
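Assuming the predicted values were written to a dataset that contains both the actual default variable and the default_RegPred predictions (the dataset name here is illustrative), the call might look like:

```r
# Plot the ROC curve: actual outcomes vs. the tree's predicted values.
rxRocCurve(actualVarName = "default",
           predVarNames = "default_RegPred",
           data = mortPred)
```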

If we want to create a classification tree rather than a regression tree, we simply need to convert our dependent measure into a factor. We can do this using the transforms argument within the call to rxDTree() itself.
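A sketch of that approach, using transforms to create a factor version of the dependent variable on the fly (the variable names are assumptions, as before):

```r
# Converting the outcome to a factor makes rxDTree fit a classification tree.
classTreeOut <- rxDTree(fDefault ~ creditScore + ccDebt + yearsEmploy + houseAge,
                        data = mortData,
                        rowSelection = (year == 2000),
                        maxdepth = 5,
                        transforms = list(fDefault = factor(default)))
```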