Functions for Summarizing Data

Use the rxGetInfo(), rxSummary(), and rxHistogram() functions to summarize the flight data.

  • rxGetInfo() provides information about the dataset as a whole: the number of rows and variables, the type of compression used, and so on. If you set the getVarInfo argument to TRUE, it also returns a brief synopsis of each variable. If you specify the numRows argument, it also returns the first numRows rows.

  • rxSummary() provides summaries for individual variables within a dataset.

  • rxHistogram() will provide a histogram of a single variable.

Instructions

First, we will use rxGetInfo() to extract metadata about the dataset. The syntax of rxGetInfo() is:

  • rxGetInfo(data, getVarInfo, numRows)
    • data corresponds to the dataset that we would like to get information for.
    • getVarInfo is a logical argument that determines whether metadata about the variables is also returned (default = FALSE).
    • numRows determines how many rows of the dataset are returned (default = 0).

Use rxGetInfo() to summarize the myAirlineXdf dataset we created in the previous exercise. In this function call, obtain information on all the variables, and return the first ten rows of data.
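
The call described above might look roughly like the following. This is a minimal sketch, assuming the RevoScaleR package is installed and that myAirlineXdf still exists in the session from the previous exercise.

```r
# Assumption: RevoScaleR is installed and myAirlineXdf was created
# in the previous exercise.
library(RevoScaleR)

# getVarInfo = TRUE adds per-variable metadata to the output;
# numRows = 10 also returns the first ten rows of the data.
rxGetInfo(data = myAirlineXdf, getVarInfo = TRUE, numRows = 10)
```

Leaving getVarInfo and numRows at their defaults (FALSE and 0) would return only the dataset-level information.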

Next, we will use rxSummary() to summarize a few variables in the dataset. The syntax of rxSummary() is:

  • rxSummary(formula, data)
    • formula specifies the variables that you want to summarize. They should appear on the right-hand side of the formula and, in most cases, are separated by a + symbol.
    • data corresponds to the dataset from which you would like to extract variables.

Use rxSummary() to summarize ActualElapsedTime, AirTime, DepDelay, and Distance.
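
One way to write this summary is with a one-sided formula. The sketch below assumes RevoScaleR is loaded and that myAirlineXdf contains all four variables under exactly these names.

```r
# Assumption: RevoScaleR is installed and myAirlineXdf was created
# in the previous exercise.
library(RevoScaleR)

# One-sided formula: nothing on the left of ~, and the variables to
# summarize on the right-hand side, joined by +.
rxSummary(formula = ~ ActualElapsedTime + AirTime + DepDelay + Distance,
          data = myAirlineXdf)
```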

Finally, we will use rxHistogram() to visualize the distribution of one of our variables. The syntax of rxHistogram() is:

  • rxHistogram(formula, data, …)
    • formula specifies the variable that you want to visualize. In the simplest case, it will take the form of ~ variable.
    • data corresponds to the dataset from which you would like to extract variables.
    • … is used to represent additional arguments that govern the appearance of the figure.

Use rxHistogram() to obtain a histogram for the variable DepDelay.
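
In its simplest form, the histogram call needs only the formula and the dataset. A minimal sketch, again assuming RevoScaleR is loaded and myAirlineXdf is available:

```r
# Assumption: RevoScaleR is installed and myAirlineXdf was created
# in the previous exercise.
library(RevoScaleR)

# The formula ~ DepDelay selects the single variable to plot.
rxHistogram(formula = ~ DepDelay, data = myAirlineXdf)
```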

Build a second histogram in which the x-axis is limited to departure delays between -100 and 400 minutes. In this histogram, bin the data so that there is one bin per minute of delay. When the histogram is plotted, show only ten ticks on the x-axis.
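
A possible version of this second histogram is sketched below. The argument names xAxisMinMax, numBreaks, and xNumTicks do not appear in this exercise text; they are assumed here from the rxHistogram() documentation and should be checked against the installed version of RevoScaleR.

```r
# Assumption: RevoScaleR is installed and myAirlineXdf was created
# in the previous exercise.
library(RevoScaleR)

rxHistogram(formula = ~ DepDelay, data = myAirlineXdf,
            xAxisMinMax = c(-100, 400),  # limit the displayed x-axis range
            numBreaks = 500,             # 400 - (-100) = 500 minutes of range
            xNumTicks = 10)              # request ten ticks on the x-axis
```

The value 500 comes from the width of the displayed range in minutes. Whether this yields exactly one bin per minute depends on where the breaks start and end; if they default to the full range of the data, the startVal and endVal arguments may also be needed to pin the breaks to -100 and 400.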