Descriptive statistics
1. Descriptive statistics
We can load and slice DataFrames, but if we want to extract answers from the data, we will need to manipulate these slices and calculate summary statistics on them.
2. Describe function
The describe function, which is imported with the DataFrames package, is used to summarize the data quickly. We pass the DataFrame into the function, which calculates a bunch of summary statistics for each column of data and returns these values as another DataFrame. Each row in the returned DataFrame corresponds to a column in the input DataFrame. The describe function calculates the minimum, mean, median, and maximum values of each column and shows the data type and the number of missing values in the column. This function is primarily designed to run on columns of numeric data. It isn't possible to calculate the mean of a column of strings. In the first row, we can see that the values for the mean and median are missing. The minimum and maximum values are the first and last values of the strings column when sorted alphabetically.
3. Summary statistics on columns
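To make the describe function discussed above concrete, here is a minimal sketch. The DataFrame `runs`, its column names, and its values are made up for illustration and are not the data shown in the video.

```julia
using DataFrames

# Hypothetical run log (assumed data, for illustration only)
runs = DataFrame(
    name = ["morning", "evening", "long run"],  # a column of strings
    distance = [5000.0, 3200.0, 12000.0],       # meters
    time = [28.0, 19.5, 75.0],                  # minutes
)

# describe returns a new DataFrame with one row per column of `runs`
stats = describe(runs)
println(stats)

# For the string column, min and max are the alphabetically first and last values
println(stats.min[1], " ... ", stats.max[1])
```

The string column is placed first here so that its row appears at the top of the summary, mirroring the narration.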
Since the columns of the DataFrame are just arrays, we can use everything we know about arrays to process them. We can use functions from the Statistics package to calculate a column's mean, median, standard deviation, or variance. We just slice the column out and pass it into the function.
4. Other builtin summary functions
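The column statistics described above could look like the following sketch. The `runs` DataFrame and its values are assumptions chosen so the results are easy to check by hand.

```julia
using DataFrames
using Statistics

# Hypothetical data (assumed for illustration)
runs = DataFrame(distance = [5000.0, 3000.0, 10000.0], time = [30.0, 20.0, 70.0])

# Slice a column out, then pass it to a Statistics function
mean_distance = mean(runs.distance)
median_distance = median(runs.distance)
std_time = std(runs.time)

# Indexing with a column name is another way to slice the column out
var_time = var(runs[:, "time"])

println((mean_distance, median_distance, std_time, var_time))
```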
Some extra functions we can use are the sum, minimum, and maximum functions. These functions are not part of the Statistics package. They are built into Julia, so we don't have to import them. These three functions all work on arrays and calculate exactly what their names say - the sum of the values and the minimum and maximum values in an array.
5. Column operations
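A short sketch of the three built-in functions applied to a column, again on made-up data. Note that only DataFrames is imported; sum, minimum, and maximum need no import.

```julia
using DataFrames

# Hypothetical data (assumed for illustration)
runs = DataFrame(distance = [5000.0, 3000.0, 10000.0])

total_distance = sum(runs.distance)   # sum of all values in the column
shortest = minimum(runs.distance)     # smallest value
longest = maximum(runs.distance)      # largest value

println((total_distance, shortest, longest))
```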
We can operate on columns just like we operate on arrays by adding, subtracting, multiplying, or dividing them. We can apply these operations between a column and a scalar or between two columns. This summary is almost the same as the summary we saw for array operations, except where we had arrays named a and b before, we now have columns named a and b.
6. Calculating run speed
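The column operations summarized above can be sketched as follows. The DataFrame `df` and its values are hypothetical, and the dotted (element-wise) operators are an assumption about the syntax shown on the slide.

```julia
using DataFrames

# Hypothetical DataFrame with columns named a and b
df = DataFrame(a = [1.0, 2.0, 3.0], b = [10.0, 20.0, 30.0])

# Column and scalar: the scalar is applied to every element
a_plus_two = df.a .+ 2
a_times_two = df.a .* 2

# Two columns: elements are paired up position by position
col_sum = df.a .+ df.b
col_diff = df.b .- df.a
col_prod = df.a .* df.b
col_ratio = df.b ./ df.a

println(col_sum)
```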
Let's say we wanted to calculate the average speed of our runs in kilometers per hour. The distance column of our DataFrame is measured in meters, and the time column is measured in minutes. We can create an array of the run distances converted to kilometers by dividing the distance column by one thousand. Then we can create an array of run times in hours by dividing the time column by sixty. These calculations return arrays.
7. Calculating run speed
We divide the distances by the times to calculate the speeds. We can even add these values to the existing DataFrame.
8. Column assignment
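The run speed calculation described above, written out as a sketch. The column names `distance` and `time` follow the narration; the values and the variable names `distance_km`, `time_hr`, and `speed` are assumptions.

```julia
using DataFrames

# Hypothetical runs: distance in meters, time in minutes
runs = DataFrame(distance = [5000.0, 3000.0, 10000.0], time = [30.0, 15.0, 60.0])

# Convert units; each calculation returns a plain array
distance_km = runs.distance ./ 1000   # meters to kilometers
time_hr = runs.time ./ 60             # minutes to hours

# Element-wise division gives the speed of each run in km/h
speed = distance_km ./ time_hr

println(speed)
```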
We can assign an array of values to a new column like so. We select the column from the DataFrame as if it already exists. Here, we set the name of the new column to "speed". Then we set this column equal to the array of values. Alternatively, we can assign the column using the dot-column-name syntax.
9. Column assignment
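Both ways of assigning the new column can be sketched like this. The data is made up, and the two copies `by_index` and `by_dot` are hypothetical names used only to show the two syntaxes side by side.

```julia
using DataFrames

# Hypothetical runs: distance in meters, time in minutes
runs = DataFrame(distance = [5000.0, 3000.0, 10000.0], time = [30.0, 15.0, 60.0])
speed = (runs.distance ./ 1000) ./ (runs.time ./ 60)

# Option 1: select the new column as if it already exists, then assign to it
by_index = copy(runs)
by_index[:, "speed"] = speed

# Option 2: the dot-column-name syntax
by_dot = copy(runs)
by_dot.speed = speed

# In both cases the new column is appended after the existing ones
println(names(by_index))
println(by_dot)
```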
When we print the DataFrame, we see that the new column has been added to the end. We can now perform operations on the new column like any other.
10. Let's practice!
Let's move on to our next assignments in the exercises.