Get startedGet started for free

More summary functions

1. More Summary Functions

You've learned several functions to help you explore numeric data. Now it's time to add a few more.

2. Correlation

So far, we've summarized individual columns. But sometimes we want to understand the relationship between two columns. Correlation is one measure of the relationship between two variables. A correlation coefficient can range from 1 to -1, with larger values indicating a stronger positive relationship, and more negative values indicating a stronger negative relationship.

3. Correlation function

The corr function takes the names of two columns as arguments and returns the correlation between them. Rows with a null value in either column are excluded.

4. Median

Another common summary measure is the median. The median is the 50th percentile, or midpoint, in a sorted list of values.

5. Percentile functions

To get the median, use a percentile function. The syntax for the percentile functions is different than for other functions you've seen because the data must be ordered to do the computation. It's called ordered-set aggregate syntax. The only argument to the function is a number between 0 and 1 corresponding to the percentile you want. You then type "within group", and then, inside parentheses, order by and the name of the column you want to compute the percentile for. percentile d-i-s-c, or discrete, always returns a value that exists in the column. percentile c-o-n-t, or continuous, interpolates between values around the specified percentile. It can return a value that is not in the original data.

6. Percentile examples

Here's an example. We have four numbers: 1, 3, 4, and 5. The two percentile functions return different values for the median. The discrete percentile function returns 3, while the continuous percentile function interpolates between 3 and 4, to return 3-point-5. The formula used to compute percentiles is fairly complex, and sometimes the results may not be intuitive. In particular, you may be used to computing the median of an even number of values as the average of the two middle values. Be aware that these functions may not always return that value as the 50th percentile.

7. Common issues

Before you practice using these functions, there are a few common issues with numeric values that you should be on the lookout for. First, error codes. Sometimes certain values, such as 9, 99, or -99, may have special meaning and not be true data values. Check any documentation for your database, and be suspicious of values that seem out of place. Such special codes might also denote missing values. There are also additional special values, usually combinations of the letters N and A, that are used to denote missing values in other programs. It's also good to check to make sure that 0 really means 0 in the data and not missing. Beyond error or missing value codes, you also want to check for extreme outlier values. These may indicate data entry errors or other problems with the data. Finally, just because data is in a numeric column type, doesn't mean it should be treated as a number. Zip codes may be stored in a numeric column, but it doesn't make sense to take the average or variance of zip codes. Numeric values are also sometimes used to encode multiple choice responses to survey questions, even though the answer choices might not correspond to a numeric scale.

8. Let's practice!

With those common issues in mind, let's practice!