1. Exploring distributions
Understanding the distribution of a variable is crucial for finding errors, outliers, and other anomalies in the data.
2. Count values
For columns with a small number of discrete values, we can view the distribution by counting the number of observations with each distinct value.
We group by, and order the results by, the column of interest. There are 20 distinct values in the unanswered_count column in the stackoverflow data with the tag amazon-ebs.
Only partial results are shown here. Twenty values are manageable to examine, but when the variable you're interested in takes on many different values, binning or grouping the values can make the output more useful.
3. Truncate
One way to do this is with the trunc function. Trunc is short for truncate. The trunc function reduces the precision of a number. This means replacing the smallest numeric places - the right-most digits - with zeros.
Truncating is not the same as rounding: you'll never get a result with a larger absolute value than the original number. Trunc takes two arguments: the value to truncate and the number of places to truncate it to.
Positive values for the second argument indicate the number of digits after the decimal to keep. For example, truncating 42-point-1256 to 2 places keeps only the first two digits after the decimal.
Negative values for the second argument indicate places before the decimal to replace with zero. For example, truncating 12,345 to -3 replaces the three digits to the left of the decimal with zero.
4. Truncating and grouping
We can use the trunc function to group values in the unanswered_count column into three groups based on the digit in the tens place of the number. Note that the second argument to the trunc function here is a -1. There are 74 values between 30 and 39.
5. Generate series
What if you want to group values by a quantity other than the place value of a number, such as by units of 5 or 20?
The generate_series function can help. It generates a series of numbers from a starting value to an ending value, inclusive, by steps of a third value.
6. Generate series
For example, we can generate a series from 1 to 10 by steps of 2, or a series from 0 to 1 by steps of 1/10th.
7. Create bins: output
generate_series can be used to group values into bins. Here's an example of what we want to create: a series of lower and upper values, and the count of the number of observations falling in each bin.
8. Create bins: query
Let's build the query to create that output. A WITH clause allows us to alias the results of a subquery to use later in the query. Here, we generate two series: one for the lower bounds of the bins and another for the upper. We name this "bins."
9. Create bins: query
Because we're only summarizing data for tag amazon-ebs, we also create that subset of the stackoverflow table and call it ebs.
10. Create bins: query
Then write the main select query to join the results of the subqueries we created and count the values. We join ebs to bins where the column unanswered_count is greater than or equal to the lower bound and strictly less than the upper bound. A left join keeps all bins in the result, even those with no values in them.
11. Create bins: query
Finally, group by the lower and upper bin values to count the values in each bin.
12. Create bins: output
Each row in the output has the count of days where the number of unanswered questions was greater than or equal to the lower bound and strictly less than the upper bound.
Note that the result contains bins with 0 values. This is because we counted non-null values of unanswered_count instead of just the number of rows.
13. Time to explore some distributions!
Alright, have fun exploring distributions!