Get startedGet started for free

Numeric data types and summary functions

1. Numeric Data Types and Summary Functions

In this chapter, we'll focus on numeric data. This includes both columns, or variables, that only take on integer whole number values and variables with decimal values.

2. Numeric types: integer

There are 10 different numeric data types. First, the integer types. The base type is called integer, which can be shortened to int. It allows whole numbers between approximately negative 2 billion and positive 2 billion.

3. Numeric types: integer

There are also smallint and bigint types that have different ranges.

4. Numeric types: integer

Serial types are used for integer columns that autoincrement. They are used to generate ID columns for data that doesn't already contain a unique identifier.

5. Numeric types: decimal

There are also 3 decimal types with different levels of precision. Decimal and numeric are two names for the same type. They can store numbers with very high precision. Precision refers to the number of digits in a number.

6. Numeric types: decimal

Real and double precision types can store numbers with less precision, meaning fewer digits in the number. One way in which column types matter is that functions and operators can work differently for different types of data.

7. Division

The most notable example is division. When you divide integers, the result is truncated to also be an integer. So integer 10 divided by integer 4 returns integer value 2. But integer 10 divided by numeric 4-point-0 returns 2-point-5. Now that we've covered the different data types, how do we start exploring numeric data?

8. Range: min and max

It's always good to check the range and summary statistics of the values in a column. Get the range with the min and max functions, which return the minimum and maximum values of their input respectively. Here, we take the min and max of the question_pct column in the stackoverflow table. The column tells us the proportion of total questions for a day with the specified tag.

9. Average or mean

The a-v-g function gives us the average or mean of values in a column.

10. Variance

Variance is a statistical measure of the amount of dispersion in a set of values. It tells you how far spread values are from their mean. Larger values indicate greater dispersion. Variance can be computed for a sample of data or for the population. The formula is the same except that population variance divides by the number of values, while the sample variance divides by the number of values minus one. The var_pop function computes population variance. The var_samp function computes sample variance. The sample variance will always be slightly larger than the population variance. The variance function is an alias for var_samp.

11. Standard deviation

Standard deviation is another measure of variance. It is the square root of the variance. Like variance, there are also functions for both sample and population versions of standard deviation.

12. Round

Functions can return results with many decimal places. To make results easier to read, use the round function to round a value of numeric type to a specified number of decimal places. The round function takes a numeric value or column as the first argument, and the number of decimal places to keep as the second argument.

13. Summarize by group

In addition to computing summary measures for entire columns, it's also good practice to summarize variables by groups in the data. For example, in addition to summarizing the question_pct column in the stackoverflow table overall, we also want to compute summary measures for each tag. The output here is truncated. The numbers with an e in them are in scientific notation.

14. Let's work with numbers!

There are additional functions available, but these core functions are enough to get started. Now it's your turn to practice exploring numeric variables in the database.