1. Numeric Data Types and Summary Functions
In this chapter, we'll focus on numeric data. This includes both columns, or variables, that only take on integer whole number values and variables with decimal values.
2. Numeric types: integer
There are 10 different numeric data types. First, the integer types. The base type is called integer, which can be shortened to int. It allows whole numbers between approximately negative 2 billion and positive 2 billion.
3. Numeric types: integer
There are also smallint and bigint types that have different ranges.
4. Numeric types: integer
Serial types are used for integer columns that autoincrement. They are used to generate ID columns for data that doesn't already contain a unique identifier.
5. Numeric types: decimal
There are also 3 decimal types with different levels of precision. Decimal and numeric are two names for the same type. They can store numbers with very high precision. Precision refers to the number of digits in a number.
6. Numeric types: decimal
Real and double precision types can store numbers with less precision, meaning fewer digits in the number. One way in which column types matter is that functions and operators can work differently for different types of data.
7. Division
The most notable example is division. When you divide integers, the result is truncated to also be an integer. So integer 10 divided by integer 4 returns integer value 2. But integer 10 divided by numeric 4-point-0 returns 2-point-5. Now that we've covered the different data types, how do we start exploring numeric data?
8. Range: min and max
It's always good to check the range and summary statistics of the values in a column. Get the range with the min and max functions, which return the minimum and maximum values of their input respectively. Here, we take the min and max of the question_pct column in the stackoverflow table. The column tells us the proportion of total questions for a day with the specified tag.
9. Average or mean
The a-v-g function gives us the average or mean of values in a column.
10. Variance
Variance is a statistical measure of the amount of dispersion in a set of values. It tells you how far spread values are from their mean. Larger values indicate greater dispersion. Variance can be computed for a sample of data or for the population.
The formula is the same except that population variance divides by the number of values, while the sample variance divides by the number of values minus one.
The var_pop function computes population variance. The var_samp function computes sample variance. The sample variance will always be slightly larger than the population variance. The variance function is an alias for var_samp.
11. Standard deviation
Standard deviation is another measure of variance. It is the square root of the variance. Like variance, there are also functions for both sample and population versions of standard deviation.
12. Round
Functions can return results with many decimal places. To make results easier to read, use the round function to round a value of numeric type to a specified number of decimal places. The round function takes a numeric value or column as the first argument, and the number of decimal places to keep as the second argument.
13. Summarize by group
In addition to computing summary measures for entire columns, it's also good practice to summarize variables by groups in the data.
For example, in addition to summarizing the question_pct column in the stackoverflow table overall, we also want to compute summary measures for each tag. The output here is truncated. The numbers with an e in them are in scientific notation.
14. Let's work with numbers!
There are additional functions available, but these core functions are enough to get started. Now it's your turn to practice exploring numeric variables in the database.