1. Summary statistics
Welcome back!
2. Exploring data
Data exploration is a crucial step in the data analysis process.
It is used to characterize the data to better understand the values it contains, and identify potential data quality issues.
3. Summary statistics
Summary statistics are a vital tool in data exploration, as they allow us to describe our dataset at a macroscopic level.
There are three main groups of summary statistics: measures of frequency, which measure the counts of values; measures of center, which aim to describe what a typical value is; and measures of spread, which describe how spread out our values are.
We'll focus on measures of frequency and center.
4. Measures of frequency
The most common measure of frequency is the count, and in spreadsheets, this refers to the number of cells. There are several built-in functions for counting cells, so let's take a closer look at each of them using an example spreadsheet column containing five cells.
5. COUNT()
The COUNT function counts the number of cells in a range that contain numerical data, which also includes dates and currencies. In our example, the COUNT function will only count cells
6. COUNT()
A1, A2, and A5, as A3 contains text, and A4 is blank.
Passing this cell range as an argument to the COUNT function, returns three.
7. COUNTA()
The COUNTA function counts cells containing any data type, which also includes empty strings and errors.
In our example,
8. COUNTA()
every cell apart from A4 will be counted, as this cell is blank.
Calling COUNTA on this range, returns four.
9. COUNTBLANK()
The final counting function we'll cover is COUNTBLANK, which counts the number of blank cells in a range, including empty cells and cells containing empty strings.
Only cell A4 meets these criteria, so calling COUNTBLANK on the range returns one.
10. Measures of center
Measures of center aim to describe a typical value. The mean and median are the most common measures of center for numerical data.
The mean, sometimes referred to as the average, is defined as the sum of the values divided by how many there are.
The median is found by sorting the values and selecting the middle number. It's typically used when there are outliers present, which would disproportionately skew a mean calculation.
Let's calculate these summary statistics on a column of ages.
11. Mean
We can use the SUM function to calculate the sum of values, which returns 219.
As we've seen previously, we can use the COUNT function to find the count of numerical data. We can combine this with the SUM function using the division arithmetic operator,
and return a mean of 43-point-8.
Fortunately, spreadsheets provides a built-in function for calculating the mean.
12. AVERAGE()
Enter AVERAGE. If we pass the cell range to the AVERAGE function, we achieve the same result, without having to perform any arithmetic!
13. Median
To calculate the median,
14. Median
the list must be sorted. The median is the middle value,
15. MEDIAN()
found here.
This is easy enough when we have five rows, but with thousands of rows, we'll need some help.
The MEDIAN function calculates the median for any range of values, and returns the same result we found by hand.
16. Identifying data quality issues
Another key part of data exploration is identifying potential data quality issues.
This can be done using many of the functions we've already seen.
The output of COUNTBLANK can be compared to COUNT or COUNTA to determine if the amount of missing values is an issue.
Erroneous data is usually identified with a combination of subject matter expertise and data exploration.
The MAX and MIN functions return the maximum and minimum values in a range. If we're exploring age data, and find a minimum of negative-one and a maximum of 250, chances are we have an issue.
17. Let's practice!
Time to start exploring!