1. Exploring our data
Last lesson, we learned how E:R diagrams help us understand the structure of a database. But structure is only half of the equation; we also want to understand the actual data. This is where DATA EXPLORATION comes in.
2. Exploring with the console
The simplest way to explore data is to look at a preview, such as in a SQL client or in the DataCamp console. While this quickly answers basic questions about what each field means, there are limitations to this approach.
3. Exploring with the console
First, since it’s only a preview, it may not give an accurate picture of the dataset. For example, when looking at the summer_games table in the explorer, you may only see NULL values for the bronze field. While it is useful to know that NULL values exist, the preview tells you nothing about the non-null values in that field.
4. Exploring with the console
Second, previews do not provide any information on the distribution of values. For example, you may work for a company that operates primarily in Europe. If the first few rows of the preview happen to show North American countries, you may incorrectly conclude that the company primarily serves North American clients.
5. Exploring with queries
A more robust approach to exploring the data is to run simple queries. By SELECTing the distinct values in the region field, we learn that the field is formatted in all uppercase. We also gain a better understanding of what the region field represents once we see its values.
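A minimal sketch of this kind of query is shown below; the countries table name is an assumption, since only the region field is named in this lesson.

    SELECT DISTINCT region   -- one row per unique value in the region field
    FROM countries           -- assumed table name
    ORDER BY region;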
6. Exploring with queries
You can also use a GROUP BY instead of DISTINCT to produce the same report. This query may perform better on large tables, but our dataset is not large enough for performance to matter. Regardless, it’s an important concept to know about.
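Here is the equivalent GROUP BY version, again assuming the same hypothetical countries table:

    SELECT region
    FROM countries        -- assumed table name
    GROUP BY region       -- grouping collapses duplicates, like DISTINCT
    ORDER BY region;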
7. Field-level aggregations
Another exploration technique is to include an aggregated metric and sort the data to identify value distributions. By adding COUNT(*) to our query and ordering by it, we can see which values are most common.
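One way to write that, still assuming the hypothetical countries table, would be:

    SELECT
        region,
        COUNT(*) AS row_count     -- how many rows fall into each region
    FROM countries                -- assumed table name
    GROUP BY region
    ORDER BY row_count DESC;      -- most common values first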
8. Field-level aggregations
The same technique can be used with other metrics, such as sum, max, min, and average. Sum is useful when looking at revenue sources in a business. If 99% of your company’s revenue comes from two sources, then you may want to focus on these sources when building reports.
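As an illustration, a revenue-by-source breakdown might look like the following; the orders table and its channel and revenue columns are purely hypothetical names for this sketch:

    SELECT
        channel,                          -- hypothetical revenue-source column
        SUM(revenue) AS total_revenue
    FROM orders                           -- hypothetical table
    GROUP BY channel
    ORDER BY total_revenue DESC;          -- largest revenue sources first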
9. Table-level aggregations
You can also leverage aggregations on the table level. A simple COUNT(*) shows how large a table is, which may impact how you construct your query as well as your understanding of what the table represents. If you expected only a few rows, but the table has several thousand rows, the table may have a different meaning than you expected.
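A table-level check like the one below is often worth running first; the orders table is again a hypothetical name:

    SELECT COUNT(*) AS total_rows   -- how many rows the table holds
    FROM orders;                    -- hypothetical table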
10. Query validation
These data exploration techniques can also be used to validate queries. Remember that an INNER JOIN only keeps rows with matching values in both tables. Because of this, it’s possible to lose rows after a join.
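A quick way to check for lost rows is to compare row counts before and after the join. The orders and countries tables and their key columns are assumed names for this sketch:

    -- Row count of the base table
    SELECT COUNT(*) FROM orders;

    -- Row count after the join; a smaller number means the INNER JOIN dropped rows
    SELECT COUNT(*)
    FROM orders AS o
    INNER JOIN countries AS c
        ON o.country_id = c.id;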
11. Query validation
A useful approach is to turn your query into a subquery and run an aggregation on it. By comparing total revenue in the original table to the total revenue returned by our query, we can identify any inconsistencies in our data. In this example, our query loses revenue because some rows are excluded by the JOIN.
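A sketch of that comparison, using the same hypothetical table and column names as above:

    -- Total revenue in the original table
    SELECT SUM(revenue) AS total_revenue
    FROM orders;

    -- Total revenue returned by our query, wrapped as a subquery
    SELECT SUM(revenue) AS total_revenue
    FROM (
        SELECT o.country_id, o.revenue
        FROM orders AS o
        INNER JOIN countries AS c
            ON o.country_id = c.id
    ) AS subquery;

If the two totals match, no revenue was lost in the join.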
12. Query validation
The same approach applies when testing for duplication. In this example, some rows appear multiple times after the join, artificially inflating total revenue. Validating your query is a crucial step to ensure your report is reliable.
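Duplication can be spotted the same way: if the joined row count below exceeds the base table's row count, the join is fanning out rows and inflating any summed metric. The country_stats table is another hypothetical name:

    -- Compare this to the base table's COUNT(*);
    -- a larger number means the join duplicates rows
    SELECT COUNT(*) AS joined_rows
    FROM orders AS o
    INNER JOIN country_stats AS cs
        ON o.country_id = cs.country_id;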
13. Let's explore!
Let’s practice exploring our dataset, then finish up the chapter by creating our first report in the dashboard.