1. Combining data tables
Hello! I'm Lis and I'll be one of your instructor in this course on how to effectively bring data into Tableau.
2. Combining data
Combining data tables is a first common step in data analysis as it rarely comes in the exact form you need. Instead, usually you have multiple data files, whether it's CSVs or Excel sheets, that need to be merged. Rather than writing code to do this, Tableau provides a drag and drop experience to combine files.
Let's review two common ways to combine data: Unions and Joins.
3. Unions
Consider unions as a way to stack data vertically. A common business use case for unions is when data is generated by the same system, but stored separately and broken down by time period.
4. Union example
For example, the Divvy Bike data, used in some of our other courses, is offered to the public in quarterly files. In many companies, the volume of data is so large that it is stored by week or by month. There comes a time when a full year perspective is required. Simply union the data tables together for a larger view. The data should contain the same structure, meaning same number of columns preferably in the same order containing the same data types.
5. Joins
Alternatively, if you need to combine tables where there's a relationship between the tables, a join may be better. Traditionally, data tables are structured using a relational strategy to minimized redundancy and have unique identifiers, often called keys or IDs, to link tables together.
6. Joins example
Here's an example with two tables. The first is a employees table where we have the employees' names, department, and email of a fictional company. It also has a column for location ID to represent which office the employee belongs to. Note that Bob in the last row is remote so there is no ID. In the second table, Offices, there is a row for each office location with the corresponding ID and the physical address. In table 2, the location ID is unique in each row, because we only want one record for each office. While in the employees table, a location id can show up multiple times and should because there is more than one employee at each location.
To get the address each employee works at, we would opt for a join using Location ID as the relationship between the tables.
7. Joins example
This is the result of a join defined by location id. The correct addresses have been appended to the Employees table. In the case of the remote employee Bob, we get a Null address because there is no corresponding row in the Offices table.
8. Types of Joins
There are different ways you can join tables.
There's an inner join where only matched rows are returned.
A left join returns all the matched rows and all the other rows in the left table that don't have a match.
The right join is the similar but with the matched rows and all the unmatched rows from the right table.
And finally, there's a full outer join that returns all the matched rows and all unmatched rows in both tables.
9. Back to our example
Back to our example. which type of join was used? Take a pause and think.
It's a left join. All the rows on the left table, Employees, are in the result. Not all the rows on the right table, Offices, are in the result. That's because none of the employees in the Employee tables are located in Location ID 3, London.
Inner join is a close guess! However, it can't be an inner join, because Bob's row is in the table and he did not have a matching office row, thus returning null.
10. Superstore dataset
We are going to apply these concepts on a dataset from a fictional company called Superstore. There are a total of 7 tables. The first five are annual datasets from 2016 to 2020 on the orders placed. This includes columns like order id, what was bought, who bought it, who sold it, where, when, etc. The next table is Returns with the ids of orders that were returned. Lastly, there's Sales Reps with the name of the sales representatives for each region.
11. Let's practice!
Let's practice!