1. Joining HR data
In the examples so far, you worked with a single dataset that contained all the data you needed. Unfortunately, HR data is rarely that convenient.
2. HR data systems
Basic employee data, such as tenure and demographic information, is not typically stored in the same location as other HR data. Benefits data, performance data, and recruiting data each serve a specific purpose, and they are generated and stored in ways that suit that purpose rather than in ways that make data analysis easier. Data from other parts of the business, such as financial data, can have similar issues.
3. Joining HR data
To analyze data from different sources, you'll have to join them into a single data frame. As you've seen in the previous chapters, analyzing HR data is much easier in that form: each row represents one employee, and the columns are different characteristics of that employee.
In this course, you'll use dplyr's left_join() to join datasets together. You use left_join() to add data from one data frame, such as an extract of compensation data, to another data frame, such as the base HR data.
The first argument in left_join() is the data frame you want to add additional data to, such as a list of all employees from the base HR system. The second argument is the data frame from which you wish to bring in additional data. The third argument, "by", is where you specify the column with the unique employee identifier. This column, called the "key", is how left_join() knows which rows of the two data frames refer to the same employees.
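As a minimal sketch of that call, here is what the three arguments might look like in practice. The data frames, the column names (employee_id, department, salary), and all values are invented for illustration and are not the course's data.

```r
library(dplyr)

# Hypothetical base HR data: one row per employee
hr_data <- data.frame(
  employee_id = c(2, 3, 4),
  department  = c("Sales", "Finance", "Engineering")
)

# Hypothetical compensation extract from another system
compensation_data <- data.frame(
  employee_id = c(2, 3, 4),
  salary      = c(52000, 61000, 75000)
)

# First argument: the data frame to add data to
# Second argument: the data frame to bring data in from
# by: the key column that identifies the same employee in both
hr_joined <- left_join(hr_data, compensation_data, by = "employee_id")

print(hr_joined)
```

The result keeps one row per employee from hr_data and gains the salary column from compensation_data.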
4. Using left_join()
What happens when one table includes data on employees that are not found in the other? This can happen if an employee didn't receive a bonus, or if an employee who received a bonus has since left the company.
In this example, employee 4 is in hr_data, but not in bonus_pay_data. The final data frame retains employee 4, but since there is no bonus information to add, the bonus value in that row is NA. On the other hand, employee 1 is in bonus_pay_data but not in hr_data, so employee 1 does not appear in the joined data frame. When you use left_join(), any employees in the second data frame that don't match employees in the first data frame are dropped, as you can see in the result of the join.
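The behavior described here can be reproduced with a small sketch. The data frame names hr_data and bonus_pay_data come from the example, but the column names and values below are assumptions made up for illustration.

```r
library(dplyr)

# Hypothetical contents: employee 4 has no bonus record,
# and employee 1 has a bonus record but is not in hr_data
hr_data <- data.frame(
  employee_id = c(2, 3, 4),
  department  = c("Sales", "Finance", "Engineering")
)
bonus_pay_data <- data.frame(
  employee_id = c(1, 2, 3),
  bonus       = c(1000, 1500, 2000)
)

joined_data <- left_join(hr_data, bonus_pay_data, by = "employee_id")

# Employee 4 is kept, with NA in the bonus column;
# employee 1 does not appear because it has no match in hr_data
print(joined_data)
```

A quick way to see how many employees had no match is to count the missing values, for example with sum(is.na(joined_data$bonus)).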
5. Other joins
dplyr has other join functions that can join data frames in different ways if you need to keep all the employees. You can learn about them in the DataCamp course about joining data in R with dplyr.
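For a rough idea of how those other joins differ, the sketch below applies dplyr's full_join(), inner_join(), and anti_join() to the same kind of data. The column names and values are invented for illustration.

```r
library(dplyr)

# Hypothetical data: employee 4 has no bonus, employee 1 is not in hr_data
hr_data <- data.frame(
  employee_id = c(2, 3, 4),
  department  = c("Sales", "Finance", "Engineering")
)
bonus_pay_data <- data.frame(
  employee_id = c(1, 2, 3),
  bonus       = c(1000, 1500, 2000)
)

# full_join() keeps every employee found in either data frame
all_employees <- full_join(hr_data, bonus_pay_data, by = "employee_id")

# inner_join() keeps only employees found in both data frames
matched_only <- inner_join(hr_data, bonus_pay_data, by = "employee_id")

# anti_join() returns rows of the first data frame with no match in the second,
# here the bonus records that have no employee in hr_data
unmatched_bonus <- anti_join(bonus_pay_data, hr_data, by = "employee_id")

print(all_employees)
print(matched_only)
print(unmatched_bonus)
```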
6. Choosing a key for joining
Here is a quick note about the key when joining employee datasets. It is safest to use employee ID, or some other field that uniquely identifies an employee. Using a number is safer than using the employee name because names are not always unique, and employees do not always keep the same name over time.
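One way to check a candidate key before joining is to look for duplicates. This is only a sketch: the column names employee_id and employee_name and the values are hypothetical.

```r
library(dplyr)

# Hypothetical data: two different employees share the same name
hr_data <- data.frame(
  employee_id   = c(1, 2, 3),
  employee_name = c("Alex Kim", "Alex Kim", "Sam Lee")
)

# anyDuplicated() returns 0 when every value is unique
id_is_unique   <- anyDuplicated(hr_data$employee_id) == 0
name_is_unique <- anyDuplicated(hr_data$employee_name) == 0

# List the names that occur more than once
duplicate_names <- hr_data %>%
  count(employee_name) %>%
  filter(n > 1)

print(id_is_unique)
print(name_is_unique)
print(duplicate_names)
```

Here the ID passes the check and the name does not, which is the reason to prefer the ID as the key.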
7. Let's practice!
Now it's time to analyze data from multiple sources.