Introduction and overview

1. Introduction to HR analytics

Hello and welcome to the "HR analytics in Python" course. My name is Hrant Davtyan. I am a Business Analyst teaching Data Science and providing consultancy related to statistics. Among all business domains, HR is still the least disrupted. However, the latest developments in data collection and analysis tools and technologies allow for data-driven decision-making in all areas, including HR. As a consequence, HR analytics is a growing field, and I believe now is the right time to tap into it.

2. What is HR analytics?

HR analytics, also known as People analytics, is nothing more than a data-driven approach to managing people at work.

3. Problems addressed by HR analytics

There are many problems in HR that can be addressed using a data-driven approach. Among those are decisions related to employee hiring and retention, performance evaluation, collaboration, and more. In this course, we will concentrate on predicting employee turnover, which is related to the first two items on this list: hiring and retention.

4. Employee turnover

Employee turnover, also known as employee attrition or employee churn, is the process of employees leaving the company. When skilled employees leave, it can be very costly for the company, so firms are interested in predicting turnover beforehand. With that information in hand, companies can change their strategy to retain good workers or start hiring new employees in time.

5. Course structure

In this course, we will use a sample employee dataset, with variables that describe employees in the company, to predict their turnover and understand which features affect it most. The 1st chapter will concentrate on descriptive analytics, where we will transform the dataset and make it ready for developing the predictive model. In the 2nd chapter, we will develop an initial model that will then be tuned and improved in the 3rd chapter. The final chapter will introduce techniques for selecting the best model for decision-making.

6. The dataset

So let's start by taking a quick look at our dataset. The data is provided in CSV format and is located in the working directory. This means we can use the read_csv() function from the pandas library to read it. Once the dataset is read into a new pandas DataFrame called "data", we can use the info() method to get some information on it. As you can see from the output, we have 10 columns and almost 15000 entries, which means the DataFrame includes data on almost 15000 employees across 10 different variables. Among those 10, only 2 have the type object, while the others are either float or int. The float and int types mean that those variables are numeric and can be used directly in mathematical and statistical computations, while the object-type columns hold categorical variables, which need to be transformed first before moving on. Therefore, let's take a quick look at our dataset to see what it looks like and what those 2 categorical variables are.
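As a minimal sketch of this step, the snippet below reads CSV data into a DataFrame called "data" and calls info() on it. The course file itself is not reproduced here, so a tiny made-up sample is read from an in-memory buffer instead; the "satisfaction" column name and all values are assumptions for illustration, and only "department" and "salary" are names taken from the lesson.

```python
import io

import pandas as pd

# Tiny made-up sample standing in for the course CSV file.
# "satisfaction" is a hypothetical column; the values are illustrative only.
csv_text = """satisfaction,department,salary
0.38,sales,low
0.80,sales,medium
0.11,technical,medium
"""

# With a real file, the path would be passed to read_csv() instead of a buffer.
data = pd.read_csv(io.StringIO(csv_text))

# Print the column names, non-null counts, and the type of each column.
data.info()
```

On this sample, info() reports one numeric column and two text columns, which mirrors the numeric versus categorical split described above.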

7. The dataset

We can use the head() method to take a look at the first 5 rows of the DataFrame. As you can see, the last two columns are "department" and "salary", which give information about the department an employee works in and the salary they receive, respectively. Both of them describe some category of an employee (belonging to a particular department or salary group), which is the reason they are called categorical variables.
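To illustrate head(), here is a small sketch on a hand-built DataFrame rather than the course dataset. The six rows and the "satisfaction" column are made up for the example; only the "department" and "salary" column names come from the lesson.

```python
import pandas as pd

# Hypothetical six-row sample; values are illustrative only.
data = pd.DataFrame({
    "satisfaction": [0.38, 0.80, 0.11, 0.72, 0.37, 0.41],
    "department": ["sales", "sales", "technical", "support", "IT", "sales"],
    "salary": ["low", "medium", "medium", "low", "high", "low"],
})

# head() with no argument returns the first 5 rows of the DataFrame.
first_rows = data.head()
print(first_rows)
```

Passing a number, for example data.head(3), returns that many rows instead of the default 5.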

8. Unique values

To understand which values those columns take, we first select the relevant column and then use a method called unique() to print only the unique values in that column.
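A short sketch of that selection, again on a made-up DataFrame: the category values below are assumptions for illustration and are not claimed to match the course dataset.

```python
import pandas as pd

# Hypothetical sample with the two categorical columns named in the lesson.
data = pd.DataFrame({
    "department": ["sales", "sales", "technical", "support", "sales"],
    "salary": ["low", "medium", "medium", "low", "high"],
})

# Select one column by name, then keep only its distinct values.
# unique() returns the values in order of first appearance.
departments = data["department"].unique()
salaries = data["salary"].unique()

print(departments)
print(salaries)
```

Because unique() works on a single column, it is called once per column of interest.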

9. Let's practice!

Now it's your turn to practice.