
Introduction to logistic regression

1. Introduction to logistic regression

Now that you have split your data into training and testing sets, it's time to build a model. In this course, you will use logistic regression to predict the probability of employee turnover. So what is logistic regression?

2. What is logistic regression?

Logistic regression is a classification technique. The goal of classification is to identify the category to which a data point belongs. Logistic regression does this by giving you the probability that a data point belongs to a target class. In this case, the target class is turnover, so logistic regression will give you the probability of turnover. If you are familiar with linear regression, you can think of logistic regression as an extension of linear regression for the case when the dependent variable is categorical.

3. Understanding logistic regression

In logistic regression, the independent variables can either be continuous or categorical, while the dependent variable is binary.
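Concretely, logistic regression estimates the probability of the positive class as P(turnover = 1) = 1 / (1 + exp(-(b0 + b1*x1 + ... + bk*xk))), where x1 through xk are the independent variables and b0 through bk are the coefficients the model learns.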

4. Building a simple logistic regression model

You can use the glm() function with the family argument set to "binomial" to build a logistic regression model in R. Here we build a simple logistic regression model by regressing turnover on emp_age.
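As a rough sketch of what this call looks like (the names train_set and simple_model are placeholders, assuming your training data frame contains the turnover and emp_age columns):

# Fit a simple logistic regression of turnover on emp_age
simple_model <- glm(turnover ~ emp_age, family = "binomial", data = train_set)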

5. Understanding the output of simple logistic regression model

To see a detailed output of the model, you can call the summary() function on the model object. The coefficients table lists the independent variable and its significance with respect to the dependent variable. The statistical significance of the variable can be interpreted by looking at the significance codes. As per this table, emp_age is a highly significant variable.
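Continuing the sketch from above, with simple_model as the assumed model object:

# Show coefficient estimates, significance codes, and fit statistics
summary(simple_model)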

6. Removing variables

Often you will have multiple independent variables, as is the case with your current dataset, so you will want to predict turnover based on all of these variables rather than just one. Before we do this, let's remove some columns that are either irrelevant or add no new information. ID columns are irrelevant when trying to predict turnover. We derived the tenure variable from date_of_joining, last_working_date, and cutoff_date, so these three columns give us no information beyond what tenure already captures. median_compensation is directly related to level. We used mgr_age and emp_age to calculate age_diff. Department has only one possible value across the entire dataset, so it has no predictive power whatsoever, and finally, status is the same as turnover.

7. Removing variables

Let's remove all these variables from our training set.
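One possible way to do this, assuming dplyr is available and train_set is the training data frame; the ID column names emp_id and mgr_id are only illustrative, so adjust them to match your data:

# Drop irrelevant and redundant columns to create a smaller data frame
library(dplyr)
train_set_multi <- train_set %>%
  select(-emp_id, -mgr_id,
         -date_of_joining, -last_working_date, -cutoff_date,
         -median_compensation, -mgr_age, -emp_age,
         -department, -status)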

8. Building multiple logistic regression model

Now, to use all the independent variables in the dataset to build a multiple logistic regression model, you call the glm() function with the family argument set to "binomial" again and use the new, smaller data frame, train_set_multi. Note the change in formula. Here we use turnover tilde period. This tells the glm() function to regress turnover on all the other variables in the data frame.
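A minimal sketch of this call, assuming train_set_multi from the previous step:

# Regress turnover on all remaining variables in train_set_multi
multi_model <- glm(turnover ~ ., family = "binomial", data = train_set_multi)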

9. Understanding the output of multiple logistic regression model

To see a detailed output of the model, you can use the summary() function again. Note that we are showing only a subset of variables from the full output here. As you can see from the table, it looks like tenure and percent_hike are highly significant, whereas total_experience is not. Similarly, there are some other variables that are not significant in this model, which you will see in one of the following exercises.
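If you want just the coefficient table rather than the full printout, one option (assuming the model object is multi_model) is:

# Extract only the coefficient matrix (estimates, standard errors, z values, p-values)
coef(summary(multi_model))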

10. Let's practice!

It's time for you to build these models yourselves!