
Keeping missing data

In some situations, the fact that an input is missing is important information in itself. NAs can be kept in a separate "missing" category using coarse classification.

Coarse classification allows you to simplify your data and improve the interpretability of your model. Coarse classification requires you to bin your responses into groups that contain ranges of values. You can use this binning technique to place all NAs in their own bin.
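As a rough illustration of the binning idea, the sketch below uses base R's cut() as a compact alternative to assigning each bin step by step. The vector emp_length_example and its values are made up for this sketch; the break points 15, 30 and 45 mirror those of the employment-length example. Missing values are moved into their own "Missing" level afterwards, because cut() leaves them as NA.

```r
# Hypothetical employment lengths, including two missing values
emp_length_example <- c(2, 15, 22, NA, 30, 41, 62, NA)

# cut() builds right-closed intervals by default:
# (-Inf, 15], (15, 30], (30, 45], (45, Inf)
emp_cat_example <- cut(emp_length_example,
                       breaks = c(-Inf, 15, 30, 45, Inf),
                       labels = c("0-15", "15-30", "30-45", "45+"))

# cut() keeps NAs as NA, so give them an explicit "Missing" category
emp_cat_example <- as.character(emp_cat_example)
emp_cat_example[is.na(emp_length_example)] <- "Missing"
emp_cat_example <- factor(emp_cat_example,
                          levels = c("0-15", "15-30", "30-45", "45+", "Missing"))

# Count the observations per bin, including the "Missing" bin
print(table(emp_cat_example))
```

Setting the levels explicitly keeps the bins in their natural order rather than alphabetical order, which makes tables and plots easier to read.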

In the video, we illustrated the idea of coarse classification for employment length. The code from that example is reproduced in the R script below and can be adapted to coarse classify the int_rate variable.

This exercise is part of the course

Credit Risk Modeling in R


Exercise instructions

  • Make the necessary changes to the code provided to coarse classify int_rate, saving the result to a new variable called ir_cat.
    • First, replace loan_data$emp_cat with loan_data$ir_cat wherever it occurs in the R script, and replace loan_data$emp_length with loan_data$int_rate.
    • Next, bin the variable into the categories "0-8", "8-11", "11-13.5", and "13.5+" (replacing "0-15", "15-30", "30-45", and "45+"). Use > and <= exactly as in the video. Change the numbers in the conditional statements too (15, 30, and 45 become 8, 11, and 13.5, respectively).
  • Look at your new variable ir_cat using plot(loan_data$ir_cat).
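One way the replacements described above might look is sketched here. The small loan_data data frame and its int_rate values are hypothetical stand-ins so that the snippet runs on its own; in the exercise, loan_data is already loaded.

```r
# Hypothetical toy data standing in for the course's loan_data
loan_data <- data.frame(
  int_rate = c(5.42, 8.00, 10.65, 11.00, 12.69, 13.50, 15.96, NA)
)

# Start with an all-NA column, then fill in one bin at a time
loan_data$ir_cat <- rep(NA, length(loan_data$int_rate))

loan_data$ir_cat[which(loan_data$int_rate <= 8)] <- "0-8"
loan_data$ir_cat[which(loan_data$int_rate > 8 & loan_data$int_rate <= 11)] <- "8-11"
loan_data$ir_cat[which(loan_data$int_rate > 11 & loan_data$int_rate <= 13.5)] <- "11-13.5"
loan_data$ir_cat[which(loan_data$int_rate > 13.5)] <- "13.5+"

# Rows with a missing interest rate get their own category
loan_data$ir_cat[which(is.na(loan_data$int_rate))] <- "Missing"

loan_data$ir_cat <- as.factor(loan_data$ir_cat)

# Count the observations per category
print(table(loan_data$ir_cat))
```

Because each upper bound uses <=, a rate of exactly 8 falls in "0-8" and a rate of exactly 13.5 falls in "11-13.5". Calling plot(loan_data$ir_cat) on the resulting factor draws a bar chart of the counts per category.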

Hands-on interactive exercise

Try this exercise by completing the sample code below.

# Make the necessary replacements in the coarse classification example below 
loan_data$emp_cat <- rep(NA, length(loan_data$emp_length))

loan_data$emp_cat[which(loan_data$emp_length <= 15)] <- "0-15"
loan_data$emp_cat[which(loan_data$emp_length > 15 & loan_data$emp_length <= 30)] <- "15-30"
loan_data$emp_cat[which(loan_data$emp_length > 30 & loan_data$emp_length <= 45)] <- "30-45"
loan_data$emp_cat[which(loan_data$emp_length > 45)] <- "45+"
loan_data$emp_cat[which(is.na(loan_data$emp_length))] <- "Missing"

loan_data$emp_cat <- as.factor(loan_data$emp_cat)

# Look at your new variable using plot()