Imputation with logistic regression

Logistic regression is a popular choice for imputing binary variables. Unfortunately, there is no function similar to impute_lm() that does this. That's why you'll write one yourself!

Let's call the function impute_logreg(). Its first argument will be a data frame df whose missing values have been initialized and which contains missing values only in the column to be imputed. Its second argument will be a formula for the logistic regression model.

The function will do the following:

Keep the locations of the missing values.
Build the model.
Make predictions.
Replace the missing values with the predictions.

Don't worry about the line that creates imp_var: it is just a way to extract the name of the column to impute from the formula. Let's do some functional programming!
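To make the imp_var line less mysterious, here is a small sketch of what it does. The formula `y ~ x1 + x2` is a hypothetical example, not taken from the exercise. The `all.vars()` variant is shown only as an alternative way to get the same name.

```r
# Hypothetical formula, used only for illustration
f <- y ~ x1 + x2

# The second element of a two-sided formula is its left-hand side,
# so converting it to character gives the name of the response
imp_var <- as.character(f[2])
print(imp_var)

# Alternative: all.vars() lists the variables in the formula,
# with the response first
imp_var_alt <- all.vars(f)[1]
print(imp_var_alt)
```

Both approaches return the response name as a string, which can then be used to index the column of the data frame.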

This exercise is part of the course

Handling Missing Data with Imputations in R

Exercise instructions

Create a boolean mask for the cases where df[imp_var] is missing and assign it to missing_imp_var.
Fit a logistic regression model using the formula and the data that the function receives as arguments, remembering to set the correct family so that a logistic regression is fitted (pass it unquoted); assign the model to logreg_model.
Predict the response with the model and assign it to preds; remember to set the appropriate prediction type.
Use preds together with missing_imp_var to impute the missing values.
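The family and type arguments mentioned above are easy to mix up, so the following sketch shows their effect. The data frame `toy` is made up for illustration. The sketch compares the two most common prediction scales of a binomial `glm()`.

```r
# Toy data, made up for illustration
toy <- data.frame(
  x = c(-3, -2, -1, 1, 2, 3),
  y = c(0, 0, 1, 0, 1, 1)
)

# family = binomial (unquoted) makes glm() fit a logistic regression
model <- glm(y ~ x, data = toy, family = binomial)

# type = "link" returns predictions on the log-odds scale
log_odds <- predict(model, type = "link")

# type = "response" returns predicted probabilities between 0 and 1
probs <- predict(model, type = "response")

# The two scales are connected by the logistic function
print(all.equal(probs, plogis(log_odds)))
```

Because the function later thresholds the predictions at 0.5, it needs them on the probability scale.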

Hands-on interactive exercise

Try to solve this exercise by completing the sample code.

impute_logreg <- function(df, formula) {
  # Extract name of response variable
  imp_var <- as.character(formula[2])
  # Save locations where the response is missing
  missing_imp_var <- ___
  # Fit logistic regression model
  logreg_model <- ___(___, data = ___, family = ___)
  # Predict the response and convert it to 0s and 1s
  preds <- predict(___, type = ___)
  preds <- ifelse(preds >= 0.5, 1, 0)
  # Impute missing values with predictions
  df[missing_imp_var, imp_var] <- ___[___]
  return(df)
}
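
One way the blanks might be filled in is sketched below. It is not the course's reference solution. The name `impute_logreg_sketch` and the data frame `toy` are hypothetical. The sketch also departs from the template in one respect: it passes `newdata = df` to `predict()`. By default `glm()` drops rows with a missing response when fitting, so this is an assumption made here to get one prediction per row of `df`.

```r
impute_logreg_sketch <- function(df, formula) {
  # Extract name of response variable
  imp_var <- as.character(formula[2])
  # Save locations where the response is missing (logical vector)
  missing_imp_var <- is.na(df[[imp_var]])
  # Fit logistic regression model; rows with a missing response are dropped
  logreg_model <- glm(formula, data = df, family = binomial)
  # Predict probabilities for every row (assumption: predictors are complete)
  preds <- predict(logreg_model, newdata = df, type = "response")
  # Convert probabilities to 0s and 1s
  preds <- ifelse(preds >= 0.5, 1, 0)
  # Impute missing values with predictions
  df[missing_imp_var, imp_var] <- preds[missing_imp_var]
  return(df)
}

# Toy data, made up for illustration: the last two responses are missing
toy <- data.frame(
  x = c(-3, -2, -1, 1, 2, 3, -2.5, 2.5),
  y = c(0, 0, 1, 0, 1, 1, NA, NA)
)

imputed <- impute_logreg_sketch(toy, y ~ x)
print(imputed)
```

Only the rows that were missing are overwritten; observed values of the response are left untouched.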

