Cross-validation statistics

You used grid search with cross-validation to tune your random forest classifier, and now want to inspect the cross-validation results to make sure you did not overfit. In particular, you would like to compute the difference between the test score and the training score, each averaged over the folds. The dataset is available as X_train and y_train, the pipeline as pipe, and a number of modules are already loaded, including pandas as pd and GridSearchCV().

This exercise is part of the course Designing Machine Learning Workflows in Python.

Exercise instructions

Create a grid search object with three cross-validation folds and make sure it returns both training and test statistics.
Fit the grid search object to the training data.
Store the cross-validation results, available in the cv_results_ attribute of the fitted CV object, in a dataframe.
Print the difference between the column containing the mean test score and the column containing the mean training score.

Try this exercise by completing this sample code.

# Set up a grid search over your pipeline with three folds
grid_search = GridSearchCV(
  pipe, params, ____=3, return_train_score=____)

# Fit the grid search
gs = grid_search.____(____, ____)

# Store the results of CV into a pandas dataframe
results = pd.____(gs.____)

# Print the difference between mean test and training scores
print(
  results[____]-results['mean_train_score'])
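For reference, the four instructions can be sketched end to end as follows. This is one possible completion under stated assumptions, not the course's official solution: the synthetic data, the single-step pipeline, and the small max_depth grid are hypothetical stand-ins for the exercise's X_train, y_train, pipe, and params, which are not shown on this page.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Assumption: synthetic data stands in for the exercise's X_train and y_train.
X_train, y_train = make_classification(
    n_samples=200, n_features=10, random_state=0)

# Assumption: a one-step pipeline and a small grid stand in for `pipe` and `params`.
pipe = Pipeline([
    ("clf", RandomForestClassifier(n_estimators=20, random_state=0))])
params = {"clf__max_depth": [2, 5, 10]}

# Three-fold CV; return_train_score=True adds the train-score columns
# (such as mean_train_score) to cv_results_.
grid_search = GridSearchCV(pipe, params, cv=3, return_train_score=True)

# fit() returns the fitted grid search object itself.
gs = grid_search.fit(X_train, y_train)

# cv_results_ is a dict of arrays with one entry per parameter combination.
results = pd.DataFrame(gs.cv_results_)

# A negative value means the mean training score exceeds the mean test score.
gap = results["mean_test_score"] - results["mean_train_score"]
print(gap)
```

Each row of the printed series corresponds to one parameter combination in the grid. A gap that grows more negative as the model gets more flexible is a common sign of overfitting.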

In the previous chapters you established a solid foundation in supervised learning, complete with knowledge of deploying models in production, but always assumed that a labeled dataset would be available for your analysis. In this chapter, you take on the challenge of modeling data with very few labels, or none at all. This takes you on a journey into anomaly detection, a kind of unsupervised modeling, as well as distance-based learning, where beliefs about what constitutes similarity between two examples can be used in place of labels to help you achieve levels of accuracy comparable to a supervised workflow. Upon completing this chapter, you will know which tools to use to modify your workflow in order to overcome common real-world challenges.

Exercise 1: Anomaly detection
Exercise 2: A simple outlier
Exercise 3: LoF contamination
Exercise 4: Novelty detection
Exercise 5: A simple novelty
Exercise 6: Three novelty detectors
Exercise 7: Contamination revisited
Exercise 8: Distance-based learning
Exercise 9: Find the neighbor
Exercise 10: Not all metrics agree
Exercise 11: Unstructured data
Exercise 12: Restricted Levenshtein
Exercise 13: Bringing it all together
Exercise 14: Concluding remarks