Bringing it all together

You have two concerns about your pipeline at the arrhythmia detection startup:

The app was trained on patients of all ages but is used mostly by fitness-focused people, who tend to be young. You suspect a case of domain shift, so you want to discard all examples over 50 years old.
You are still worried about overfitting, so you want to check whether reducing the complexity of the random forest classifier and selecting a subset of features helps.
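The age restriction described above can be sketched with a pandas boolean mask. This is only an illustration on made-up data: the `toy` DataFrame and its `heart_rate` column are assumptions, and only the `age` and `class` column names come from the exercise. The exercise template itself indexes the data differently; the mask is simply one common way to express the same filter.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the arrh data; only 'age' and 'class' mirror the exercise.
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "age": rng.integers(18, 90, size=200),
    "heart_rate": rng.normal(70, 10, size=200),  # made-up feature
    "class": rng.integers(0, 2, size=200),
})

# Keep only the rows for people under 50.
under_50 = toy[toy["age"] < 50]

print(len(toy), len(under_50))
print(under_50["age"].max())
```

Comparing the row counts before and after the filter is a quick way to see how much training data the restriction costs.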

You will build a pipeline with a feature selection step, SelectKBest(), and a RandomForestClassifier, both already imported. You also have access to GridSearchCV(), Pipeline, numpy as np, and pickle. The data is available as arrh.

This exercise is part of the course

Designing Machine Learning Workflows in Python


Exercise instructions

Create a pipeline with SelectKBest() as step ft and RandomForestClassifier() as step clf.
Create a parameter grid to tune k in SelectKBest() and max_depth in RandomForestClassifier().
Use GridSearchCV() to optimize your pipeline over that grid, using only the data for people under 50 years old.
Save the optimized pipeline to a pickle file for production.
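When a grid targets steps inside a Pipeline, scikit-learn expects each key in the form `<step name>__<parameter name>`. The sketch below assumes the step names ft and clf from the instructions and the candidate values shown in the exercise template, and checks the keys against `pipe.get_params()` before any search is run.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest
from sklearn.pipeline import Pipeline

pipe = Pipeline([
    ("ft", SelectKBest()),
    ("clf", RandomForestClassifier(random_state=2)),
])

# Grid keys follow '<step name>__<parameter name>' (double underscore).
grid = {"ft__k": [5, 10], "clf__max_depth": [10, 20]}

# get_params() lists every tunable name, so a misspelled key shows up here.
valid = pipe.get_params()
print(all(key in valid for key in grid))
```

Checking the keys up front is optional, but it surfaces a typo in a step or parameter name before an expensive search starts.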

Try this exercise by completing the sample code below.

# Create a pipeline 
pipe = Pipeline([
  ('ft', ____), ('clf', ____(random_state=2))])

# Create a parameter grid
grid = {'ft__k':[5, 10], '____':[10, 20]}

# Execute grid search CV on a dataset containing under 50s
grid_search = ____(pipe, param_grid=grid)
arrh = arrh.____[____(arrh['age'] < 50)]
____.____(arrh.drop('class', axis=1), arrh['class'])

# Push the fitted pipeline to production
with ____('pipe.pkl', ____) as file:
    pickle.dump(____, file)
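For reference, here is one way the whole workflow could look from start to finish. It is a sketch under stated assumptions, not the exercise's official solution: the real arrh data is not available here, so a synthetic DataFrame stands in for it, `n_estimators=20` and `cv=3` are chosen only to keep the run short, and the file is written to a temporary directory. It pickles the fitted search object; saving `grid_search.best_estimator_` instead would be an equally reasonable choice.

```python
import os
import pickle
import tempfile

import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Synthetic stand-in for arrh (assumption): 15 features plus 'age' and 'class'.
X, y = make_classification(n_samples=300, n_features=15, random_state=0)
arrh = pd.DataFrame(X, columns=[f"f{i}" for i in range(15)])
arrh["age"] = np.random.default_rng(0).integers(18, 90, size=300)
arrh["class"] = y

pipe = Pipeline([
    ("ft", SelectKBest()),
    ("clf", RandomForestClassifier(random_state=2, n_estimators=20)),
])
grid = {"ft__k": [5, 10], "clf__max_depth": [10, 20]}

# Restrict to under-50s, then split into features and labels.
under_50 = arrh[arrh["age"] < 50]
features = under_50.drop("class", axis=1)
labels = under_50["class"]

# Cross-validated search over every combination in the grid.
grid_search = GridSearchCV(pipe, param_grid=grid, cv=3)
grid_search.fit(features, labels)

# Write the fitted search object to disk, then read it back.
path = os.path.join(tempfile.mkdtemp(), "pipe.pkl")
with open(path, "wb") as file:
    pickle.dump(grid_search, file)
with open(path, "rb") as file:
    restored = pickle.load(file)

print(restored.best_params_)
```

Loading the file back and comparing predictions with the in-memory object is a cheap sanity check before handing the pickle to production.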

In the previous chapters, you built a solid foundation in supervised learning, including how to deploy models in production, but you always assumed that a labeled dataset would be available for your analysis. In this chapter, you take on the challenge of modeling data with few or no labels. This takes you on a journey into anomaly detection, a kind of unsupervised modeling, and into distance-based learning, where beliefs about what makes two examples similar can stand in for labels and help you reach accuracy comparable to a supervised workflow. By the end of the chapter, you will know which tools to use to adapt your workflow to common real-world challenges.

Exercise 1: Anomaly detection
Exercise 2: A simple outlier
Exercise 3: LoF contamination
Exercise 4: Novelty detection
Exercise 5: A simple novelty
Exercise 6: Three novelty detectors
Exercise 7: Contamination revisited
Exercise 8: Distance-based learning
Exercise 9: Find the neighbor
Exercise 10: Not all metrics agree
Exercise 11: Unstructured data
Exercise 12: Restricted Levenshtein
Exercise 13: Bringing it all together
Exercise 14: Concluding remarks