Bringing it all together

You have two concerns about your pipeline at the arrhythmia detection startup:

The app was trained on patients of all ages, but it is used mostly by fitness enthusiasts, who tend to be young. You suspect this could be a case of domain shift, so you want to discard all examples from patients over 50.
You are still worried about overfitting, so you want to see whether making the random forest classifier less complex and selecting some features might help.

You will create a pipeline with a feature selection step SelectKBest() and a RandomForestClassifier, both of which have already been imported. You also have access to GridSearchCV(), Pipeline, numpy as np and pickle. The data is available as arrh.

This exercise is part of the course

Designing Machine Learning Workflows in Python

Exercise instructions

Create a pipeline with SelectKBest() as step ft and RandomForestClassifier() as step clf.
Create a parameter grid to tune k in SelectKBest() and max_depth in RandomForestClassifier().
Use GridSearchCV() to optimize your pipeline over that grid, using only the data for people under 50 years old.
Save the optimized pipeline to a pickle for production.

Hands-on interactive exercise

Try this exercise by completing this sample code.

# Create a pipeline 
pipe = Pipeline([
  ('ft', ____), ('clf', ____(random_state=2))])

# Create a parameter grid
grid = {'ft__k':[5, 10], '____':[10, 20]}

# Execute grid search CV on a dataset containing under 50s
grid_search = ____(pipe, param_grid=grid)
arrh = arrh.____[____(arrh['age'] < 50)]
____.____(arrh.drop('class', axis=1), arrh['class'])

# Push the fitted pipeline to production
with ____('pipe.pkl', ____) as file:
    pickle.dump(____, file)
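
For reference, one way the completed workflow might look is sketched below. This is not the course's official solution. The course's `arrh` data is not available here, so the sketch builds a small synthetic DataFrame with hypothetical feature columns `f0` to `f11` plus `age` and `class`. It keeps the filtered rows in a separate variable, `under50`, rather than overwriting `arrh`. It also assumes that pickling the fitted search object is acceptable; by default `GridSearchCV` refits the best pipeline, so pickling `grid_search.best_estimator_` would be an alternative.

```python
import pickle

import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Hypothetical stand-in for the course's `arrh` DataFrame (assumption, not real data)
rng = np.random.default_rng(0)
n_rows = 200
arrh = pd.DataFrame(
    rng.normal(size=(n_rows, 12)),
    columns=[f"f{i}" for i in range(12)],
)
arrh["age"] = rng.integers(18, 80, size=n_rows)
arrh["class"] = (arrh["f0"] + arrh["f1"] > 0).astype(int)

# Pipeline: feature selection step 'ft' followed by classifier step 'clf'
pipe = Pipeline([
    ("ft", SelectKBest()),
    ("clf", RandomForestClassifier(random_state=2)),
])

# Grid keys use the '<step name>__<parameter>' convention
grid = {"ft__k": [5, 10], "clf__max_depth": [10, 20]}

# Keep only the rows for people under 50, then run the grid search on them
grid_search = GridSearchCV(pipe, param_grid=grid)
under50 = arrh.iloc[np.where(arrh["age"] < 50)[0]]
grid_search.fit(under50.drop("class", axis=1), under50["class"])

# Persist the fitted search object (binary write mode is required by pickle)
with open("pipe.pkl", "wb") as file:
    pickle.dump(grid_search, file)
```

The double underscore in `ft__k` and `clf__max_depth` is what routes each grid value to the right pipeline step, which is why the step names chosen in `Pipeline` matter.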


In the previous chapters you established a solid foundation in supervised learning, including how to deploy models in production, but you always assumed a labeled dataset would be available for your analysis. In this chapter, you take on the challenge of modeling data with very few labels, or none at all. This takes you on a journey into anomaly detection, a kind of unsupervised modeling, as well as distance-based learning, where beliefs about what constitutes similarity between two examples can be used in place of labels to help you achieve levels of accuracy comparable to a supervised workflow. Upon completing this chapter, you will know which tools to use to modify your workflow in order to overcome common real-world challenges.

Exercise 1: Anomaly detection
Exercise 2: A simple outlier
Exercise 3: LoF contamination
Exercise 4: Novelty detection
Exercise 5: A simple novelty
Exercise 6: Three novelty detectors
Exercise 7: Contamination revisited
Exercise 8: Distance-based learning
Exercise 9: Find the neighbor
Exercise 10: Not all metrics agree
Exercise 11: Unstructured data
Exercise 12: Restricted Levenshtein
Exercise 13: Bringing it all together
Exercise 14: Concluding remarks