Not all metrics agree

In the previous exercise you saw that not all metrics agree when it comes to identifying nearest neighbors. Does this mean they might disagree on outliers, too? You decide to put this to the test. You use the same data as before, but this time feed it into a Local Outlier Factor (LOF) detector. The class LocalOutlierFactor has been made available to you as lof, and the data is available as features.
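To make the idea of metrics disagreeing concrete, here is a minimal sketch in plain Python. The three distance functions and the toy vectors query, a, and b are illustrative assumptions, not the course data; the point is only that the "nearest" candidate can change with the metric.

```python
import math

def euclidean(u, v):
    # Straight-line distance between two equal-length vectors
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def hamming(u, v):
    # Fraction of positions at which the two vectors differ
    return sum(a != b for a, b in zip(u, v)) / len(u)

def jaccard(u, v):
    # 1 - |intersection| / |union| over the non-zero positions
    both = sum(1 for a, b in zip(u, v) if a and b)
    either = sum(1 for a, b in zip(u, v) if a or b)
    return 1 - both / either if either else 0.0

# Hypothetical binary vectors, chosen so the metrics disagree
query = [1, 1, 1, 0, 0, 0, 0, 0]
candidates = {
    "a": [1, 0, 0, 0, 0, 0, 0, 0],
    "b": [1, 1, 1, 1, 1, 1, 0, 0],
}

# For each metric, find the candidate closest to the query
nearest = {
    name: min(candidates, key=lambda c: fn(query, candidates[c]))
    for name, fn in [("euclidean", euclidean),
                     ("hamming", hamming),
                     ("jaccard", jaccard)]
}
print(nearest)
```

With these toy vectors, Euclidean and Hamming both pick a (it differs from the query in fewer positions), while Jaccard picks b (it shares a larger fraction of its non-zero positions with the query).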

This exercise is part of the course

Designing Machine Learning Workflows in Python


Exercise instructions

Detect outliers in features using the euclidean metric.
Detect outliers in features using the hamming metric.
Detect outliers in features using the jaccard metric.
Check whether all three metrics agree on at least one outlier.

Hands-on interactive exercise

Have a go at this exercise by completing this sample code.

# Compute outliers according to the euclidean metric
out_eucl = ____(metric='euclidean').fit_predict(features)

# Compute outliers according to the hamming metric
out_hamm = ____(metric=____).fit_predict(features)

# Compute outliers according to the jaccard metric
out_jacc = ____(____=____).____(features)

# Find if the metrics agree on any one datapoint
print(any(____ + ____ + ____ == ____))
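One possible way to work through the steps above is sketched here as a self-contained script. It assumes scikit-learn is installed and substitutes randomly generated binary data for the course's features, so the results will not match the exercise. The sketch relies on the fact that fit_predict labels outliers as -1 and inliers as 1, so three agreeing outlier votes sum to -3.

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor as lof

# Assumption: stand-in data, not the course dataset
rng = np.random.default_rng(0)
features = rng.integers(0, 2, size=(200, 12)).astype(float)

# Compute outliers according to the euclidean metric
out_eucl = lof(metric='euclidean').fit_predict(features)

# Compute outliers according to the hamming metric
out_hamm = lof(metric='hamming').fit_predict(features)

# Compute outliers according to the jaccard metric
out_jacc = lof(metric='jaccard').fit_predict(features)

# Each prediction is -1 (outlier) or 1 (inlier), so a sum of -3
# means all three metrics flagged the same datapoint
votes = out_eucl + out_hamm + out_jacc
all_agree = bool(np.any(votes == -3))
print(all_agree)
```

Summing the label arrays is a compact way to compare detectors; a sum of 3 would likewise mean that all three metrics consider a point an inlier.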



In the previous chapters you established a solid foundation in supervised learning, complete with knowledge of deploying models in production, but always assumed that a labeled dataset would be available for your analysis. In this chapter, you take on the challenge of modeling data with very few labels, or none at all. This takes you on a journey into anomaly detection, a kind of unsupervised modeling, as well as distance-based learning, where beliefs about what constitutes similarity between two examples can be used in place of labels to help you achieve levels of accuracy comparable to a supervised workflow. Upon completing this chapter, you will stand out from the crowd of data scientists by confidently knowing which tools to use to adapt your workflow to common real-world challenges.
