Remapping categories II

In the last exercise, you determined that the distance cutoff for remapping typos of the 'american', 'asian', and 'italian' cuisine types stored in the cuisine_type column should be 80.

In this exercise, you'll put it all together: for each correct cuisine type, use the extract() function from fuzzywuzzy.process to find the matches with a similarity score of 80 or higher, then replace those matches with the correct type. Remember that when you compare a string with an array of strings using process.extract(), the output is a list of tuples, each formatted as:

(closest match, similarity score, index of match)
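To make the tuple shape concrete, here is a minimal sketch that imitates it using only the standard library. Note the assumptions: `extract_like` and `similarity` are hypothetical helpers, not part of fuzzywuzzy, and the score comes from `difflib.SequenceMatcher` rather than fuzzywuzzy's own scorer, so the numbers will differ from what process.extract() reports.

```python
from difflib import SequenceMatcher

def similarity(a, b):
    """Score from 0 to 100. A stand-in only, not fuzzywuzzy's scorer."""
    return round(100 * SequenceMatcher(None, a.lower(), b.lower()).ratio())

def extract_like(query, choices, limit=None):
    """Return (closest match, similarity score, index of match) tuples, best first."""
    scored = [
        (choice, similarity(query, choice), index)
        for index, choice in enumerate(choices)
    ]
    # Highest similarity first, mirroring the ordering of a ranked match list
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:limit]

# Hypothetical sample values standing in for the cuisine_type column
cuisines = ["america", "italian", "asiane", "italien"]
matches = extract_like("italian", cuisines)

for match, score, index in matches:
    print(match, score, index)
```

Each tuple can be unpacked by position, so `match[0]` is the matched string and `match[1]` is its score, which is the value compared against the cutoff.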

The restaurants DataFrame is in your environment, and you have access to a categories list that contains the correct cuisine types ('italian', 'asian', and 'american').
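The remapping logic described above can be sketched without any third-party packages. This is an illustration under stated assumptions, not the exercise solution: `cuisine_types` is a hypothetical plain list standing in for the cuisine_type column, `remap` and `similarity` are made-up helper names, and the score uses `difflib.SequenceMatcher` in place of fuzzywuzzy's scorer.

```python
from difflib import SequenceMatcher

def similarity(a, b):
    """Score from 0 to 100. A stand-in only, not fuzzywuzzy's scorer."""
    return round(100 * SequenceMatcher(None, a.lower(), b.lower()).ratio())

def remap(values, categories, cutoff=80):
    """Replace each value that scores at or above the cutoff with the category."""
    remapped = list(values)
    for category in categories:
        for i, value in enumerate(values):
            if similarity(category, value) >= cutoff:
                remapped[i] = category
    return remapped

categories = ["italian", "asian", "american"]
# Hypothetical sample data: three typos, one correct value, one unrelated value
cuisine_types = ["italien", "asiane", "americann", "american", "thai"]

cleaned = remap(cuisine_types, categories)
print(cleaned)
```

Values that do not reach the cutoff for any category, such as 'thai' here, are left untouched. When doing the same thing with process.extract(), keep in mind that its limit argument defaults to 5, so it needs to be raised (for example to the length of the column) if every row should be scored.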

This exercise is part of the course

Cleaning Data in Python

Hands-on interactive exercise

Try this exercise by completing this sample code.

# Inspect the unique values of the cuisine_type column
print(____)


In this chapter, you'll learn how to overcome some of the most common dirty data problems. You'll convert data types, apply range constraints to remove future data points, and remove duplicated data points to avoid double-counting.

Exercise 1: Data type constraints
Exercise 2: Common data types
Exercise 3: Numeric data or ... ?
Exercise 4: Summing strings and concatenating numbers
Exercise 5: Data range constraints
Exercise 6: Tire size constraints
Exercise 7: Back to the future
Exercise 8: Uniqueness constraints
Exercise 9: How big is your subset?
Exercise 10: Finding duplicates
Exercise 11: Treating duplicates

Categorical and text data can often be some of the messiest parts of a dataset due to their unstructured nature. In this chapter, you’ll learn how to fix whitespace and capitalization inconsistencies in category labels, collapse multiple categories into one, and reformat strings for consistency.

Exercise 1: Membership constraints
Exercise 2: Members only
Exercise 3: Finding consistency
Exercise 4: Categorical variables
Exercise 5: Categories of errors
Exercise 6: Inconsistent categories
Exercise 7: Remapping categories
Exercise 8: Cleaning text data
Exercise 9: Removing titles and taking names
Exercise 10: Keeping it descriptive

In this chapter, you'll dive into more advanced data cleaning problems, such as ensuring that weights are all written in kilograms instead of pounds. You'll also gain invaluable skills that will help you verify that values have been added correctly, and that missing values don't negatively impact your analyses.

Exercise 1: Uniformity
Exercise 2: Ambiguous dates
Exercise 3: Uniform currencies
Exercise 4: Uniform dates
Exercise 5: Cross field validation
Exercise 6: Cross field or no cross field?
Exercise 7: How's our data integrity?
Exercise 8: Completeness
Exercise 9: Is this missing at random?
Exercise 10: Missing investors
Exercise 11: Follow the money

Record linkage is a powerful technique for merging multiple datasets when values have typos or different spellings. In this chapter, you'll learn how to link records by calculating the similarity between strings. You'll then use your new skills to join two restaurant review datasets into one clean master dataset.

Exercise 1: Comparing strings
Exercise 2: Minimum edit distance
Exercise 3: The cutoff point
Exercise 4: Remapping categories II (current exercise)
Exercise 5: Generating pairs
Exercise 6: To link or not to link?
Exercise 7: Pairs of restaurants
Exercise 8: Similar restaurants
Exercise 9: Linking DataFrames
Exercise 10: Getting the right index
Exercise 11: Linking them together!
Exercise 12: Congratulations!