Treating duplicates

In the last exercise, you saw that the new update feeding into ride_sharing contains a bug that generates both complete and incomplete duplicated rows for some values of the ride_id column, with occasional discrepant values in the user_birth_year and duration columns.
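To make the distinction between complete and incomplete duplicates concrete, here is a minimal sketch on a small made-up DataFrame. The name `rides` and its values are assumptions for illustration only and are not the real ride_sharing data.

```python
import pandas as pd

# Hypothetical toy data: ride 1 is a complete duplicate, ride 2 an incomplete one
rides = pd.DataFrame({
    "ride_id": [1, 1, 2, 2, 3],
    "user_birth_year": [1990, 1990, 1985, 1986, 2000],
    "duration": [10, 10, 20, 22, 15],
})

# Rows that are identical across every column (complete duplicates)
complete = rides.duplicated(keep=False)

# Rows that share a ride_id with another row, whatever the other columns hold
same_id = rides.duplicated(subset="ride_id", keep=False)

# Incomplete duplicates: same ride_id, but at least one discrepant value
incomplete = same_id & ~complete

print(rides[complete])
print(rides[incomplete])
```

Passing `keep=False` flags every member of a duplicated set rather than only the repeats, which makes the affected rows easier to inspect side by side.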

In this exercise, you will treat those duplicated rows by first dropping the complete duplicates, and then merging the incomplete duplicate rows into one, keeping the mean of duration and the minimum of user_birth_year for each set of incomplete duplicate rows.
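The two steps are needed because dropping complete duplicates alone leaves the incomplete ones in place. The sketch below shows this on made-up data (the `rides` table and its values are hypothetical), and then works out the merge rule by hand for a single ride_id.

```python
import pandas as pd

# Hypothetical toy data, not the real ride_sharing table
rides = pd.DataFrame({
    "ride_id": [1, 1, 2, 2],
    "user_birth_year": [1990, 1990, 1985, 1986],
    "duration": [10, 10, 20, 22],
})

# drop_duplicates() only removes rows that are identical in every column
deduped = rides.drop_duplicates()
still_duplicated = not deduped["ride_id"].is_unique  # ride 2 appears twice

# Merge rule for ride 2: mean of duration, minimum of user_birth_year
ride_2 = deduped[deduped["ride_id"] == 2]
merged_duration = ride_2["duration"].mean()
merged_birth_year = ride_2["user_birth_year"].min()

print(len(deduped), still_duplicated, merged_duration, merged_birth_year)
```

Here the two rows for ride 2 differ in both columns, so they survive `drop_duplicates()` and have to be collapsed with an explicit rule per column.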

This exercise is part of the course

Cleaning Data in Python

Exercise instructions

Drop complete duplicates in ride_sharing and store the results in ride_dup.
Create the statistics dictionary, which holds the minimum aggregation for user_birth_year and the mean aggregation for duration.
Drop incomplete duplicates by grouping by ride_id and applying the aggregation in statistics.
Find duplicates again and run the assert statement to verify de-duplication.

Hands-on interactive exercise

Try this exercise by completing this sample code.

# Drop complete duplicates from ride_sharing
ride_dup = ____.____()

# Create statistics dictionary for aggregation function
statistics = {'user_birth_year': ____, 'duration': ____}

# Group by ride_id and compute new statistics
ride_unique = ride_dup.____('____').____(____).reset_index()

# Find duplicated values again
duplicates = ride_unique.____(subset = 'ride_id', keep = False)
duplicated_rides = ride_unique[duplicates == True]

# Assert duplicates are processed
assert duplicated_rides.shape[0] == 0
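One possible way to fill in the blanks is sketched below, run end to end on a small made-up stand-in for ride_sharing, since the real table is not part of this page. The data values are assumptions; the pandas calls (`drop_duplicates`, `groupby`, `agg`, `duplicated`) are the standard DataFrame methods.

```python
import pandas as pd

# Hypothetical stand-in for ride_sharing (values invented for illustration)
ride_sharing = pd.DataFrame({
    "ride_id": [1, 1, 2, 2, 3],
    "user_birth_year": [1990, 1990, 1985, 1986, 2000],
    "duration": [10, 10, 20, 22, 15],
})

# Drop complete duplicates from ride_sharing
ride_dup = ride_sharing.drop_duplicates()

# Create statistics dictionary: one aggregation per column
statistics = {"user_birth_year": "min", "duration": "mean"}

# Group by ride_id and collapse each group into a single row
ride_unique = ride_dup.groupby("ride_id").agg(statistics).reset_index()

# Find duplicated values again
duplicates = ride_unique.duplicated(subset="ride_id", keep=False)
duplicated_rides = ride_unique[duplicates]

# Assert duplicates are processed
assert duplicated_rides.shape[0] == 0

print(ride_unique)
```

Because `groupby("ride_id")` yields exactly one output row per key, the final assert is expected to pass whenever the aggregation step has run; it mainly guards against skipping or mis-wiring that step.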


In this chapter, you'll learn how to overcome some of the most common dirty data problems. You'll convert data types, apply range constraints to remove future data points, and remove duplicated data points to avoid double-counting.

Exercise 1: Data type constraints
Exercise 2: Common data types
Exercise 3: Numeric data or ... ?
Exercise 4: Summing strings and concatenating numbers
Exercise 5: Data range constraints
Exercise 6: Tire size constraints
Exercise 7: Back to the future
Exercise 8: Uniqueness constraints
Exercise 9: How big is your subset?
Exercise 10: Finding duplicates
Exercise 11: Treating duplicates

