Aggregating partial duplicates

Another way to handle partial duplicates is to compute a summary statistic, such as the mean, median, maximum, or minimum, of the values that differ between them. This is useful when you're not sure how your data was collected and want an average, or when domain knowledge tells you that overestimating is preferable to underestimating (or vice versa).

dplyr is loaded and bike_share_rides is available.
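
As a rough illustration of how the choice of summary statistic plays out, the sketch below collapses partial duplicates with `summarize()` and computes several statistics side by side. The `rides` data frame and its values are invented for illustration; they only mimic the `ride_id`, `date`, and `duration_min` columns that the exercise refers to.

```r
library(dplyr)

# Hypothetical toy data (invented values): ride 1 appears twice on the
# same date with two different durations, i.e. a partial duplicate.
rides <- data.frame(
  ride_id = c(1, 1, 2),
  date = as.Date(c("2017-04-15", "2017-04-15", "2017-04-16")),
  duration_min = c(10, 14, 7)
)

# One row per ride_id/date pair, with several candidate summaries
duration_summary <- rides %>%
  group_by(ride_id, date) %>%
  summarize(
    duration_mean = mean(duration_min),     # a middle-of-the-road estimate
    duration_median = median(duration_min), # less sensitive to outliers
    duration_max = max(duration_min),       # prefer overestimating
    duration_lowest = min(duration_min),    # prefer underestimating
    .groups = "drop"
  )

print(duration_summary)
```

Unlike the `mutate()` plus `distinct()` approach used in this exercise, `summarize()` keeps only the grouping columns and the new summary columns, so it fits best when the other columns are not needed.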

This exercise is part of the course Cleaning Data in R.

Exercise instructions

Group bike_share_rides by ride_id and date.
Add a column called duration_min_avg that contains the mean ride duration for the row's ride_id and date.
Remove duplicates based on ride_id and date, keeping all columns of the data frame.
Remove the duration_min column.

Hands-on interactive exercise

Try to solve this exercise by completing the sample code.

bike_share_rides %>%
  # Group by ride_id and date
  ___ %>%
  # Add duration_min_avg column
  mutate(duration_min_avg = ___ ) %>%
  # Remove duplicates based on ride_id and date, keep all cols
  ___ %>%
  # Remove duration_min column
  ___(-___)
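
One possible way to fill in the blanks is sketched below. It is not presented as the official solution, and because the real `bike_share_rides` data is not reproduced here, the block builds a small stand-in data frame with invented values so that it runs on its own.

```r
library(dplyr)

# Hypothetical stand-in for bike_share_rides (invented values):
# rides 1 and 3 are partial duplicates that differ only in duration_min.
bike_share_rides <- data.frame(
  ride_id = c(1, 1, 2, 3, 3),
  date = as.Date(c("2017-04-15", "2017-04-15", "2017-04-16",
                   "2017-04-17", "2017-04-17")),
  duration_min = c(10, 14, 7, 20, 30)
)

rides_aggregated <- bike_share_rides %>%
  # Group by ride_id and date
  group_by(ride_id, date) %>%
  # Add duration_min_avg column: the mean duration within each group
  mutate(duration_min_avg = mean(duration_min)) %>%
  # Remove duplicates based on ride_id and date, keep all cols
  distinct(ride_id, date, .keep_all = TRUE) %>%
  # Remove duration_min column
  select(-duration_min)

print(rides_aggregated)
```

The result is still grouped by `ride_id` and `date`; adding `ungroup()` at the end of the pipeline is a reasonable extra step if later operations should not be performed per group.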


Skill level: Intermediate


In this chapter, you'll learn how to overcome some of the most common dirty data problems. You'll convert data types, apply range constraints to remove future data points, and remove duplicated data points to avoid double-counting.

Exercise 1: Data type constraints
Exercise 2: Common data types
Exercise 3: Converting data types
Exercise 4: Trimming strings
Exercise 5: Range constraints
Exercise 6: Ride duration constraints
Exercise 7: Back to the future
Exercise 8: Uniqueness constraints
Exercise 9: Full duplicates
Exercise 10: Removing partial duplicates
Exercise 11: Aggregating partial duplicates

Categorical and text data can often be some of the messiest parts of a dataset due to their unstructured nature. In this chapter, you’ll learn how to fix whitespace and capitalization inconsistencies in category labels, collapse multiple categories into one, and reformat strings for consistency.

Exercise 1: Checking membership
Exercise 2: Members only
Exercise 3: Not a member
Exercise 4: Categorical data problems
Exercise 5: Identifying inconsistency
Exercise 6: Correcting inconsistency
Exercise 7: Collapsing categories
Exercise 8: Cleaning text data
Exercise 9: Detecting inconsistent text data
Exercise 10: Replacing and removing
Exercise 11: Invalid phone numbers

In this chapter, you’ll dive into more advanced data cleaning problems, such as ensuring that weights are all written in kilograms instead of pounds. You’ll also gain invaluable skills that will help you verify that values have been added correctly and that missing values don’t negatively impact your analyses.

Exercise 1: Uniformity
Exercise 2: Date uniformity
Exercise 3: Currency uniformity
Exercise 4: Cross field validation
Exercise 5: Validating totals
Exercise 6: Validating age
Exercise 7: Completeness
Exercise 8: Types of missingness
Exercise 9: Visualizing missing data
Exercise 10: Treating missing data

Record linkage is a powerful technique for merging multiple datasets when values have typos or different spellings. In this chapter, you'll learn how to link records by calculating the similarity between strings. You'll then use your new skills to join two restaurant review datasets into one clean master dataset.

Exercise 1: Comparing strings
Exercise 2: Calculating distance
Exercise 3: Small distance, small difference
Exercise 4: Fixing typos with string distance
Exercise 5: Generating and comparing pairs
Exercise 6: Link or join?
Exercise 7: Pair blocking
Exercise 8: Comparing pairs
Exercise 9: Scoring and linking
Exercise 10: Score then select or select then score?
Exercise 11: Putting it together
Exercise 12: Congratulations!