In this chapter, you'll learn how to overcome some of the most common dirty data problems. You'll convert data types, apply range constraints to remove future data points, and remove duplicated data points to avoid double-counting.
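
The three fixes named above can be sketched in a few lines of pandas. This is a minimal illustration rather than the course's own solution: the `rides` table, its column names, and the fixed cutoff date are all hypothetical.

```python
import pandas as pd

# Hypothetical ride-sharing records; names and values are made up.
rides = pd.DataFrame({
    "ride_id": [1, 2, 2, 3],
    "duration": ["12 minutes", "8 minutes", "8 minutes", "30 minutes"],
    "ride_date": ["2020-01-05", "2020-02-10", "2020-02-10", "2035-06-01"],
})

# Data type constraint: strip the unit text, then convert strings to integers
# so that summing adds numbers instead of concatenating text.
rides["duration_min"] = (
    rides["duration"].str.replace(" minutes", "", regex=False).astype(int)
)

# Data range constraint: parse the dates and drop rows dated after a cutoff.
rides["ride_date"] = pd.to_datetime(rides["ride_date"])
cutoff = pd.Timestamp("2021-01-01")  # assumed "today" for this example
rides = rides[rides["ride_date"] <= cutoff]

# Uniqueness constraint: drop fully duplicated rows to avoid double-counting.
rides = rides.drop_duplicates()

total_minutes = rides["duration_min"].sum()
```

With a fixed cutoff the result is reproducible; in practice the cutoff would more likely be the current date.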

Data type constraints

Common data types

Numeric data or ... ?

Summing strings and concatenating numbers

Data range constraints

Tire size constraints

Back to the future

Uniqueness constraints

How big is your subset?

Finding duplicates

Treating duplicates

Common data problems

Categorical and text data can often be some of the messiest parts of a dataset due to their unstructured nature. In this chapter, you’ll learn how to fix whitespace and capitalization inconsistencies in category labels, collapse multiple categories into one, and reformat strings for consistency.
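
As a rough sketch of those steps, the snippet below normalizes whitespace and capitalization, enforces a membership constraint, and collapses several labels into two broader categories. The `devices` table, the `os` column, and the mapping are assumptions made for illustration only.

```python
import pandas as pd

# Hypothetical device data with messy category labels.
devices = pd.DataFrame({
    "os": [" Linux", "linux ", "MacOS", "ANDROID", "iOS", "windows", "Symbain"],
})

# Fix whitespace and capitalization inconsistencies.
devices["os"] = devices["os"].str.strip().str.lower()

# Assumed mapping from individual labels to broader categories.
mapping = {
    "linux": "desktop",
    "macos": "desktop",
    "windows": "desktop",
    "android": "mobile",
    "ios": "mobile",
}

# Membership constraint: keep only rows whose label is a known category.
valid = devices[devices["os"].isin(list(mapping))].copy()

# Collapse multiple categories into one.
valid["os_type"] = valid["os"].map(mapping)
```

Rows that fail the membership check (here the misspelled label) could instead be corrected by hand or matched to the closest valid category.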

Membership constraints

Options

Membership Constraint

Other Constraint

Members only

Finding consistency

Categorical variables

White spaces and inconsistency

Creating or remapping categories

Categories of errors

Inconsistent categories

Remapping categories

Cleaning text data

Removing titles and taking names

Keeping it descriptive

Text and categorical data problems

In this chapter, you’ll dive into more advanced data cleaning problems, such as ensuring that weights are all written in kilograms instead of pounds. You’ll also gain invaluable skills that will help you verify that values have been added correctly and that missing values don’t negatively impact your analyses.
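
The following sketch shows one way these checks might look in pandas: converting pounds to kilograms, validating one field against others, and counting missing values. The `orders` table and all of its columns are hypothetical.

```python
import pandas as pd

# Hypothetical order records; column names and values are made up.
orders = pd.DataFrame({
    "weight": [2.0, 4.4, 1.5, None],
    "unit": ["kg", "lb", "kg", "kg"],
    "item_price": [10.0, 20.0, 5.0, 8.0],
    "shipping": [2.0, 3.0, 1.0, 2.0],
    "total": [12.0, 23.0, 7.0, 10.0],
})

# Uniformity: convert weights recorded in pounds to kilograms.
in_pounds = orders["unit"] == "lb"
orders.loc[in_pounds, "weight"] = orders.loc[in_pounds, "weight"] * 0.45359237
orders.loc[in_pounds, "unit"] = "kg"

# Cross field validation: the total should equal the sum of its parts.
consistent = orders["item_price"] + orders["shipping"] == orders["total"]
inconsistent_rows = orders[~consistent]

# Completeness: count missing weights, then keep only complete rows.
missing_weights = orders["weight"].isna().sum()
complete = orders.dropna(subset=["weight"])
```

Dropping rows is only one option for missing values; whether to drop or impute depends on why the values are missing.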

Uniformity

Ambiguous dates

Uniform currencies

Uniform dates

Cross field validation

Cross field or no cross field?

How's our data integrity?

Completeness

Is this missing at random?

Missing investors

Follow the money

Advanced data problems

Record linkage is a powerful technique for merging multiple datasets when values contain typos or different spellings. In this chapter, you'll learn how to link records by calculating the similarity between strings. You'll then use your new skills to join two restaurant review datasets into one clean master dataset.
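
To make the idea concrete, here is a small pure-Python sketch of string similarity based on minimum edit distance (Levenshtein distance), followed by a naive linker. It deliberately avoids any record linkage library; the function names, the 0.8 threshold, and the restaurant names are assumptions chosen for illustration.

```python
def edit_distance(a: str, b: str) -> int:
    """Minimum number of insertions, deletions, and substitutions from a to b."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Case-insensitive similarity between 0 and 1."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1 - edit_distance(a.lower(), b.lower()) / longest


def link(left, right, threshold=0.8):
    """Pair each left name with its most similar right name above the threshold."""
    pairs = []
    for name in left:
        best = max(right, key=lambda candidate: similarity(name, candidate))
        if similarity(name, best) >= threshold:
            pairs.append((name, best))
    return pairs


# Hypothetical restaurant names from two review datasets.
left_names = ["Arnie Morton's", "Cafe Bizou", "Campanile"]
right_names = ["arnie mortons", "cafe bizou", "spago"]
linked = link(left_names, right_names)
```

Comparing every name against every other name scales poorly, which is why larger linkage jobs usually restrict comparisons to candidate pairs that already agree on some field, such as city or cuisine type.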

Comparing strings

Minimum edit distance

The cutoff point

Remapping categories II

Generating pairs

To link or not to link?

Pairs of restaurants

Similar restaurants

Linking DataFrames

Getting the right index

Linking them together!

Congratulations!

Record linkage

Ride sharing dataset

Airlines dataset

Banking dataset

Restaurants dataset

Restaurants dataset II

<h2>Discover How to Clean Data in Python</h2>
It's commonly said that data scientists spend 80% of their time cleaning and manipulating data and only 20% of their time analyzing it. Data cleaning is an essential step for every data scientist, as analyzing dirty data can lead to inaccurate conclusions. 
<br><br>
In this course, you will learn how to identify, diagnose, and treat various data cleaning problems in Python, ranging from simple to advanced. You will deal with improper data types, check that your data is in the correct range, handle missing data, perform record linkage, and more!
<br><br>
<h2>Learn How to Clean Different Data Types</h2>
The first chapter of the course explores common data problems and how you can fix them. You will first learn the basic data types and how to handle each one. Afterward, you'll apply range constraints and remove duplicated data points.
<br><br>
The last chapter explores record linkage, a powerful tool to merge multiple datasets. You'll learn how to link records by calculating the similarity between strings. Finally, you'll use your new skills to join two restaurant review datasets into one clean master dataset.
<br><br>
<h2>Gain Confidence in Cleaning Data</h2>
By the end of the course, you will have the confidence to clean data of various types and use record linkage to merge multiple datasets. Cleaning data is an essential skill for data scientists. If you want to learn more about cleaning data in Python and its applications, check out the following tracks: Data Scientist with Python and Importing & Cleaning Data with Python.

Python Toolbox

Joining Data with pandas

Master cleaning data in Python in this four-hour course. You will learn how to fix common and advanced data problems and how to perform record linkage.

Cleaning Data in Python

Learn to diagnose and treat dirty data and develop the skills needed to transform your raw data into accurate insights! 

Associate Data Scientist in Python

Data Engineer in Python

Importing & Cleaning Data in Python

Cleaning Data in Python

Chapter 1: Common data problems

Chapter 2: Text and categorical data problems

Chapter 3: Advanced data problems

Chapter 4: Record linkage
