Detecting inconsistent data
1. Detecting inconsistent data
To conclude this chapter, we will focus our attention on data records that contain inconsistent values due to violations of business rules.2. Inconsistent data
For examples of inconsistent data, let's return to our restaurant inspection dataset. Several rules determine restaurant inspection results. While the score column was discussed previously, the details behind the observed values follow a set of scoring rules. The inspection score is an aggregate value representing the number of food safety violations encountered during an inspection. Each violation increases the restaurant's score. Therefore, lower scores are better in this context. Only certain inspection types receive inspection scores. In addition to scores, the city requires restaurants to display a letter grade corresponding to the restaurant's inspection score. An A represents a score between 0 and 13. A B corresponds to a score between 14 and 27. Any score above 28 is a C. Scores are only converted to grades in two scenarios: initial inspections resulting in an A and re-inspections that result in an A, B, or C grade.3. Checking rules with SQL
These inspection evaluation rules are somewhat complicated. Due to column interdependencies, enforcing these rules using database constraints is difficult. The rules explained previously involve 3 different columns in the restaurant inspection table. Data violating these business rules can easily enter this table. However, we have the tools at our disposal to encode these rules in SQL queries. These queries can identify records that violate these rules. We can then clean these violating records. As an example, an A grade should only be awarded when the inspection score is between 0 and 13. This query identifies records that do not satisfy this rule. However, in this sample data, no records violate this rule.4. Checking rules with SQL
Recall that scores between 14 and 27 receive a B when re-inspections are performed. The query shown here identifies inspections assigned a B grade with a score outside of the expected range. This query identifies a record that violates this rule. After identifying this inspection, an investigation of the cause of the violation can commence. Identifying inconsistent data is not limited to numerical ranges. Another inspection rule states that when a restaurant is closed by the New York City Health Department, an inspection is required to re-open the restaurant. Such inspections do not receive letter grades.5. Checking rules with SQL
The query shown here can identify re-opening inspections that have received a letter grade. This query identifies several records that represent such inconsistencies in the data. Evaluating whether or not data is consistently adhering to business rules can be performed by translating the rules into WHERE clause conditions as we have seen in the examples presented in this lesson.6. Data cleaning insights
As you can see from the content in this chapter, data cleaning comes in many forms. Approaches for handling missing data are quite different from approaches for handling inconsistent data. While the data cleaning techniques covered in this chapter will prove useful in your data cleaning efforts, you have to carefully consider in which scenario you find yourself when exploring your data and then choose an approach that makes sense in that scenario. Having sufficient domain knowledge to understand which data needs to be cleaned and how to approach the cleaning should be the first step in the process before making any changes to the original data. Understanding valid ranges of values for your data, when duplicate data needs to be addressed, and how to fill in missing data are all important.7. Let's practice!
Before we move on to the next chapter, its your turn to identify inconsistent data in the following exercises.Create Your Free Account
or
By continuing, you accept our Terms of Use, our Privacy Policy and that your data is stored in the USA.