Data validation
1. Data validation
Data validation is an important early step in EDA. We want to understand whether data types and ranges are as expected before we progress too far in our analysis! Let's dive in.2. Validating data types
We learned in the last lesson that dot-info gives a quick overview of data types included in a dataset along with other information such as the number of non-missing values. We can also use the DataFrame dot-dtypes attribute if we're only interested in data types. But what if we aren't happy with these data types? Here, the year column in the books DataFrame is stored as a float, which doesn't make sense for year data, which should always be a whole number.3. Updating data types
Luckily, the dot-astype function allows us to change data types without too much effort. Here, we redefine the year column by selecting the column and calling the dot-astype method, indicating we'd like to change the column to an integer. Then we use the dot-dtypes attribute to check that the year column data is now stored as integers - and it is!4. Updating data types
Common programming data types as well as their Python names are listed here. It's the Python name that we pass to the astype function, as we did with int on the previous slide.5. Validating categorical data
We can validate categorical data by comparing values in a column to a list of expected values using dot-isin, which can either be applied to a Series as we'll show here or to an entire DataFrame. Let's check whether the values in the genre column are limited to "Fiction" and "Non Fiction" by passing these genres as a list of strings to dot-isin. The function returns a Series of the same size and shape as the original but with True and False in place of all values, depending on whether the value from the original Series was included in the list passed to dot-isin. We can see that some values are False.6. Validating categorical data
We can also use the tilde operator at the beginning of the code block to invert the True/ False values so that the function returns True if the value is NOT in the list passed to dot-isin.7. Validating categorical data
And if we're interested in filtering the DataFrame for only values that are in our list, we can use the isin code we just wrote to filter using Boolean indexing!8. Validating numerical data
Let's now validate numerical data. We can select and view only the numerical columns in a DataFrame by calling the select_dtypes method and passing "number" as the argument.9. Validating numerical data
Perhaps we'd like to know the range of years in which the books in our dataset were published. We can check the lowest and highest years by using the dot-min and dot-max functions, respectively. And we can view a more detailed picture of the distribution of year data using Seaborn's boxplot function. The boxplot shows the boundaries of each quartile of year data: as we saw using min and max, the lowest year is 2009 and the highest year is 2019. The 25th and 75th percentiles are 2010 and 2016 and the median year is 2013.10. Validating numerical data
We can also view the year data grouped by a categorical variable such as genre by setting the y keyword argument. It looks like the children's books in our dataset have slightly later publishing years in general, but the range of years is the same for all genres.11. Let's practice!
Now it's your turn to validate the unemployment data!Create Your Free Account
or
By continuing, you accept our Terms of Use, our Privacy Policy and that your data is stored in the USA.