
Data quality assessment with Tablesaw

1. Data quality assessment with Tablesaw

Let's look into data quality for tabular datasets.

2. Book sales dataset

Data cleaning starts with quality assessment. Using Tablesaw, a Java library for data analysis, we'll examine our best-selling books dataset for common issues like missing values, incorrect types, and unusual patterns. Understanding these issues helps us choose the right cleaning strategies. Let's look at our data and start assessing its quality.

3. Examining the data

When analyzing data quality, our first step is understanding the structure of our dataset. Tablesaw provides tools to load and examine tabular data. The `Table.read().csv()` method loads our bestselling books data, and basic methods like `.rowCount()` and `.columnNames()` give us an overview of what we're working with.
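A minimal sketch of this step, assuming the dataset is saved as "bestsellers.csv" (the file name is illustrative):

```java
import java.io.IOException;

import tech.tablesaw.api.Table;

public class ExamineBooks {
    public static void main(String[] args) throws IOException {
        // Load the best-selling books dataset (file name is an assumption)
        Table books = Table.read().csv("bestsellers.csv");

        // Basic structure checks
        System.out.println("Rows: " + books.rowCount());
        System.out.println("Columns: " + books.columnNames());
    }
}
```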

4. Examining the data: outputs

The outputs show us the row and column counts and the column names.

5. Checking for missing values

Missing values can significantly impact our analysis. We loop over the column names using `books.columnNames()` and retrieve each column by name with `books.column()`. The `.countMissing()` method counts missing data, or nulls, in the column. Here we see our bestsellers dataset is mostly complete, except for two missing values in the "Sales_in_millions" column. This output alerts us that we need to investigate these missing values. Next, let's examine our columns in more detail.
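A sketch of this check, continuing with the `books` table loaded above:

```java
import tech.tablesaw.columns.Column;

// Count missing values (nulls) in every column
for (String columnName : books.columnNames()) {
    Column<?> column = books.column(columnName);
    System.out.println(columnName + ": " + column.countMissing() + " missing");
}
```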

6. Examining categorical columns

For categorical columns like language, we can check value distributions. The `.countBy()` method shows how many books we have in each language, helping us identify potential data entry inconsistencies if any of the counts are unexpected.
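A sketch of this step, assuming the language column is named "Language":

```java
import tech.tablesaw.api.Table;

// Count how many books appear in each language
Table languageCounts = books.countBy("Language");
System.out.println(languageCounts);
```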

7. Examining categorical columns: outputs

The output shows that most of the books are in English and allows us to examine the distribution of other languages.

8. Analyzing numeric columns

Understanding the distribution of numeric values helps identify outliers. `books.doubleColumn()` retrieves a numeric column such as "Sales_in_millions" and provides statistical methods for understanding our data's distribution. Extreme values in `.min()` or `.max()` might indicate data entry errors, while `.mean()` and `.standardDeviation()` help identify unusual patterns.
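A sketch of these summary statistics, using the "Sales_in_millions" column mentioned above:

```java
import tech.tablesaw.api.DoubleColumn;

// Summary statistics for the numeric sales column
DoubleColumn sales = books.doubleColumn("Sales_in_millions");
System.out.println("Min:     " + sales.min());
System.out.println("Max:     " + sales.max());
System.out.println("Mean:    " + sales.mean());
System.out.println("Std dev: " + sales.standardDeviation());
```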

9. Analyzing numeric columns: outputs

We can observe the variations in book sales from the outputs. For example, the book with the smallest sales sold 10 million copies, while the book with the largest sales sold 600 million copies.

10. Putting it all together

We've learned how to check data quality using Tablesaw. Starting with data loading and structure checks like `.rowCount()` and `.columnCount()`, we can examine missing values with `.countMissing()`, analyze categorical distributions with `.countBy()`, and calculate numeric statistics like `.min()`, `.max()`, and `.mean()` to identify potential issues before analysis.
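A consolidated sketch of the full assessment, reusing the illustrative file and column names from the sketches above:

```java
import java.io.IOException;

import tech.tablesaw.api.DoubleColumn;
import tech.tablesaw.api.Table;
import tech.tablesaw.columns.Column;

public class BookQualityCheck {
    public static void main(String[] args) throws IOException {
        // Load the dataset and inspect its structure
        Table books = Table.read().csv("bestsellers.csv");
        System.out.println(books.rowCount() + " rows, " + books.columnCount() + " columns");

        // Missing values per column
        for (String name : books.columnNames()) {
            Column<?> column = books.column(name);
            System.out.println(name + ": " + column.countMissing() + " missing");
        }

        // Categorical distribution and numeric summary
        System.out.println(books.countBy("Language"));
        DoubleColumn sales = books.doubleColumn("Sales_in_millions");
        System.out.println("Sales (millions) - min: " + sales.min()
                + ", max: " + sales.max() + ", mean: " + sales.mean());
    }
}
```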

11. Putting it all together: outputs

The outputs show the results of `.rowCount()` and `.columnCount()`, `.countMissing()`, `.countBy()`, and the summary statistics `.min()`, `.max()`, and `.mean()`.

12. Let's practice!

Now you can practice assessing data quality with Tablesaw!
