Data statistics

1. Data statistics

Hi, and welcome to this course about cleaning data in Java!

2. Meet your instructor!

I'm Dennis Lee, a software engineer in technology and operations, and I'll be your instructor.

3. Why clean data matters

Why is cleaning data important? Imagine running a bookstore where price tags randomly switch or print incorrectly - a $40 book might show as $400. This is dirty data in action, and it leads to lost sales, customer confusion, and unreliable forecasts. In this course, we'll develop systematic checks to keep our data clean.

4. Course outline

We'll start with assessing data quality,

5. Course outline

then transform data to standardize formats,

6. Course outline

validate data to enforce business rules,

7. Course outline

and clean more complex tabular data. In this video, we'll start with statistics to assess our data.

8. Structuring our data

Our journey begins with a book sales dataset. We'll model each entry as a `BookSales` `record`. A record is a special kind of class that automatically generates accessor methods for its fields, like `book.rating()`. These accessors use the field names without the `get` prefix - so `.rating()` instead of `.getRating()`. Our `BookSales` `record` tracks each book's `title`, `publishDate`, `reviewCount`, `rating`, and `price`.
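As a minimal sketch, the record might look like the following. The field types and the sample values are assumptions, since the narration only names the fields.

```java
// Field types are assumptions; only the field names come from the lesson.
record BookSales(String title, String publishDate, int reviewCount,
                 double rating, double price) {}

class RecordDemo {
    // Hypothetical sample book, used only for illustration.
    static BookSales sample() {
        return new BookSales("Java Basics", "2023-01-15", 120, 4.5, 40.00);
    }

    public static void main(String[] args) {
        BookSales book = sample();
        // Accessors are named after the fields, with no "get" prefix.
        System.out.println(book.title());
        System.out.println(book.rating());
        // Records also generate toString(), equals(), and hashCode().
        System.out.println(book);
    }
}
```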

9. Populating the dataset

We populate our sample data using `Arrays.asList()`, creating `BookSales()` objects with `title`, `publishDate`, `reviewCount`, `rating`, and `price`. This gives us a dataset to practice statistical analysis.
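Populating the list could look roughly like this. The three books are hypothetical sample rows, the field types are assumed, and one price is deliberately mistyped as $400.00 so that we have something to catch later.

```java
import java.util.Arrays;
import java.util.List;

// Field types are assumptions; only the field names come from the lesson.
record BookSales(String title, String publishDate, int reviewCount,
                 double rating, double price) {}

class SampleData {
    // Hypothetical rows; the second price simulates a $40 book entered as $400.
    static List<BookSales> books() {
        return Arrays.asList(
            new BookSales("Java Basics", "2023-01-15", 120, 4.5, 40.00),
            new BookSales("Data Cleaning 101", "2022-06-30", 85, 4.2, 400.00),
            new BookSales("Streams in Practice", "2021-11-05", 40, 3.9, 35.50));
    }

    public static void main(String[] args) {
        // Print each book using the record's generated toString().
        books().forEach(System.out::println);
    }
}
```

Keep in mind that `Arrays.asList()` returns a fixed-size list, so we can read from it and replace elements, but not add or remove books.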

10. Calculating mean, min, max

Just as we spot mismarked price tags in a bookstore, the `DescriptiveStatistics` class from Apache Commons Math helps us find problems in our data through methods like `.getMin()` and `.getMax()`. We create a `stats` object and use `.forEach()` to feed it each book's price through `.addValue()`.
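To try the same pattern without adding a dependency, here is a sketch that swaps in the JDK's `DoubleSummaryStatistics` as a stand-in for `DescriptiveStatistics`. It is not the class used in this video: values go in through `.accept()` instead of `.addValue()`, and the mean comes from `.getAverage()` instead of `.getMean()`. The sample books and field types are assumptions.

```java
import java.util.Arrays;
import java.util.DoubleSummaryStatistics;
import java.util.List;

// Field types are assumptions; only the field names come from the lesson.
record BookSales(String title, String publishDate, int reviewCount,
                 double rating, double price) {}

class PriceSummary {
    // Hypothetical rows; one price is deliberately wrong ($400 instead of $40).
    static List<BookSales> sampleBooks() {
        return Arrays.asList(
            new BookSales("Java Basics", "2023-01-15", 120, 4.5, 40.00),
            new BookSales("Data Cleaning 101", "2022-06-30", 85, 4.2, 400.00),
            new BookSales("Streams in Practice", "2021-11-05", 40, 3.9, 35.50));
    }

    // JDK stand-in for DescriptiveStatistics: feed each price in with forEach.
    static DoubleSummaryStatistics summarize(List<BookSales> books) {
        DoubleSummaryStatistics stats = new DoubleSummaryStatistics();
        books.forEach(book -> stats.accept(book.price()));
        return stats;
    }

    public static void main(String[] args) {
        DoubleSummaryStatistics stats = summarize(sampleBooks());
        System.out.println("min  = " + stats.getMin());
        System.out.println("max  = " + stats.getMax());
        System.out.println("mean = " + stats.getAverage());
    }
}
```

The JDK class covers count, min, max, and average only. It has no percentile method, which is one reason to reach for `DescriptiveStatistics`.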

11. Example output: price range

Then we analyze our data and display the results. `System.out.printf()` formats output using placeholders (`%d` for integers, `%.2f` for two decimal places, `%n` for a newline). We use it to display our analysis: book count with `.size()`, average price with `.getMean()`, and price range with `.getMin()`/`.getMax()`. Unusual values like "$400.00" instead of "$40.00" help identify data issues.
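Here is a small sketch of those placeholders in action. It uses `String.format()`, which accepts the same format syntax as `System.out.printf()`, so the result is easy to inspect. The `report` helper, the label wording, and the numbers passed in are all hypothetical, and `Locale.US` is pinned as an assumption so that the decimal separator is always a period.

```java
import java.util.Locale;

class PriceReport {
    // Hypothetical helper: builds the report text from precomputed statistics.
    static String report(int count, double mean, double min, double max) {
        return String.format(Locale.US,
            "Books: %d%nAverage price: $%.2f%nPrice range: $%.2f - $%.2f%n",
            count, mean, min, max);
    }

    public static void main(String[] args) {
        // Illustrative values only; in practice these come from the stats object.
        System.out.print(report(3, 158.5, 35.5, 400.0));
    }
}
```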

12. Calculating percentiles

To find unusual prices, we use percentiles. Using `.getPercentile(50)` gives us the median, or the middle price when all books are sorted by price. `.getPercentile(25)` and `.getPercentile(75)` bound the middle half of our prices: roughly a quarter of the books cost less than the 25th percentile, and a quarter cost more than the 75th. These statistics become useful in larger datasets, where we can flag values that fall far outside these bounds for further inspection.
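To build some intuition for what a percentile is, here is a hand-written sketch. It is not the library's implementation: it assumes one common convention (position `p * (n + 1) / 100` in the sorted data, with linear interpolation between neighbors), and `.getPercentile()` may return slightly different values depending on how it is configured. The `percentile` helper and the prices are hypothetical.

```java
import java.util.Arrays;

class PercentileSketch {
    // Hand-rolled percentile under an assumed convention; for illustration only.
    static double percentile(double[] values, double p) {
        if (values.length == 0 || p <= 0 || p > 100) {
            throw new IllegalArgumentException("need data and 0 < p <= 100");
        }
        double[] sorted = values.clone();   // leave the caller's array untouched
        Arrays.sort(sorted);
        int n = sorted.length;
        double pos = p * (n + 1) / 100.0;   // 1-based position in the sorted data
        if (pos < 1) return sorted[0];
        if (pos >= n) return sorted[n - 1];
        int lower = (int) Math.floor(pos) - 1;      // 0-based index below pos
        double fraction = pos - Math.floor(pos);    // how far toward the next value
        return sorted[lower] + fraction * (sorted[lower + 1] - sorted[lower]);
    }

    public static void main(String[] args) {
        // Hypothetical prices, including one suspicious $400 entry.
        double[] prices = {40, 400, 25, 50, 20, 45, 30};
        System.out.println("25th percentile: " + percentile(prices, 25));
        System.out.println("median:          " + percentile(prices, 50));
        System.out.println("75th percentile: " + percentile(prices, 75));
    }
}
```

Notice that the $400 entry barely moves the median, while it would pull the mean up sharply. That makes percentiles a sturdier reference point for spotting odd values.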

13. Statistics as quality control

We learned to use basic statistics to find data quality issues in our book sales dataset. We started with a `BookSales` record to structure our data, then used `DescriptiveStatistics` methods like `.getMean()`, `.getMin()`, `.getMax()`, and `.getPercentile()` to analyze prices. By comparing values against typical ranges and using simple bounds checks, we can quickly identify data that may need cleaning. These techniques scale well from our simple bookstore example to large, complex datasets where manual inspection isn't feasible.
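A simple bounds check can be sketched like this. The `outsideRange` helper, the sample books, the field types, and the $5 to $200 range are all hypothetical. In practice we would pick the bounds from statistics such as the percentiles, or from what the business considers a plausible price.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

// Field types are assumptions; only the field names come from the lesson.
record BookSales(String title, String publishDate, int reviewCount,
                 double rating, double price) {}

class PriceBoundsCheck {
    // Hypothetical rows; one price is deliberately wrong ($400 instead of $40).
    static List<BookSales> sampleBooks() {
        return Arrays.asList(
            new BookSales("Java Basics", "2023-01-15", 120, 4.5, 40.00),
            new BookSales("Data Cleaning 101", "2022-06-30", 85, 4.2, 400.00),
            new BookSales("Streams in Practice", "2021-11-05", 40, 3.9, 35.50));
    }

    // Returns the books whose price falls outside [low, high] for manual review.
    static List<BookSales> outsideRange(List<BookSales> books, double low, double high) {
        return books.stream()
            .filter(book -> book.price() < low || book.price() > high)
            .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // The 5.00 and 200.00 bounds are illustrative, not derived from real data.
        outsideRange(sampleBooks(), 5.00, 200.00)
            .forEach(book -> System.out.println("Check price: " + book));
    }
}
```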

14. Let's practice!

Now it's your turn to compute statistics on a dataset!
