
Databases and quality checks

1. Databases and quality checks

Welcome to the discussion on databases and quality checks.

2. Triaging data sources

Not all data sources are equally useful. Some require substantial cleaning, analysis, and transformations, while others lack useful features. Data triaging is about quickly figuring out whether a dataset is worthwhile by looking at a few factors. First, availability. Is the data easy to access or locked behind a paywall? Do you need special permissions, or are there any security concerns? Figuring out these logistics upfront can save you from chasing dead ends.

3. Key checks

Next, consider costs. Data isn't always cheap. Are there fees, and will you need to pay for storage? What about software or tools to process it? Will you need to invest in those? Then there's utility. Ask yourself: does this data have what you need? Think about its scope, the level of detail it offers, how complete it is, and whether its features are relevant to your problem. Another consideration is the frequency of updates. If you're working on a real-time prediction problem, infrequently updated data won't cut it, so make sure the update frequency matches your needs. Finally, consider geographic resolution. If you're studying trends at the zip code level, broad, nationwide data might not help. The granularity of the data needs to fit the scope of your project. By weighing these factors, you can quickly decide whether a dataset is worth pursuing or should be set aside.
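As a lightweight illustration of the update-frequency check, here is a minimal sketch in pandas. It assumes a hypothetical `updated_at` timestamp column and a made-up required refresh interval; both are assumptions for illustration, not part of the lesson's dataset.

```python
import pandas as pd

# Hypothetical candidate data source with a timestamp for each refresh.
df = pd.DataFrame({
    "updated_at": pd.to_datetime([
        "2024-01-01", "2024-01-08", "2024-01-15", "2024-01-29"
    ])
})

# Median gap between consecutive updates estimates how often the source refreshes.
update_gaps = df["updated_at"].sort_values().diff().dropna()
median_gap = update_gaps.median()

# Compare against the refresh rate your project actually needs (assumed: weekly).
required_gap = pd.Timedelta(days=7)
print(f"Median update gap: {median_gap}")
print("Frequent enough?", median_gap <= required_gap)
```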

4. Data quality checks

Data quality is critical. "Garbage in, garbage out" is more than a catchphrase; it is a truism. Bad data can derail models, leading to laughably incorrect predictions and costly business decisions. When assessing data quality, here are some key areas to keep an eye on. Missingness count: missing values are like holes in your dataset that can trip up your model. Start by calculating the percentage of missing values in each column and checking for patterns. Extensive missingness might signal data collection issues, or suggest that filling in the gaps isn't worth the effort. Range checks: does the data make sense? For example, is it reasonable for someone's age to be 450, or for a product to have a price of -$10? Simple minimum and maximum checks can flag these errors before they sneak into your models and wreak havoc.
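Here is a minimal sketch of both checks in pandas, assuming a small hypothetical table with `age` and `price` columns (the column names and plausible bounds are assumptions for illustration).

```python
import pandas as pd

# Hypothetical data containing a few deliberate quality problems.
df = pd.DataFrame({
    "age": [34, 45, 450, None, 29],
    "price": [19.99, -10.0, 5.50, 12.00, None],
})

# Missingness count: percentage of missing values per column.
missing_pct = df.isna().mean() * 100
print(missing_pct)

# Range checks: flag values outside plausible minimums and maximums.
bad_age = df[(df["age"] < 0) | (df["age"] > 120)]
bad_price = df[df["price"] < 0]
print(bad_age)
print(bad_price)
```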

5. More checks

Outlier review: outliers are data points that sit far outside the norm. They can be genuine insights or outright errors. Use visualizations like histograms or boxplots to spot them, then discuss these anomalies with business stakeholders to decide whether they're valid or need to be excluded. Timeliness: how current is your data? Stale data can lead to outdated conclusions and ineffective models. For fast-changing situations, ensure your data refresh rate keeps up with the pace of change. Formatting consistency: are dates formatted uniformly? Are text fields full of typos? Inconsistent formatting can create a nightmare during analysis, so make sure your data is clean and standardized. By addressing these points, you'll set a strong foundation for building reliable and accurate models. Once you have triaged the data source and performed the appropriate quality checks, you are ready to decide which data sources will be most useful for the project and which are unsuitable or would require a prohibitive amount of effort to clean.
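To make the outlier and formatting checks concrete, here is a short sketch in pandas. It uses the IQR rule as one common way to flag outlier candidates (the lesson suggests visual checks like boxplots; this is a numeric alternative), and counts dates that fail to parse under an expected format. The `orders` table, its values, and the expected format are all assumptions for illustration.

```python
import pandas as pd

# Hypothetical order data with one suspicious amount and two oddly formatted dates.
orders = pd.DataFrame({
    "amount": [25, 30, 28, 27, 2900, 31, 26],
    "order_date": ["2024-03-01", "03/02/2024", "2024-03-03", "2024-03-04",
                   "2024-03-05", "2024/03/06", "2024-03-07"],
})

# Outlier review: flag values far outside the interquartile range,
# then discuss the flagged rows with stakeholders before excluding anything.
q1, q3 = orders["amount"].quantile([0.25, 0.75])
iqr = q3 - q1
outliers = orders[(orders["amount"] < q1 - 1.5 * iqr) |
                  (orders["amount"] > q3 + 1.5 * iqr)]
print(outliers)

# Formatting consistency: dates that don't match the expected format
# become NaT when parsing is coerced, so they are easy to count.
parsed = pd.to_datetime(orders["order_date"], format="%Y-%m-%d", errors="coerce")
print("Inconsistent date formats:", parsed.isna().sum())
```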

6. Let's practice!

Let's put these ideas into practice!