Getting the right data

1. Getting the right data

Having developed the right culture and onboarded teams with the required skill sets, our next step is to ask the crucial question on which the success of every AI project hinges: what data is needed to drive the business outcome?

2. Data availability

Determining data availability is paramount for assessing whether the model can capture the underlying patterns needed to achieve the desired AI results. In the era of big data, it is often assumed that more data implies better model outcomes. However, the sheer quantity of data is not the sole driver of superior results. Beyond starting from the correct data and its relevance to the business problem, its labeling, quality, and timeliness play a significant role, giving rise to the discipline of data-centric AI. Andrew Ng describes data-centric AI as systematically engineering the data to build a successful AI system.

3. Data relevance

The richness and relevance of the patterns in the data make a big difference. Assessing relevance means identifying whether the data directly relates to the problem statement. Irrelevant data can mislead the model, damaging its learning process and driving down its accuracy. Take a credit scoring model, for example, which requires relevant attributes such as the customer's transaction history and asset profile.
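
As a minimal sketch of this idea, the snippet below keeps only the attributes assumed to be relevant for credit scoring; the file name and column names are hypothetical.

    import pandas as pd

    # Hypothetical loan-application dataset; column names are illustrative only.
    applications = pd.read_csv("loan_applications.csv")

    # Keep attributes assumed to relate to creditworthiness and drop everything else.
    relevant_columns = ["transaction_history_score", "asset_value", "income", "loan_amount"]
    features = applications[relevant_columns]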

4. Time relevancy

Relevance is also associated with time; for example, supply chain data collected before the pandemic may no longer accurately reflect current dynamics and trends.
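
One simple way to enforce time relevance, sketched below under the assumption of an order_date column and an arbitrary cutoff, is to filter the data to a recent window.

    import pandas as pd

    shipments = pd.read_csv("shipments.csv", parse_dates=["order_date"])

    # Keep only records after an assumed cutoff so training data reflects current dynamics.
    cutoff = pd.Timestamp("2022-01-01")
    recent_shipments = shipments[shipments["order_date"] >= cutoff]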

5. Data privacy

The availability of relevant data is a good start. But if it contains sensitive user data, we must adhere to data privacy standards such as the General Data Protection Regulation, or GDPR, to ensure ethical AI practices and build user trust.
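
As one illustrative step, sensitive identifiers can be pseudonymized or dropped before the data reaches the modeling team; the column names below are hypothetical, and pseudonymization alone does not guarantee GDPR compliance.

    import hashlib

    import pandas as pd

    customers = pd.read_csv("customers.csv")

    # Pseudonymize a direct identifier with a one-way hash.
    customers["customer_id"] = customers["customer_id"].astype(str).apply(
        lambda value: hashlib.sha256(value.encode()).hexdigest()
    )

    # Drop personal fields that are not needed for modeling at all.
    customers = customers.drop(columns=["email", "phone_number"])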

6. Data dictionary

Building a data dictionary early in the project is crucial, as it helps us understand the meaning of the different data fields and their significance in pattern recognition. Domain experts help link the impact of such attributes on business decisions. Data fields that do not add value to model predictions or carry redundant information are unnecessary in the training data.
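
A data dictionary can start as something as simple as the sketch below; the field names and descriptions are placeholders.

    # A minimal data dictionary; fields and descriptions are illustrative only.
    data_dictionary = {
        "transaction_history_score": "Summary score of past payment behavior (0-100).",
        "asset_value": "Total declared assets in USD at application time.",
        "income": "Self-reported annual income in USD.",
        "loan_amount": "Requested loan amount in USD.",
    }

    for field, description in data_dictionary.items():
        print(f"{field}: {description}")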

7. Data sampling

Working with a lot of data can be costly in terms of both time and budget. Hence, a smaller subset of the full data is selected that still reasonably represents the original business problem. Using appropriate sampling techniques, a sampled dataset can produce comparable results far more economically.
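
A minimal sketch of random sampling with pandas follows; the 10% fraction and the file name are assumptions.

    import pandas as pd

    transactions = pd.read_csv("transactions.csv")

    # Draw a 10% random sample; a fixed random_state keeps the sample reproducible.
    sampled = transactions.sample(frac=0.10, random_state=42)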

8. Data augmentation

On the other hand, if sufficient data is not readily available, one option is to wait and gradually collect data over time, which could lead to a loss of opportunity. Another approach is data augmentation, which can be applied to various data types. It involves artificially creating new data records by making minor modifications to existing datasets.
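
For tabular data, one simple augmentation, sketched below under the assumption of numeric feature columns, is to create new rows by adding small random noise to existing records; images and text typically call for their own augmentation techniques.

    import numpy as np
    import pandas as pd

    rng = np.random.default_rng(seed=42)
    features = pd.read_csv("features.csv")

    # Create synthetic rows by adding small Gaussian noise (1% of each column's
    # standard deviation) to the numeric columns of the existing records.
    numeric_cols = features.select_dtypes(include="number").columns
    noise = rng.normal(
        loc=0.0,
        scale=0.01 * features[numeric_cols].std().to_numpy(),
        size=features[numeric_cols].shape,
    )
    augmented = features.copy()
    augmented[numeric_cols] = features[numeric_cols] + noise

    # Combine the original and synthetic records into a larger training set.
    features_augmented = pd.concat([features, augmented], ignore_index=True)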

9. Data diversity

Whether we rely on data sampling or augmentation, care must be taken to ensure diversity within the data. Diverse datasets lead to more reliable and accurate models. For instance, ensuring that loan applicants from different age groups and ethnicities are included in the training data is essential to build a comprehensive model.
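
One way to preserve diversity while sampling is to stratify by a group attribute; the age_group column below is an assumption.

    import pandas as pd

    applicants = pd.read_csv("loan_applicants.csv")

    # Stratified sample: draw 10% from each age group so every group stays represented.
    diverse_sample = (
        applicants
        .groupby("age_group", group_keys=False)
        .sample(frac=0.10, random_state=42)
    )

    # Quick check that the group proportions are preserved.
    print(diverse_sample["age_group"].value_counts(normalize=True))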

10. Data quality

Data quality involves assessing the data on multiple aspects. Is the data complete and comprehensive enough to describe the modeling problem? Is it accurate and consistent? For example, a loan applicant's age can be cross-checked against their date of birth. Is any data missing? Is the data labeled correctly?
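
A few of these checks can be expressed as a quick sketch; the column names are assumptions, and the age consistency rule is deliberately simple.

    import pandas as pd

    applicants = pd.read_csv(
        "loan_applicants.csv", parse_dates=["date_of_birth", "application_date"]
    )

    # Completeness: count missing values per column.
    print(applicants.isna().sum())

    # Consistency: recompute age from the date of birth and flag mismatches
    # with the stored age column (allowing one year of rounding slack).
    computed_age = (
        (applicants["application_date"] - applicants["date_of_birth"]).dt.days // 365
    )
    inconsistent = applicants[(computed_age - applicants["age"]).abs() > 1]
    print(f"{len(inconsistent)} rows with an inconsistent age")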

11. Let's practice!

Great, we have learned how to identify the relevant data to build an AI model. It is time to practice.