Data sources

1. Data sources

Data fuels AI, but the source of that fuel can have a significant impact.

2. Coming up...

We've discussed the legal aspects of responsible data management. It's now time to dive into data sources. We'll explore different data source types in this video before diving into limitations, selection, and integrating multiple data sources in later videos.

3. Why data source is important

Think about data as an ingredient for your favorite recipe. Just as the source of the ingredients affects the outcome, so does the source of data. The source of data determines its integrity, diversity, and fair representation. So, let's make sure we use data from the right source!

4. Types by origin

Data sources can be classified by their origin: primary or secondary. Primary sources are data collected within the project, like surveys or trials. Of course, we already took care of compliance and consent! Secondary sources refer to the data acquired from existing resources, like public datasets, open data providers, and third parties. Here, we would consider licensing agreements.

5. Types by nature

We can also classify data sources by nature: quantitative, qualitative, or mixed. Quantitative sources produce only numeric data, such as a satisfaction score. In contrast, qualitative sources produce non-numeric data, such as a written review. A mixed source combines the two, as in customer feedback surveys. The nature of the data source critically affects the project workflow, including preprocessing, modeling, and further deployment.

6. Types by temporality

Finally, we can classify data sources by temporality. If data does not change over time, it is considered static. If data updates in real time, it is dynamic. Static data examples include census data and corporate addresses. For dynamic data, consider social media streams, API sources, financial market feeds, or sensor data. Dynamic data requires a more proactive approach to monitoring, mitigating bias, and ensuring fairness, given the constant stream of information. This is not an exhaustive list of data sources; as the field evolves, more sources may become available or change, so it's important to keep up with new developments.

7. Diversity and fairness in data sources

Understanding how the data source is classified allows us to evaluate the data's quality, bias, and fairness. Primary sources could reflect the data collectors' direct biases, while secondary sources could carry inherited biases from their original context. Quantitative data's numerical nature allows for precise, measurable bias checks, whereas qualitative data requires more nuanced analysis. Static data may not accurately represent current realities, carrying outdated biases, while dynamic sources continually evolve, possibly introducing real-time biases. A project can also use synthetic (generated) data, which is created artificially.

8. Urban traffic flow project

Let's have a look at data sources in action. Here is an urban traffic flow project. We develop an app to predict traffic in an urban area and optimize traffic management to reduce congestion during peak hours. Our data sources are historical traffic data, city council meeting notes, and real-time GPS tracking from mobile navigation apps and public transport.

9. Historical traffic data

The historical traffic data comes from the city's transportation department and covers the last five years. It includes vehicle counts from intersections in the town, along with the time of day and day of the week. This is a primary, static, quantitative source.

10. Council meeting minutes

The council meeting notes are public records available on the council website and contain urban planning and traffic management summaries. This is a qualitative secondary source.

11. GPS data

GPS data is a primary dynamic source that provides immediate insights into current traffic conditions, speeds, and delays.
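
To make the three classification axes concrete, here is a minimal Python sketch that tags the project's sources by origin, nature, and temporality. The class and field names (DataSource, Origin, Nature, Temporality, dynamic_sources) are hypothetical and chosen only for illustration. Where the video does not state a label for a source, the field is left as None rather than guessed.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Optional


class Origin(Enum):
    PRIMARY = "primary"      # collected within the project
    SECONDARY = "secondary"  # acquired from existing resources


class Nature(Enum):
    QUANTITATIVE = "quantitative"
    QUALITATIVE = "qualitative"
    MIXED = "mixed"


class Temporality(Enum):
    STATIC = "static"    # does not change over time
    DYNAMIC = "dynamic"  # updates in real time


@dataclass(frozen=True)
class DataSource:
    """Hypothetical record describing one data source on the three axes."""
    name: str
    origin: Optional[Origin] = None
    nature: Optional[Nature] = None
    temporality: Optional[Temporality] = None

    def describe(self) -> str:
        # Skip any axis that was left unlabeled (None).
        axes = (self.origin, self.temporality, self.nature)
        labels = [axis.value for axis in axes if axis is not None]
        return f"{self.name}: {', '.join(labels)} source"


# Labels follow the video; None means the video did not state that axis.
SOURCES = [
    DataSource("Historical traffic data",
               Origin.PRIMARY, Nature.QUANTITATIVE, Temporality.STATIC),
    DataSource("Council meeting notes",
               Origin.SECONDARY, Nature.QUALITATIVE),
    DataSource("GPS data",
               Origin.PRIMARY, temporality=Temporality.DYNAMIC),
]


def dynamic_sources(sources: List[DataSource]) -> List[str]:
    """Names of the sources that would need ongoing monitoring."""
    return [s.name for s in sources if s.temporality is Temporality.DYNAMIC]


if __name__ == "__main__":
    for source in SOURCES:
        print(source.describe())
    print("Needs ongoing monitoring:", dynamic_sources(SOURCES))
```

A catalog like this is one possible way to keep track of which sources are dynamic and therefore need the more proactive monitoring discussed earlier.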

12. Let's practice!

Great! We've identified our ingredients and where they come from; we'll review their limitations next. But first, let's practice.