
Data source limitations

1. Data source limitations

Unfortunately, no data source is perfect. Let's consider some limitations.

2. Data source and common limitations

Limitations are part of reality; understanding them allows us to work around them or look for alternatives. We'll look at limitations around legal compliance, bias, methodology, and the role of domain knowledge.

3. Legal and access-based limitations

A data source may restrict the use of data for particular types of projects, such as commercial use, or may require additional compliance. The cost of access to the data may also be prohibitive, especially for smaller organizations or projects with limited budgets.

4. Bias in data sources

Bias refers to systematic errors, distorted perceptions, or uneven outcomes that disadvantage certain groups, such as gender bias in tech, racial disparity in healthcare, or socioeconomic bias in education access.

5. Types of bias

We will focus on three types of bias within the scope of data acquisition: historical, selection, and sampling bias. Historical bias originates from data that no longer accurately reflects the current reality. The data may have been collected recently, yet the patterns and outcomes it encodes may no longer be relevant. Selection bias arises from the choice of which data points to include in the dataset. Sampling bias arises from the method by which the sample is drawn from the population, leading to data that does not accurately represent that population.
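One simple way to look for sampling bias is to compare each group's share of the sample with its known share of the population. The sketch below assumes such population shares are available; the age bands, the numbers, and the helper name `representation_gap` are hypothetical and only illustrate the idea.

```python
from collections import Counter


def representation_gap(sample_groups, population_shares):
    """Return (sample share - population share) for each group."""
    counts = Counter(sample_groups)
    total = len(sample_groups)
    return {
        group: counts.get(group, 0) / total - share
        for group, share in population_shares.items()
    }


# Hypothetical population shares and a hypothetical sample of 100 respondents.
population_shares = {"18-34": 0.30, "35-54": 0.40, "55+": 0.30}
sample_groups = ["18-34"] * 60 + ["35-54"] * 30 + ["55+"] * 10

gaps = representation_gap(sample_groups, population_shares)
for group, gap in gaps.items():
    # Positive values mean the group is overrepresented in the sample.
    print(f"{group}: {gap:+.2f}")
```

A large positive or negative gap suggests the sampling method favors or misses that group. How large a gap matters depends on the project.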

6. Bias and origin-based limitations

Data from specific sources often includes historical and selection biases. Such data sources may have restricted coverage, including cultural and geographical constraints, limiting their scope and inclusion. This may lead to generalization problems in modeling: performance on unseen data and fairness metrics can suffer, and significant performance gaps can appear between overrepresented and underrepresented groups.
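A performance gap between groups can be made visible by scoring a model separately for each group. The following is a minimal sketch using plain accuracy; the labels, predictions, group names, and the helper `accuracy_by_group` are all hypothetical.

```python
from collections import defaultdict


def accuracy_by_group(y_true, y_pred, groups):
    """Compute accuracy separately for each group label."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for truth, pred, group in zip(y_true, y_pred, groups):
        total[group] += 1
        correct[group] += int(truth == pred)
    return {group: correct[group] / total[group] for group in total}


# Hypothetical labels and predictions; "urban" is overrepresented.
y_true = [1, 0, 1, 1, 0, 1, 0, 1]
y_pred = [1, 0, 1, 1, 0, 0, 1, 1]
groups = ["urban"] * 5 + ["rural"] * 3

scores = accuracy_by_group(y_true, y_pred, groups)
gap = max(scores.values()) - min(scores.values())
print(scores)
print(f"accuracy gap between groups: {gap:.2f}")
```

Accuracy is used here only for simplicity; the same per-group breakdown works with whichever metric the project relies on.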

7. Bias and methodology-based limitations

Methodology-based constraints may also limit us. The choice of data collection methods and sampling approaches introduces these constraints. For example, in health research, survey data is likely to be collected from people who have volunteered themselves, which may lead to selection bias. Such participants are likely to have a higher health awareness than the general population. Similarly, if the survey is distributed through a health app, the sampling method might introduce bias and skew results towards a more tech-savvy demographic.
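The effect of volunteering can be illustrated with a small simulation. This sketch assumes, purely for illustration, that each person has a health-awareness score between 0 and 1 and that the chance of volunteering equals that score; neither assumption comes from real data.

```python
import random
from statistics import mean

rng = random.Random(42)  # fixed seed so the run is repeatable

# Hypothetical health-awareness scores for the whole population.
population = [rng.random() for _ in range(10_000)]

# Assumption: a person volunteers with probability equal to their score,
# so more health-aware people are more likely to answer the survey.
volunteers = [score for score in population if rng.random() < score]

population_mean = mean(population)
volunteer_mean = mean(volunteers)

print(f"population mean awareness: {population_mean:.2f}")
print(f"volunteer mean awareness:  {volunteer_mean:.2f}")
```

Under these assumptions the volunteers' average awareness comes out noticeably higher than the population's, which is the kind of skew a self-selected survey can carry.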

8. Domain knowledge

Seeing limitations or hidden biases in the data source can be challenging. Domain knowledge is critical in uncovering these subtleties and in spotting mismatches between the values we would expect from knowledge and experience and the data on hand. Such expertise is invaluable, so engage with domain experts from the earliest stages of the project. Mitigate limitations before the modeling phase to prevent the amplification of biases and errors.

9. Urban traffic flow project

Let's go back to the urban traffic flow project and see the data sources' limitations and biases. Recall that we have three sources: historical traffic count data, city council meeting notes, and real-time GPS tracking data.

10. Urban traffic flow project

Historical traffic count data needs to be checked for historical bias: it may reflect past urban layouts, traffic management policies, or population density numbers that are no longer relevant. Sampling bias may be another consideration if traffic counts were only conducted during specific times or days of the week. Meeting notes data may include selection bias. Concerns captured in these minutes might disproportionately represent the views of more vocal or engaged community members, potentially overlooking quieter groups that are not well represented in public consultations.
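A quick coverage check can reveal whether traffic counts were taken only on certain days of the week. The sketch below assumes each count has a known date; the dates and the helper `missing_weekdays` are hypothetical.

```python
from datetime import date

WEEKDAY_NAMES = [
    "Monday", "Tuesday", "Wednesday", "Thursday",
    "Friday", "Saturday", "Sunday",
]


def missing_weekdays(count_dates):
    """Return the names of weekdays on which no count was taken."""
    seen = {d.weekday() for d in count_dates}  # Monday is 0, Sunday is 6
    return [name for i, name in enumerate(WEEKDAY_NAMES) if i not in seen]


# Hypothetical dates on which traffic counts were conducted.
count_dates = [
    date(2024, 3, 4),
    date(2024, 3, 5),
    date(2024, 3, 6),
    date(2024, 3, 11),
]

missing = missing_weekdays(count_dates)
print("weekdays with no counts:", missing)
```

The same idea extends to hours of the day or seasons if the data records timestamps at that level of detail.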

11. Urban traffic flow project

GPS tracking data can have sampling and selection biases. It is collected from public transport and mobile apps and may not represent all commuters, such as those who travel by other means or those who are less tech-adept.

12. Let's practice!

Now, let's do more practice!