Data source selection

1. Data source selection

Let's talk about data source selection.

2. Why select?

We'll discuss why data source selection is important and walk through the steps for evaluating sources, integrating the responsible data dimensions into that evaluation. With so many data sources available, selecting and evaluating them carefully helps ensure data quality, legal compliance, and fairness.

3. Step 1. Project relevance

Let's look at the steps to evaluate a data source that will help us make our selection. We start with relevance to the project objectives: data should help us get the answers we seek. We check for alignment with the subject area, scope, and anticipated outcomes. By prioritizing relevance, we avoid spending time and budget on sources that cannot answer our questions.

4. Step 2. Data source integrity

Once we have a set of relevant sources, we can assess the integrity and trustworthiness of the data provider and the data collection methods. We can do this by looking for reviews, testimonials, and transparency in the data collection process, and by checking the documentation, compliance, and licensing. High-quality data providers also update their data regularly.

5. Step 3. Legal compliance

Next, we focus on the lawfulness of the data and legal compliance in our project. With legal counsel, we identify the applicable laws and look for any restrictions. We also evaluate the anonymization and de-identification of data and assess any data security requirements. To ensure lawfulness, everything is approved by a legal team.
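One way to make the evaluation of de-identification concrete is a k-anonymity check, which reports the size of the smallest group of records sharing the same combination of quasi-identifiers. The sketch below is only an illustration: the field names `age_band` and `postcode_area` and the sample records are hypothetical, and a check like this supports, rather than replaces, review by the legal team.

```python
from collections import Counter


def smallest_group_size(records, quasi_identifiers):
    """Return k: the size of the smallest group of records that share
    the same combination of quasi-identifier values."""
    groups = Counter(
        tuple(record[field] for field in quasi_identifiers)
        for record in records
    )
    return min(groups.values())


# Hypothetical commuter records with direct identifiers already removed.
records = [
    {"age_band": "30-39", "postcode_area": "N1", "travel_minutes": 35},
    {"age_band": "30-39", "postcode_area": "N1", "travel_minutes": 42},
    {"age_band": "40-49", "postcode_area": "N1", "travel_minutes": 28},
    {"age_band": "40-49", "postcode_area": "N1", "travel_minutes": 31},
    {"age_band": "50-59", "postcode_area": "E2", "travel_minutes": 55},
]

k = smallest_group_size(records, ["age_band", "postcode_area"])
# k == 1 would mean at least one record is unique on these fields.
print(f"k = {k}")
```

A small k suggests the data may need coarser categories or suppression before use; the acceptable threshold is a project and legal decision, not something the code can decide.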

6. Step 4. Technical quality

Next up, we look at the technical quality of the data to check its structural integrity and usability. We ensure the data is complete, consistent, accurate, and timely. Identifying and correcting technical issues early in the project can prevent errors and inefficiencies in later stages.
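As a rough illustration of these checks, the sketch below computes simple completeness, duplication, and timeliness figures for a handful of records. The field names (`sensor_id`, `recorded_on`, `vehicle_count`), the sample values, and the 90-day freshness threshold are all assumptions made for the example; accuracy is not covered because it usually requires comparison against a trusted reference.

```python
from datetime import date


def quality_report(records, required_fields, today, max_age_days):
    """Summarize completeness, duplication, and timeliness of records."""
    total = len(records)
    # Completeness: share of records with every required field filled in.
    complete = sum(
        all(record.get(field) not in (None, "") for field in required_fields)
        for record in records
    )
    # Consistency (one narrow aspect): exact duplicates on the required fields.
    unique_rows = {
        tuple(record.get(field) for field in required_fields)
        for record in records
    }
    # Timeliness: number of records older than the allowed age.
    stale = sum(
        (today - record["recorded_on"]).days > max_age_days
        for record in records
    )
    return {
        "completeness": complete / total,
        "duplicates": total - len(unique_rows),
        "stale": stale,
    }


# Hypothetical traffic count records.
records = [
    {"sensor_id": "A1", "recorded_on": date(2024, 5, 1), "vehicle_count": 120},
    {"sensor_id": "A1", "recorded_on": date(2024, 5, 1), "vehicle_count": 120},
    {"sensor_id": "B2", "recorded_on": date(2024, 1, 10), "vehicle_count": None},
    {"sensor_id": "C3", "recorded_on": date(2024, 5, 20), "vehicle_count": 95},
]

report = quality_report(
    records,
    required_fields=["sensor_id", "recorded_on", "vehicle_count"],
    today=date(2024, 6, 1),
    max_age_days=90,
)
print(report)
```

Running a report like this early gives a baseline to compare against after cleaning.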

7. Step 5. Bias and representativeness

By checking for bias and representativeness, we ensure that the data accurately reflects the diversity of the population, so that results generalize across different groups and contexts. We analyze demographic representation by comparing the data against the target population, paying attention to protected characteristics like gender and race. We can test whether the distribution of demographic groups differs from the target population, for example with a chi-squared test, to identify biases and representation issues. We use fairness metrics, such as demographic parity, equal opportunity, or equalized odds, to measure disparities in the data. Where we find limitations, we consider augmenting the data with additional data sources and consider the implications for modeling. We will talk more about that later.
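To make the chi-squared comparison and the demographic parity metric concrete, here is a minimal sketch using only the standard library. The group names, counts, population shares, and outcome labels are invented for illustration. The chi-squared statistic is compared against the 5% critical value for 2 degrees of freedom (5.991) instead of computing a p-value.

```python
from collections import defaultdict


def chi_squared_statistic(observed_counts, population_shares):
    """Goodness-of-fit statistic: sum of (observed - expected)^2 / expected."""
    n = sum(observed_counts.values())
    statistic = 0.0
    for group, share in population_shares.items():
        expected = share * n
        observed = observed_counts.get(group, 0)
        statistic += (observed - expected) ** 2 / expected
    return statistic


def demographic_parity_difference(rows):
    """Largest gap in positive-outcome rate between any two groups.

    `rows` is an iterable of (group, outcome) pairs with outcome 0 or 1.
    """
    totals = defaultdict(int)
    positives = defaultdict(int)
    for group, outcome in rows:
        totals[group] += 1
        positives[group] += outcome
    rates = [positives[group] / totals[group] for group in totals]
    return max(rates) - min(rates)


# Hypothetical sample counts and target-population shares.
observed = {"group_a": 50, "group_b": 30, "group_c": 20}
population = {"group_a": 0.4, "group_b": 0.4, "group_c": 0.2}

statistic = chi_squared_statistic(observed, population)
# Critical value for 2 degrees of freedom (3 groups - 1) at the 5% level.
CRITICAL_VALUE_DF2 = 5.991
differs_from_population = statistic > CRITICAL_VALUE_DF2

# Hypothetical labelled rows: (group, outcome).
rows = [
    ("group_a", 1), ("group_a", 1), ("group_a", 1), ("group_a", 0),
    ("group_b", 1), ("group_b", 0), ("group_b", 0), ("group_b", 0),
]
parity_gap = demographic_parity_difference(rows)

print(f"chi-squared = {statistic:.2f}, differs = {differs_from_population}")
print(f"demographic parity difference = {parity_gap:.2f}")
```

In practice a statistics library would normally be used for the test itself, for instance `scipy.stats.chisquare`, which also returns a p-value. A parity gap of 0 means all groups have the same positive-outcome rate; how large a gap is acceptable depends on the project.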

8. Step 6. Selection

We're nearly ready to make our selection! If a data source consistently aligns with the project throughout all evaluation steps, we may include it in the project directly. Sources that fall short in key areas such as legal compliance or bias may be excluded to protect project integrity. If a data source is valuable but imperfect, we may still include it after applying data transformation, augmentation, or algorithmic corrections. For the final decision, we consult domain experts, unless we are the domain experts ourselves!
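The three outcomes described here (include, exclude, or include after corrections) can be sketched as a small decision function. This is only an illustration: the step names and the choice of which failed steps count as blocking are hypothetical, and in a real project that choice is a judgment call made with domain experts and legal counsel.

```python
def decide(evaluation, blocking=("legal_compliance",)):
    """Map per-step pass/fail results to a selection decision.

    `evaluation` maps a step name to True (passed) or False (failed).
    `blocking` lists the steps whose failure excludes the source outright.
    """
    failed = [step for step, passed in evaluation.items() if not passed]
    if not failed:
        return "include"
    if any(step in blocking for step in failed):
        return "exclude"
    return "include with mitigation"


# Hypothetical evaluation results for three unnamed sources.
clean_source = {"relevance": True, "integrity": True,
                "legal_compliance": True, "technical_quality": True,
                "bias": True}
unlawful_source = dict(clean_source, legal_compliance=False)
skewed_source = dict(clean_source, bias=False)

print(decide(clean_source))
print(decide(unlawful_source))
print(decide(skewed_source))
# Treating bias as blocking changes the outcome for the skewed source.
print(decide(skewed_source, blocking=("legal_compliance", "bias")))
```

Keeping the blocking criteria as an explicit parameter makes the policy visible and easy to revisit when the experts disagree with the default.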

9. Urban traffic flow project

Let's consider data source selection for the urban traffic flow project. We have a list of five data sources: traffic count data, council meeting notes, GPS tracking data, social media mentions of traffic conditions, and commuter survey data covering commuters' daily travel times, preferred routes, and traffic experiences.

10. Urban traffic flow project

Let's evaluate these. For social media data, we have yet to apply to the platform for the approval that would outline the consent process, so we exclude it for now. Commuter survey data should be excluded as well due to sampling bias. Council meeting notes lack representativeness, since they may disproportionately reflect more active community members, so we consider including this source with alterations. Traffic count and GPS tracking data pass initial evaluations but need further analysis for potential historical bias and representation. With our selection made, we plan for data augmentation. For traffic count data, we consider including additional data from newly installed traffic sensors in previously under-monitored areas. We also augment GPS data with traffic camera footage analysis to include information on pedestrian flows, cyclist movements, and areas not well covered by public transportation data.

11. Let's practice!

Excellent work. Now, let's practice!