Data integration

1. Data integration

Using multiple data sources can result in more diverse datasets and improved modeling. Let's see how to integrate them into a project.

2. What we will cover

Let's start by looking at why integration is necessary, its benefits, and its complications, and then review steps to ensure data integration goes smoothly.

3. Why have multiple sources?

By integrating multiple datasets, we combine information from different sources to create a more comprehensive and detailed view. A single data source may not have enough depth, especially for a complex project. Relying on a single source also leaves models exposed to that source's problems with availability, quality, bias, and representation. Multiple sources provide a safety net. When used correctly, they strengthen several dimensions of responsible data, particularly diversity and fairness, along with explainability, transparency, and accountability. For example, a financial market app that integrates stock market data, economic indicators, and company financials can improve its stock performance predictions by combining company-specific and macroeconomic insights.

4. Beware of issues

Poor integration can compromise data quality, introduce inconsistencies, amplify biases, and reduce representation. Done badly, integration has the opposite of its intended effect: it increases model complexity while reducing transparency and explainability. Let's look at best practices for integrating data sources.

5. Step 1. Data source selection

We start with data source selection, following the evaluation steps previously discussed, and assess the data sources for their ability to contribute to a more balanced and comprehensive dataset.

6. Step 2. Aligning data types

Next, we align data types across sources so that the data works together. We standardize the datasets by identifying common variables, standardizing names and formats, normalizing numerical data, and consolidating categorical data. We align the granularity of the data in both time and geography. We then run a model on the unified data to check for any remaining sources of bias and representation issues.
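To make the alignment step concrete, here is a minimal sketch using only the Python standard library. The column aliases, category aliases, and sample field names are hypothetical and chosen purely for illustration; a real project would derive them from its own sources.

```python
from datetime import datetime

# Hypothetical aliases: source-specific column names mapped to shared names.
COLUMN_ALIASES = {"ts": "timestamp", "time": "timestamp",
                  "loc": "location_id", "site": "location_id"}

# Hypothetical aliases: different spellings consolidated into one category.
CATEGORY_ALIASES = {"roadworks": "construction", "road work": "construction"}


def standardize_record(record):
    """Rename keys to the shared names and consolidate category labels."""
    out = {}
    for key, value in record.items():
        name = key.strip().lower()
        out[COLUMN_ALIASES.get(name, name)] = value
    if "category" in out:
        label = str(out["category"]).strip().lower()
        out["category"] = CATEGORY_ALIASES.get(label, label)
    return out


def floor_to_hour(iso_timestamp):
    """Align temporal granularity by truncating a timestamp to the hour."""
    parsed = datetime.fromisoformat(iso_timestamp)
    return parsed.replace(minute=0, second=0, microsecond=0).isoformat()


def min_max_normalize(values):
    """Rescale numbers to the range [0, 1]; constant input maps to 0.0."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


raw = {"TS": "2024-03-01T08:47:12", "Site": "A1", "Category": " Roadworks "}
clean = standardize_record(raw)
clean["timestamp"] = floor_to_hour(clean["timestamp"])
print(clean)
```

Min-max scaling is only one way to normalize numerical data; z-scores or unit conversions may suit a given source better.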

7. Step 3. Bias and representation enhancement

We address any bias or representation issues that appear using weighting and balancing techniques. Weighting requires domain knowledge about the target population so that we can assign appropriate weights to under- or overrepresented groups. With balancing, we modify the data to get an equal representation of different groups using oversampling or undersampling techniques. We then run algorithmic checks by comparing representation and outcomes across different groups and plan further interventions, such as reweighting or data augmentation using synthetic data generation. Post-integration, we conduct a gap analysis by searching for remaining issues and repeat the adjustment procedures if needed.
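The two techniques can be sketched as follows. This is an illustrative example, not a prescribed method: it assumes the target population shares are known from domain knowledge, and the group labels and the `seed` parameter are hypothetical.

```python
import random
from collections import Counter


def group_weights(groups, target_shares):
    """Weighting: weight = target population share / observed sample share.

    Assumes every observed group has an entry in target_shares.
    """
    counts = Counter(groups)
    total = len(groups)
    return {g: target_shares[g] / (counts[g] / total) for g in counts}


def oversample(rows, group_key, seed=0):
    """Balancing: randomly duplicate rows from smaller groups until every
    group is as large as the largest one."""
    rng = random.Random(seed)
    by_group = {}
    for row in rows:
        by_group.setdefault(row[group_key], []).append(row)
    largest = max(len(members) for members in by_group.values())
    balanced = []
    for members in by_group.values():
        balanced.extend(members)
        # Draw the missing rows with replacement from the same group.
        balanced.extend(rng.choices(members, k=largest - len(members)))
    return balanced


# Hypothetical sample: group "a" is overrepresented, group "b" is not.
sample_groups = ["a"] * 8 + ["b"] * 2
weights = group_weights(sample_groups, {"a": 0.5, "b": 0.5})
print(weights)  # "a" is weighted below 1, "b" above 1

rows = [{"g": g} for g in sample_groups]
print(Counter(r["g"] for r in oversample(rows, "g")))
```

Weighting leaves the rows untouched and changes how much each one counts, while oversampling changes the data itself, so the duplicated rows should be recorded in the integration documentation.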

8. Step 4. Document

For technical, accountability, and transparency reasons, we keep detailed records of data integration steps, including decisions made to address biases and enhance representation. We develop rich metadata that includes information on the sources, collection methodology, and applied transformations so that, later, we can trace the lineage of data, context, and limitations.
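One lightweight way to keep such records is a small metadata structure that travels with each source. The sketch below is an assumption about how this could look; the class name, field names, and example values are all hypothetical.

```python
import json
from dataclasses import dataclass, field, asdict


@dataclass
class SourceMetadata:
    """Hypothetical lineage record for one integrated data source."""
    name: str
    collection_method: str
    known_limitations: list = field(default_factory=list)
    transformations: list = field(default_factory=list)

    def log_step(self, description, reason):
        """Record a transformation together with the decision behind it."""
        self.transformations.append({"step": description, "reason": reason})


meta = SourceMetadata(
    name="example_source",
    collection_method="survey",
    known_limitations=["covers one region only"],
)
meta.log_step("reweighted group B", "group B underrepresented in the sample")

# Serialize so the record can be stored next to the dataset.
print(json.dumps(asdict(meta), indent=2))
```

Recording the reason alongside each step is what lets someone later trace not only what was done to the data but why.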

9. Urban traffic flow project

Let's integrate data sources into the urban traffic flow project. We have already selected our data sources, so let's align them. We identify common features in the data, such as location identifiers and timestamps, and merge the quantitative traffic counts with qualitative urban planning insights and dynamic traffic conditions. We develop a Unified Data Model that maps all three data sources. This model uses GPS coordinates for location identifiers and the ISO 8601 format for timestamps to ensure consistency across datasets. It also defines qualitative categories, such as "public event" and "construction project", so that they can be aligned with location and time data.
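A unified data model along these lines could be sketched as below. Only the GPS coordinates, ISO 8601 timestamps, and the two category names come from the lesson; the class name, field names, helper functions, and sample coordinates are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional

# The two categories named in the lesson; a real model would list more.
ALLOWED_CATEGORIES = {"public event", "construction project"}


@dataclass(frozen=True)
class UnifiedRecord:
    latitude: float
    longitude: float
    timestamp: str                       # ISO 8601 string
    vehicle_count: Optional[int] = None  # from the traffic counts
    category: Optional[str] = None       # from the qualitative sources

    def __post_init__(self):
        if not (-90 <= self.latitude <= 90 and -180 <= self.longitude <= 180):
            raise ValueError("GPS coordinates out of range")
        # Raises ValueError for strings fromisoformat cannot parse.
        datetime.fromisoformat(self.timestamp)
        if self.category is not None and self.category not in ALLOWED_CATEGORIES:
            raise ValueError(f"unknown category: {self.category}")


def to_iso8601(epoch_seconds):
    """Convert a Unix timestamp to an ISO 8601 string in UTC."""
    return datetime.fromtimestamp(epoch_seconds, tz=timezone.utc).isoformat()


def merge_sources(counts, events):
    """Attach an event category to each count with the same place and time."""
    by_key = {(e.latitude, e.longitude, e.timestamp): e.category for e in events}
    return [
        UnifiedRecord(c.latitude, c.longitude, c.timestamp, c.vehicle_count,
                      by_key.get((c.latitude, c.longitude, c.timestamp)))
        for c in counts
    ]


ts = to_iso8601(0)
count = UnifiedRecord(40.0, -74.0, ts, vehicle_count=120)
event = UnifiedRecord(40.0, -74.0, ts, category="public event")
print(merge_sources([count], [event]))
```

Exact matching on coordinates and timestamps is a simplification. In practice, sources rarely agree to the last decimal or second, so a real merge would first align both to a shared grid and time bucket.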

10. Urban traffic flow project

We use statistical techniques to ensure a balanced dataset, watching for potential bias towards certain times of day or geographical areas. We identify underrepresented traffic patterns, such as those in suburban areas or during non-peak hours, and apply weighting adjustments. A follow-up gap analysis shows that emerging residential areas are still underrepresented, so we reweight again. We carefully document all data sources and transformations and develop metadata for accountability and transparency.
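A gap analysis of this kind can be sketched as a comparison between observed and expected shares per stratum. The strata, expected shares, and `tolerance` threshold below are hypothetical values for illustration, not figures from the project.

```python
from collections import Counter


def representation_gaps(observations, expected_shares, tolerance=0.05):
    """Return strata whose observed share falls short of the expected share
    by more than `tolerance`, with the size of each shortfall."""
    counts = Counter(observations)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no observations to analyse")
    gaps = {}
    for stratum, expected in expected_shares.items():
        observed = counts.get(stratum, 0) / total
        if expected - observed > tolerance:
            gaps[stratum] = round(expected - observed, 3)
    return gaps


# Hypothetical strata: (area type, time band) for each traffic observation.
observations = (
    [("urban", "peak")] * 70
    + [("urban", "off_peak")] * 20
    + [("suburban", "off_peak")] * 10
)
# Hypothetical expected shares, e.g. taken from planning data.
expected = {
    ("urban", "peak"): 0.5,
    ("urban", "off_peak"): 0.2,
    ("suburban", "off_peak"): 0.3,
}
print(representation_gaps(observations, expected))
```

Any stratum the function flags is a candidate for the weighting adjustments described above, and rerunning the check after each adjustment gives a simple stopping criterion.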

11. Let's practice!

Fantastic integration! Let's practice.
