
Evaluating modern data architecture solutions

1. Evaluating modern data architecture solutions

Now, let's review the solution!

2. Ingestion

Since the robots can push files, we decided to use that capability: data arrives in unpredictable patterns and from multiple devices. A pull approach, by contrast, would require each device to expose its files, for example through a network file system, which is more complex.
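
As a rough sketch of the push approach, each robot (or an agent running next to it) could upload files straight to a Cloud Storage bucket with the Python client. The bucket name, object prefix, and device identifier below are assumptions for illustration, not part of the solution.

```python
# Minimal device-side push sketch, assuming a bucket named "robot-telemetry"
# and application-default credentials available on the device.
from google.cloud import storage

def push_file(local_path: str, device_id: str) -> None:
    client = storage.Client()
    bucket = client.bucket("robot-telemetry")  # hypothetical bucket name
    # Prefix objects by device so downstream jobs can tell sources apart.
    blob = bucket.blob(f"incoming/{device_id}/{local_path.split('/')[-1]}")
    blob.upload_from_filename(local_path)

push_file("/tmp/run_0421.csv", device_id="robot-17")
```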

3. Storage

Then, data will arrive in Cloud Storage. Since the data is structured, a relational database or a data warehouse with an ELT approach might sound better. Nonetheless, Cloud Storage is cheaper, and it exposes convenient APIs for loading data easily. We could still consider BigQuery for its cheap storage cost; however, loading local files directly into BigQuery is size-limited, on the order of megabytes. Additionally, since these records will be processed and stored further downstream, we can set a lifecycle policy that moves them to a cheaper storage class after some time.
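
A lifecycle policy like the one described can be attached to the bucket itself. Here is a minimal sketch using the Python client; the 30-day Coldline transition and one-year deletion thresholds are assumptions chosen only to illustrate the mechanism.

```python
# Sketch of a lifecycle policy: move objects to a cheaper storage class
# after 30 days, then delete them after a year (both thresholds assumed).
from google.cloud import storage

client = storage.Client()
bucket = client.get_bucket("robot-telemetry")  # hypothetical bucket name
bucket.add_lifecycle_set_storage_class_rule("COLDLINE", age=30)
bucket.add_lifecycle_delete_rule(age=365)
bucket.patch()  # persist the updated lifecycle configuration
```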

4. Processing

There are two parts here. First, the processing for data quality and business rules. We decided to use Dataflow, but other services are equally valid options. The interesting discussion is batch versus streaming: streaming fits better because the files arrive without a predictable pattern, and we need to process them as soon as possible. Another decision is storing intermediate data in Cloud Storage, mainly for simplicity. Since it will be processed further, we can keep it there for a short period and then delete it. And because this is blob storage, a future schema change doesn't force us to update any table schema; we can still use a binary format like Parquet to preserve types.
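
To make the quality step concrete, here is a minimal Apache Beam sketch (the SDK Dataflow runs). The field names, the validation rule, and the paths are all illustrative, and for readability it runs as batch over a file glob; a streaming Dataflow job would instead read Pub/Sub notifications and add windowing before the writes.

```python
# Sketch of the quality/business-rules step, assuming CSV records with an
# "id" and a numeric "value" field; names, rule, and paths are assumptions.
import apache_beam as beam
import pyarrow

SCHEMA = pyarrow.schema([("id", pyarrow.string()), ("value", pyarrow.float64())])

def parse(line: str) -> dict:
    record_id, value = line.split(",")
    return {"id": record_id, "value": float(value)}

def is_valid(record: dict) -> bool:
    return record["value"] >= 0  # hypothetical business rule

with beam.Pipeline() as p:
    records = (
        p
        | "Read" >> beam.io.ReadFromText("gs://robot-telemetry/incoming/*/*.csv")
        | "Parse" >> beam.Map(parse)
    )
    valid, quarantined = records | "Split" >> beam.Partition(
        lambda record, _: 0 if is_valid(record) else 1, 2
    )
    # Write intermediate data as Parquet to keep the types.
    valid | "WriteValid" >> beam.io.WriteToParquet(
        "gs://robot-telemetry/clean/records", SCHEMA
    )
    # Quarantine poor-quality records for later review.
    quarantined | "Stringify" >> beam.Map(str) | "WriteBad" >> beam.io.WriteToText(
        "gs://robot-telemetry/quarantine/records"
    )
```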

5. Processing: The model scores

The second part is adding the model's scores. Since scoring needs all records, or a summary that changes as new records arrive, it would be hard to keep track of everything within a single streaming job. Thus, we proposed running this processing in batch to reduce pipeline complexity and make the pipelines easier to maintain. Because this part doesn't produce the test results themselves, we can replicate those results to the serving layer first and then complement them with the scores.
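
One way to picture this batch step is a periodic job that recomputes the summary over the full history and merges it into the serving table next to the replicated results. The sketch below assumes BigQuery as the serving layer; the table names are hypothetical, and the AVG is only a stand-in for the real model's scoring logic.

```python
# Sketch of the periodic batch step that complements the replicated test
# results with model scores; tables and the scoring query are illustrative.
from google.cloud import bigquery

client = bigquery.Client()

# Recompute scores over all records so summaries stay consistent, then
# merge them into the serving table alongside the test results.
client.query(
    """
    MERGE analytics.test_results AS t
    USING (
      SELECT device_id, AVG(value) AS model_score  -- stand-in for the model
      FROM analytics.clean_records
      GROUP BY device_id
    ) AS s
    ON t.device_id = s.device_id
    WHEN MATCHED THEN UPDATE SET t.model_score = s.model_score
    """
).result()
```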

6. Serving the data

When the data is ready, we'd like to perform analytics. However, if we only serve it from an analytical store like BigQuery, the application that looks up single records won't perform well. That's why we decided to additionally copy the data into a NoSQL database. Even though the data is structured, we don't actually need transactions or complex queries, so a NoSQL database will be easier to scale to meet the increased demand.
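
For the single-record path, the access pattern is a simple key-value lookup. As a sketch, assuming Firestore as the NoSQL store (any similar document or wide-column database would do), with a hypothetical collection name and document fields:

```python
# Sketch of the key-value serving path, assuming Firestore as the NoSQL
# store; the collection name and document fields are illustrative.
from google.cloud import firestore

db = firestore.Client()

# Replicate one processed record into the serving store...
db.collection("test_results").document("robot-17_run_0421").set(
    {"device_id": "robot-17", "value": 0.93, "model_score": 0.88}
)

# ...so the application can fetch a single record by key, cheaply.
doc = db.collection("test_results").document("robot-17_run_0421").get()
print(doc.to_dict())
```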

7. Some other details

We didn't mention much about governance or orchestration. Nonetheless, keep in mind that we decided to quarantine poor-quality records. This is actually a data governance policy we are defining, and it requires us to think about how to treat such records. For instance, we'll need a set of jobs to review them and involve people in the process, which means sending emails and orchestrating this validation. Even though we focused on a very specific flow, we still had to consider several aspects; as requirements get more complex, we'd want to refine these concepts and leverage them to properly manage our platform. Finally, keep in mind that this is only one of multiple approaches to solving the same problem. In the end, everything is a trade-off, and we should understand those trade-offs and decide which side is more important to us at the time.
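
To hint at what that orchestration could look like, here is a minimal Airflow sketch (e.g., on Cloud Composer) that scans the quarantine area daily and notifies reviewers. The schedule, recipient, helper function, and SMTP setup are all assumptions for illustration.

```python
# Sketch of orchestrating the quarantine review with Airflow; assumes an
# SMTP connection is configured and the paths/recipients are illustrative.
from datetime import datetime

from airflow import DAG
from airflow.operators.email import EmailOperator
from airflow.operators.python import PythonOperator

def list_quarantined_records():
    # Hypothetical helper: scan the quarantine prefix and summarize it.
    print("Scanning gs://robot-telemetry/quarantine/ ...")

with DAG(
    dag_id="quarantine_review",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    scan = PythonOperator(
        task_id="scan_quarantine", python_callable=list_quarantined_records
    )
    notify = EmailOperator(
        task_id="notify_reviewers",
        to="data-quality@example.com",
        subject="Quarantined records awaiting review",
        html_content="Please review today's quarantined records.",
    )
    scan >> notify
```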

8. Let's practice!

Let's keep refining our platform!
