
Designing a modern data architecture

1. Designing a modern data architecture

Let's see how all of these come together!

2. The business case

Imagine the following scenario: we're consultants for a medical laboratory that uses organizer robots to manage its samples. These machines can recognize samples and their owners, and they produce CSV files that can be pushed to a given endpoint. The laboratory wants to process the data as soon as it's generated, but each machine only produces a file every 15 minutes. Additionally, the company has been investing heavily in research, so it has multiple databases stored as structured plain-text files. Their idea is to build a platform where patients can track all their samples and results. They also want to correlate patients' results with the research databases to enrich the reports, so analysts will need access to the data too.

3. Where to start?

This whole case may sound scary, but let's start by understanding it better. As consultants, our job is to ask questions, understand the customer, and deliver solutions that make them happy. So we need to investigate and ask: How large are the files the robots generate? How many robots are there, and how many files do they produce? Similar questions apply to the databases. What kind of analysis will they perform: machine learning models, ad hoc queries? How will the data be consumed? They mention that patients should be able to track every sample and result, so perhaps a web application? And what about regulations? This is a healthcare customer, so they will likely face stricter rules, and we need to understand exactly what data we'll process, since medical data requires special treatment.

4. The assumptions

Let's answer some of the previous questions. Our customer has a hundred of those robots and plans to acquire even more. Each robot generates a CSV file of around 100 MB. Their databases are structured plain-text files of tens of gigabytes. For analytics, they built models that process each laboratory result and produce a set of scores, and they have abstracted these behind an API. However, a laboratory result is composed of multiple records that may come from different CSVs, and the models need all existing records, or at least a summary of them. Moreover, incoming CSV files can modify information in existing results to refine them. The laboratory results and the model-enriched reports will be consumed by patients in a mobile application. Finally, we'll set regulations aside for now.
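
With these numbers we can already estimate the ingest volume the platform must absorb. The snippet below is only a back-of-the-envelope sketch using the stated assumptions (100 robots, ~100 MB per file, one file every 15 minutes), not a sizing exercise:

```python
# Back-of-the-envelope ingest estimate under the stated assumptions.
robots = 100
file_size_mb = 100      # ~100 MB per CSV file
files_per_hour = 4      # one file every 15 minutes per robot

hourly_gb = robots * file_size_mb * files_per_hour / 1024
daily_tb = hourly_gb * 24 / 1024
print(f"~{hourly_gb:.0f} GB/hour, ~{daily_tb:.1f} TB/day")
# -> ~39 GB/hour, ~0.9 TB/day
```

Roughly a terabyte a day is well within what a managed, serverless pipeline can handle, which informs the proposal that follows.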

5. The solution

Let's check an initial proposal that we'll refine later. Since the machines can push files to a given endpoint, we'll take advantage of that and use Cloud Storage to receive them; this bucket acts as our landing zone. To process the files, we can leverage Pub/Sub notifications, which send a message to a streaming pipeline whenever a new file arrives. For the processing itself we'll use Dataflow, for its serverless simplicity. When a file lands in the landing zone, the pipeline does two things: first, it runs data quality checks that isolate bad records into a quarantine zone for later review; second, it applies the required business rules and stores the valid records back in Cloud Storage. Then an hourly batch job consumes the processed records, applies the scores from the model, and stores the data in two places: BigQuery, for analytical purposes, and a NoSQL database, which scales horizontally to meet the mobile app's traffic. This is feasible because the app workload doesn't require transactions.
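
To make the first stage concrete, here is a minimal Apache Beam sketch of the quality-check step. The bucket paths, the column layout, and the validation rule are all illustrative assumptions; the real pipeline would run in streaming mode, triggered by the Pub/Sub notifications (which can be enabled on a bucket with `gsutil notification create -t <topic> -f json gs://<bucket>`):

```python
# A simplified sketch of the validation step. Bucket names, the
# 4-column layout, and the quality rule are illustrative assumptions.
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions


class ValidateRecord(beam.DoFn):
    """Route each CSV line to either the valid or the quarantine output."""

    def process(self, line):
        fields = line.split(",")
        # Hypothetical quality check: exactly 4 columns and a
        # non-empty sample id in the first column.
        if len(fields) == 4 and fields[0].strip():
            yield beam.pvalue.TaggedOutput("valid", line)
        else:
            yield beam.pvalue.TaggedOutput("quarantine", line)


def run():
    # On Dataflow you would pass --runner=DataflowRunner, --project, etc.
    opts = PipelineOptions()
    with beam.Pipeline(options=opts) as p:
        results = (
            p
            | "ReadLanding" >> beam.io.ReadFromText("gs://landing-zone/*.csv")
            | "Validate" >> beam.ParDo(ValidateRecord()).with_outputs("valid", "quarantine")
        )
        results.valid | "WriteValid" >> beam.io.WriteToText("gs://processed-zone/valid")
        results.quarantine | "WriteQuarantine" >> beam.io.WriteToText("gs://quarantine-zone/bad")


if __name__ == "__main__":
    run()
```

In production, the business-rule transforms would run between validation and the write, and the streaming writes would be windowed; the skeleton only shows the quarantine branching.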

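The hourly batch job could follow the same Beam pattern. Below is a sketch under the same assumptions: `score_result` stands in for the customer's model API, the table name and schema are invented, and the NoSQL write (for example, to Bigtable or Firestore) is omitted for brevity:

```python
# A minimal sketch of the hourly scoring job; names and schema are assumptions.
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

COLUMNS = ["sample_id", "patient_id", "result", "collected_at"]  # assumed layout


def score_result(record):
    # Placeholder for a call to the customer's model API; assumed to
    # return the record enriched with a score.
    record["score"] = 0.0
    return record


def run():
    with beam.Pipeline(options=PipelineOptions()) as p:
        (
            p
            | "ReadProcessed" >> beam.io.ReadFromText("gs://processed-zone/valid*")
            | "Parse" >> beam.Map(lambda line: dict(zip(COLUMNS, line.split(","))))
            | "Score" >> beam.Map(score_result)
            | "WriteBigQuery" >> beam.io.WriteToBigQuery(
                "my-project:lab.scored_results",  # hypothetical table
                schema="sample_id:STRING,patient_id:STRING,result:STRING,"
                       "collected_at:STRING,score:FLOAT",
                write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
            )
        )


if __name__ == "__main__":
    run()
```
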
6. Let's practice!

Let's design modern data architectures!