The data pipeline
1. The data pipeline
Alright, we've mentioned the term pipeline several times by now, so let's focus on it for this lesson.

2. If data is the new oil...
You may have heard that "data is the new oil," as coined by The Economist, so let's follow this idea.

3. Oil well
We extract crude oil from an oil field.

4. Oil well pipe
We move the crude oil to a distillation unit.

5.–11. Distilling
There, we separate the oil into several products: residue, heavy oil, diesel, kerosene, naphtha, and gasoline. Some products are sent directly to their final users.

12. Airport
For example, some pipes go straight to airports to deliver kerosene.

13. Gas storage facility
Other products, like gasoline, are sent to gas storage facilities and stored in big tanks,

14. Gas stations
before being distributed to gas stations.

15. Naphtha is transformed
Other products, like naphtha, go through several chemical transformations.

16. Factory receives plastic
Manufacturers use the resulting synthetic polymers to create products, like CDs. As you can see, we have many pipelines tying it all together.

17. Back to data engineering
CDs? So last century, Vivian thinks. However, to manage data for Spotflix, she follows a procedure similar to oil processing. Companies ingest data from many different sources, and that data needs to be processed and stored in various ways. To handle that, we need data pipelines that efficiently automate the flow from one station to the next, so that data scientists can work with up-to-date, accurate, and relevant data. This isn't a simple task, and that's why data engineers are so important.

18. Mobile
At Spotflix, we have sources from which we extract data: for example, the users' actions and listening history on the mobile Spotflix app,

19. Computer
the desktop Spotflix app,

20. Website
and the Spotflix website itself. We also have websites Spotflix uses internally, like the HR management system for payroll and benefits.

21.–23. Ingesting the data
The data is ingested into Spotflix's system,

24. Data lake
moving from their respective sources to our data lake (no fear, we will talk about data lakes in the next chapter).

25. First pipelines
These are our first three pipelines.

26. Artists database
We then organize the data, moving it into databases (we will talk more about databases in Chapter 2 as well). It could be artists data, like name, number of followers, and associated acts;

27. Albums database
albums data, like label, producer, and year of release;

28. Tracks database
tracks data, like name, length, featured artists, and number of listens;

29. Playlists database
playlists data, like name, the songs it contains, and date of creation;

30. Customers database
customers data, like username, account opening date, and subscription tier;

31. Employees database
or employees data, like name, salary, and reporting manager, updated by human resources.

32. Six more pipelines
These are six new pipelines.

33. Album covers database
Some albums data can be extracted and stored directly. For example, album cover pictures all have the same format, so we can store them directly without having to crop them.

34. One more pipeline
One more pipeline!

35. Sales employees table
Employees could be split into different tables by department: for example, sales,

36. Engineering employees table
engineering,

37. Support employees table
and support. We will talk about tables in Chapter 2 as well.

38. Three more pipelines
For now, three more pipelines!

39. Sales USA employees table
These tables could be further split by office: for example, the US,

40. Sales Belgium employees table
Belgium,

41. Sales UK employees table
and the UK. If data scientists had to analyze employee data (to investigate employee turnover, for example), this is the data they would use.

42. Three more pipelines
Three more pipelines!
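To make these splits concrete, here is a minimal sketch of what they could look like in Python with pandas. The table contents and the column names ("department", "office") are invented for this example; the lesson doesn't show Spotflix's real schema.

```python
import pandas as pd

# A toy employees table (made-up data for illustration).
employees = pd.DataFrame({
    "name": ["Ana", "Bo", "Cem", "Dee"],
    "department": ["Sales", "Sales", "Engineering", "Support"],
    "office": ["USA", "Belgium", "USA", "UK"],
})

# One table per department...
sales = employees[employees["department"] == "Sales"]

# ...further split by office.
sales_usa = sales[sales["office"] == "USA"]
sales_belgium = sales[sales["office"] == "Belgium"]

print(sales_usa)
```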
43. Checking for corrupted tracks
Tracks would need to be processed: first to check that the track is readable, then to check that the corresponding artist is in the database, then to make sure the file has the correct size and format, and so on.

44. One more pipeline
That's one more pipeline, which we will unpack in Chapter 3 when we talk about data processing.
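As a rough illustration, the checks just described could look something like this in Python. The size limit, allowed formats, and function name are assumptions for this sketch, not Spotflix's actual rules.

```python
import os

ALLOWED_FORMATS = {".mp3", ".ogg"}   # assumed allowed file formats
MAX_SIZE_BYTES = 50 * 1024 * 1024    # assumed 50 MB size limit

def track_is_valid(path, artist, known_artists):
    """Return True only if the track passes every check."""
    # 1. Is the track file readable?
    if not os.access(path, os.R_OK):
        return False
    # 2. Is the corresponding artist in the database?
    if artist not in known_artists:
        return False
    # 3. Is the file in the correct format and size?
    _, ext = os.path.splitext(path)
    if ext.lower() not in ALLOWED_FORMATS:
        return False
    if os.path.getsize(path) > MAX_SIZE_BYTES:
        return False
    return True
```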
45. Clean tracks database
The data can then be stored in a new, clean tracks database. This is one of the databases data scientists could use to build a recommendation engine, by analyzing songs for similarity, for example.

46. One last pipeline
And that's our last pipeline!

47.–49. You get a pipeline! Everybody gets a pipeline!
Alright! That's a lot of pipelines!

50. Data pipelines ensure an efficient flow of the data
In a nutshell, data pipelines ensure that data flows efficiently through the organization. They automate extracting, transforming, combining, validating, and loading data, to reduce human intervention and errors, and to decrease the time it takes data to flow through the organization. Don't worry, we'll cover this in detail in the last chapter.

51. ETL and data pipelines
One term you will hear a lot is "ETL". It's a popular framework for designing data pipelines. It breaks the flow of data up into three sequential steps: first, E for extracting the data; then, T for transforming the data; and finally, L for loading the transformed data into a new database. The key here is that data is processed before it's stored. In general, data pipelines move data from one system to another. They may follow ETL, but not always: for instance, the data may not be transformed, and instead be routed directly to applications like visualization tools or Salesforce.
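To make the three steps concrete, here is a minimal ETL sketch in Python. The data, column names, and SQLite database are placeholders for illustration, not any real Spotflix system.

```python
import sqlite3
from io import StringIO

import pandas as pd

# Extract: read raw data from a source (a small inline CSV here).
raw_csv = StringIO("track_name,length_seconds\n Halo ,213\n,180\nDrive,251")
raw = pd.read_csv(raw_csv)

# Transform: process the data before storing it.
clean = raw.dropna(subset=["track_name"]).copy()       # drop unusable rows
clean["track_name"] = clean["track_name"].str.strip()  # tidy up names

# Load: write the transformed data into a new database.
with sqlite3.connect("spotflix_example.db") as conn:
    clean.to_sql("clean_tracks", conn, if_exists="replace", index=False)
```

Note how the transform step runs before anything is written to the database: that ordering is exactly what makes this pipeline an ETL pipeline.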
52. Summary
OK, now you understand what a data pipeline is, what it's used for, why it's important, how we use them at Spotflix, and where ETL fits in.

53. Let's practice!
Let's solidify your understanding of data pipelines with a couple of exercises, and then onwards to Chapter 2 to dive into the details of data storage.