1. Extract
You made it to the third chapter. Impressive work so far. This chapter covers a concept that we often refer to as ETL in data engineering. ETL stands for Extract, Transform, and Load. We'll have one lesson on each of these steps. In the final lesson, we'll set up an ETL process using a scheduler we saw in the previous chapter.
2. Extracting data: what does it mean?
This first lesson is about data extraction, or the extract phase in ETL. Now, what do we mean by extracting data? Very roughly, it means moving data out of persistent storage, which is not suited for data processing, and into memory. Persistent storage could be a file on Amazon S3, for example, or a SQL database. It's the necessary stage before we can start transforming the data. The sources to extract from vary.
3. Extract from text files
First of all, we can extract data from plain text files. These are files that are generally readable by people. They can be unstructured, like a chapter from a book like Moby Dick. Alternatively, they can be flat files, where each row is a record and each column is an attribute of the records. In the latter case, we represent data in a tabular format. Typical examples of flat files are comma- or tab-separated files: .csv or .tsv. They use commas or tabs, respectively, to separate columns, as in the sketch below.
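Here is a minimal sketch of extracting a flat file into memory with `pandas`. The file names are hypothetical; any comma- or tab-separated file with a header row would work the same way.

```python
import pandas as pd

# Read a comma-separated flat file into a DataFrame:
# each row becomes a record, each column an attribute.
# "ratings.csv" is a hypothetical file name used for illustration.
ratings = pd.read_csv("ratings.csv")

# For a tab-separated file, point the separator at the tab character.
ratings_tsv = pd.read_csv("ratings.tsv", sep="\t")

print(ratings.head())
```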
4. JSON
Another widespread data format is called JSON, or JavaScript Object Notation. JSON files hold information in a semi-structured way. It consists of 4 atomic data types: number, string, boolean and null. There are also 2 composite data types: array and object. You could compare it to a dictionary in Python. JSON objects can be very nested, like in this example.
There's a pretty good mapping from JSON objects to dictionaries in Python. There's a package in the standard library called `json`, which helps you parse JSON data. The function `json.loads` does exactly this. The reason JSON got popular is that these days, many web services use this data format to communicate data.
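To make the mapping concrete, here is a small sketch using the standard-library `json` package. The JSON document itself is made up for illustration; note how objects map to dictionaries and arrays map to lists.

```python
import json

# A small, hypothetical JSON document: objects nest inside objects,
# and arrays hold ordered values.
raw = '{"user": {"name": "jane", "age": 31}, "scores": [9.5, 8.0], "active": true, "note": null}'

# json.loads parses a JSON string into the matching Python objects:
# object -> dict, array -> list, string -> str, number -> int/float,
# true/false -> bool, null -> None.
parsed = json.loads(raw)

print(parsed["user"]["name"])   # "jane"
print(parsed["scores"][0])      # 9.5
```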
5. Data on the Web
At this point, it makes sense to do a crash course on the web. On the web, most communication happens through something called 'requests.' You can look at a request as a 'request for data.' A request gets a response. For example, if you browse Google in your web browser, your browser requests the content of the Google home page. Google servers respond with the data that makes up the page.
6. Data on the Web through APIs
However, not all web servers serve web pages meant to be read by humans. Some serve data in a JSON data format. We call these servers APIs, or application programming interfaces. The popular social media tool, Twitter, hosts an API that provides us with information on tweets in JSON format. Using the Twitter API does not tell us anything about how Twitter stores its data; it merely provides us with a structured way of querying their data.
Let's look at another example request to the Hacker News API and the resulting JSON response. As you can see, you can use the Python package `requests` to request data from an API. We will use the `.get()` method and pass a URL. The resulting response object has a built-in helper method called `.json()` to parse the incoming JSON and transform it into a Python object.
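A minimal sketch of such a request is shown below. The Hacker News item id is chosen arbitrarily for illustration; any valid item id would return a similar JSON payload.

```python
import requests

# Hypothetical example: fetch a single item from the Hacker News API.
url = "https://hacker-news.firebaseio.com/v0/item/16222426.json"

response = requests.get(url)
response.raise_for_status()   # fail loudly on HTTP errors

# .json() parses the JSON body of the response into a Python object.
post = response.json()
print(post.get("title"), post.get("score"))
```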
7. Data in databases
Finally, we have to talk about databases. The most common way of data extraction is extraction from existing application databases. Most applications, like web services, need a database to back them up and persist data.
At this point, it's essential to make a distinction between two main database types.
Databases that applications like web services use are typically optimized for having lots of transactions. A transaction typically changes or inserts rows, or records, in the database. For example, let's say we have a customer database. Each row, or record, represents data for one specific customer. A transaction could add a customer to the database, or change their address. These kinds of transactional databases are called OLTP, or online transaction processing. They are typically row-oriented: the system adds data row by row. In contrast, databases optimized for analysis are called OLAP, or online analytical processing. They are often column-oriented. We'll talk more about this later in this chapter.
8. Extraction from databases
To extract data from a database in Python, you'll always need a connection string. The connection string or connection URI is a string that holds information on how to connect to a database. It typically contains the database type, for example, PostgreSQL, the username and password, the host and port and the database name. In Python, you'd use this connection URI with a package like `sqlalchemy` to create a database engine. We can pass this engine object to several packages that support it to interact with the database. The example shows the usage with `pandas`.
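Here is a minimal sketch of that workflow. The connection URI, credentials, and the `customer` table are hypothetical placeholders; the pattern of building an engine with `sqlalchemy` and reading from it with `pandas` is the part that carries over.

```python
import sqlalchemy
import pandas as pd

# Hypothetical connection URI: database type, username and password,
# host, port, and database name.
connection_uri = "postgresql://repl:password@localhost:5432/pagila"

# Create an engine; packages that support it, like pandas,
# can use this object to interact with the database.
db_engine = sqlalchemy.create_engine(connection_uri)

# Extract all rows of a (hypothetical) "customer" table into a DataFrame.
customers = pd.read_sql("SELECT * FROM customer", db_engine)
print(customers.head())
```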
9. Let's practice!
Now that you've seen the extract phase, let's look at some exercises!