Get startedGet started for free

Transforming data with pandas

1. Transforming data with pandas

Now that you've extracted data from a source system, it's time to transform it.

2. Transforming data in a pipeline

In a data pipeline, transforming data correctly is one of the most important measures of success of a data solution. If data is curated in an incorrect format, or information is lost through the transformation process, downstream users may not be able to generate value from the output of a data pipeline. Using pandas, it's easy to transform data. pandas provides powerful tools to manipulate tabular data, most of which can be executed in a single line of code. With pandas, the sky's the limit. Rows can be filtered, columns can be created, data types can be changed, and so much more. We'll explore loc and to_datetime to transform data.

3. Filtering records with .loc[]

loc is used to filter DataFrames by row and column values, and is one of the most widely-used tools for manipulating DataFrames. In our first statement, loc is called on the raw_stock_data DataFrame to keep only records with an "open" value greater than zero. The statement inside the brackets creates an array of boolean values, with entry True if "open" is greater than 0, and False otherwise. The colon after the comma signifies that all columns should be included in the resulting DataFrame. In our second example, loc is used to filter all columns except "timestamps", "open", and "close". All rows are maintained. These two statements can be combined into one! In a single line of code, all records with "open" greater than zero, and columns with labels "timestamps", "open", and "close" are preserved. loc has a counterpart, iloc. iloc uses integer indexing, rather than loc's label-based and boolean indexing. Here, iloc is used to return the first fifty rows and three columns of the DataFrame.

4. Altering data types

Often, DataFrames contain columns whose data types need to be converted to a different type. In particular, we'll take a look at transforming columns to type datetime using the to_datetime function. The to_datetime function takes a single column to be converted to type datetime. Here, the "timestamps" column is converted from a string in the format "YYYYmmddHHMMSS" to type datetime, using the format-string passed. Our second example shows an integer also being converted to type datetime. This integer represents the number of milliseconds that have elapsed since January 1, 1970, sometimes called a "Unix timestamp". Timestamps are often the most difficult data type to work with, but pandas built-in functionality makes this a little easier. Besides to_datetime, pandas offers several other tools to wrangle and transform data types.

5. Validating transformations

Any time data is manipulated, there's a risk of losing valuable information, or creating faulty data. Validating transformation helps to mitigate these risks. For now, we'll explore a few tools to manually inspect a DataFrame. The head method is a great tool for outputting the first few rows of a DataFrame. By default, the head method displays the first five rows of the DataFrame, but any integer can be passed. This allows for a Data Engineer to quickly check the results of any transformations that were made, and perform a manual "spot-check" as a first form of validation. In addition to the head method, the nsmallest and nlargest methods can be used to show the n-smallest or largest values in a list of columns passed. These tools are especially useful when filtering data using a range.

6. Let's practice!

Phew, that was a lot. Let's take some time to hone our skills with some practice!