Shaping data with PySpark and SQL

1. Shaping data with PySpark and SQL

Welcome back. In the last lesson, we loaded our transactions dataset into Spark. Now let's shape that data by selecting columns, filtering rows, and querying it with SQL.

2. Shaping your data

We're continuing with our customer transactions data loaded as df. In practice, we rarely need every column or every row. Spark gives us methods like select() to choose columns and filter() to choose rows. Let's see how they work.

3. Selecting columns

Here we call select() and pass the column names we want to keep: ID, Customer ID, Transaction Amount, and Category, chaining with show() to preview the first five rows. Spark returns a new DataFrame with just those columns - the others are dropped.

4. Filtering rows

To filter rows, we first import the functions module from pyspark.sql under the short alias F. Then we use F.col() to reference a column by name. This is the standard pattern for building filter conditions. Here we keep only completed transactions and save the result as filtered_df.

5. Combining filter conditions

We can combine multiple conditions using the ampersand symbol for AND. Each condition needs to be wrapped in parentheses so Spark evaluates them correctly. Here we filter for transactions over twenty thousand that are also completed. The output shows only high-value completed transactions.

6. Python methods or SQL

So far, we've used Python methods like select() and filter(). But Databricks also lets us query data using SQL. Both approaches work on the same data. Python methods are helpful for building transformations step by step, while SQL is often more readable for aggregations and grouping.

7. Creating a temporary view

To query our data with SQL, we first create a temporary view. A temporary view is simply a name that points to our DataFrame, so SQL knows what table to query. We call createOrReplaceTempView() and pass a name - here, transactions. This view lasts only for the current Spark session.

8. Querying with SQL

To run SQL, we create a new SQL cell in our notebook. This query counts completed transactions by category and orders them from highest to lowest. We reference our view name transactions in the FROM clause. Looking at the results, Clothing leads with about sixty-seven hundred completed transactions. Now let's look at some tools for validating our work.

9. Validating with display()

When working with DataFrames, we often need to check our results. Previously, we used show() to preview some rows. Databricks also gives us display(), which renders an interactive table. Here we pass our filtered_df from earlier. We can sort columns, page through results, and spot nulls or outliers quickly. Note that SQL cells already render results as interactive tables, so display() is mainly for PySpark DataFrames.

10. Inspecting query plans

Before Spark runs our code, it builds an execution plan. Calling explain() on any DataFrame shows us that plan. Here we see a Filter node confirming our condition, and a FileScan showing the data source. We can also use the EXPLAIN keyword before any SQL query to see the same plan. This is useful for understanding what Spark will do and for fixing performance issues.

11. Reading Spark errors

Errors happen, and Spark tries to help. Here we try to select a column called Account, which doesn't exist. Spark raises an error and suggests similar column names. When debugging, look for suggestions like these - they often point you to the fix.

12. Let's practice!

We've selected columns, filtered rows, and queried data using both Python and SQL. Now it's time to practice.
