
Introduction to PySpark DataFrames

1. Introduction to PySpark DataFrames

Welcome back! We've already started with DataFrames in PySpark in the previous video, but now, let's dive in deeper.

2. About DataFrames

While we may be familiar with DataFrames from pandas, the key difference in Spark is how the data is distributed. Pandas operates on a single compute instance, while PySpark distributes data across multiple instances, affecting processing speed and data scalability. DataFrames are essential in PySpark for efficiently managing large-scale data across clusters. While they resemble pandas DataFrames, they are designed for much larger datasets. We'll frequently interact with data using PySpark DataFrames, which support various manipulation tasks, such as filtering, grouping, and aggregating on distributed data, making them vital for big data analytics. Additionally, DataFrames support SQL-like operations on tables. As we go through this course, you'll probably notice similarities between pandas DataFrame syntax and PySpark DataFrame syntax. Bear in mind, though, that DataFrames in PySpark operate slightly differently.

3. Creating DataFrames from filestores

Let's start by creating a DataFrame in PySpark. A common method for loading data is `spark.read.csv()`, which reads CSV files into a PySpark DataFrame and lets us define headers and automatically infer schema types. This code loads a CSV, treating the first row as headers and inferring data types for each column. We also pass `header=True` and `inferSchema=True`, which are appropriate for this particular table and are just two of many available arguments.
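As a rough sketch, the call might look like the following. The file name "salaries.csv" and the variable name `salaries_df` are placeholders, not names from the lesson.

```python
# A minimal sketch of loading a CSV into a PySpark DataFrame.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("intro_dataframes").getOrCreate()

salaries_df = spark.read.csv(
    "salaries.csv",    # path to the CSV file (placeholder)
    header=True,       # treat the first row as column names
    inferSchema=True,  # let Spark infer each column's data type
)
```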

4. Printing the DataFrame

Using the `.show()` method, for example `.show(5)`, we can display the first five rows of the DataFrame. While we can also create DataFrames using the `createDataFrame()` function, `spark.read.csv()` is generally faster, offering significant speed improvements at scale, especially as we gain real-world experience with big data.
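For instance, assuming the placeholder `salaries_df` DataFrame from the sketch above:

```python
# Display the first five rows of the DataFrame
salaries_df.show(5)
```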

5. Printing DataFrame Schema

To inspect the schema, use `.printSchema()` to view the DataFrame structure.
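Continuing with the placeholder DataFrame:

```python
# Print each column's name, data type, and whether it allows nulls
salaries_df.printSchema()
```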

6. Basic analytics on PySpark DataFrames

Once we've loaded data into a DataFrame, we can perform basic analytics like aggregations, which summarize data by counting rows, summing values, or calculating averages. These will return an integer or float, typically. For instance, to count the rows in a DataFrame, use the `.count()` method. For more advanced summaries, we can employ the `.groupBy()` and `.agg()` methods. For example, to group data by a column and calculate the average of another, we can use various aggregation functions like standard deviation, sum, and more that you are probably already familiar with. This approach groups data by "gender" and computes the average of "salary_usd", showing how to combine steps for data summarization and quick insights.
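A small sketch of both kinds of summary, again assuming the placeholder `salaries_df` with "gender" and "salary_usd" columns:

```python
from pyspark.sql.functions import avg

# Simple aggregation: count the number of rows (returns an integer)
row_count = salaries_df.count()

# Grouped aggregation: average salary per gender
salaries_df.groupBy("gender").agg(avg("salary_usd")).show()
```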

7. Key functions for PySpark analytics

Some of the most important methods we'll use with DataFrames are `groupBy()`, `agg()`, `filter()`, and `select()`. Let's start with `select()`, which operates like SQL's `SELECT`, returning only the named columns. `filter()` operates like SQL's `WHERE`, keeping rows that match a condition. `groupBy()` works like SQL's `GROUP BY`. `agg()` takes an aggregation function, such as `sum()`, applied to the specific columns we're interested in. In each case, we call the method on the DataFrame and pass the column names, often as a list.
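As an illustrative sketch chaining all four, with column names and the salary threshold carried over from the earlier placeholder examples rather than taken from the lesson:

```python
from pyspark.sql.functions import col, sum as sum_

(salaries_df
    .select("gender", "salary_usd")          # keep only these columns
    .filter(col("salary_usd") > 50000)       # keep rows matching a condition
    .groupBy("gender")                       # group, like SQL's GROUP BY
    .agg(sum_("salary_usd"))                 # aggregate within each group
    .show())
```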

8. Key Functions For Example

For example, we can filter the DataFrame for rows where the value in the age column is greater than 50 and then select only the columns we need.
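A sketch of that example, assuming the DataFrame has an "age" column; the second selected column is illustrative:

```python
# Keep rows where age is greater than 50, then select two columns
salaries_df.filter(salaries_df["age"] > 50).select("age", "salary_usd").show()
```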

9. Let's practice!

Let's practice DataFrames in PySpark!
