
Data Visualization in PySpark using DataFrames

1. Data Visualization in PySpark using DataFrames

Visualization is an essential part of data analysis. In this video, we will explore some visualization methods that can help us make sense of our data in PySpark DataFrames.

2. What is Data visualization?

Data visualization is a way of representing your data in the form of graphs or charts. It is considered a crucial component of Exploratory Data Analysis (EDA). Several open source tools exist to aid visualization in Python, such as matplotlib, Seaborn, and Bokeh. However, none of these visualization tools can be used directly with PySpark DataFrames. Currently, there are three different methods available to create charts from PySpark DataFrames - the pyspark_dist_explore library, the toPandas method, and the HandySpark library. Let's understand each of these methods with examples.

3. Data Visualization using Pyspark_dist_explore

Pyspark_dist_explore is a plotting library for getting quick insights on data in PySpark DataFrames. There are 3 functions available in Pyspark_dist_explore to create matplotlib graphs while minimizing the amount of computation needed - hist, distplot, and pandas_histogram. Here is an example of creating a histogram using the Pyspark_dist_explore package on the test_df data. First, the CSV file is loaded into a Spark DataFrame using the SparkSession's read-dot-csv method. Then we select the Age column from the test_df DataFrame using the select operation. Finally, we use the hist function of the Pyspark_dist_explore package to plot a histogram of 'Age' in the test_df_age dataset.

4. Using Pandas for plotting DataFrames

The second method of creating charts is to call toPandas on a PySpark DataFrame, which converts it into a pandas DataFrame. After conversion, it's easy to create charts from pandas DataFrames using matplotlib or seaborn plotting tools. In this example, first, the CSV is loaded into a Spark DataFrame using the read-dot-csv method. Next, using the toPandas method, we convert the Spark DataFrame into a pandas DataFrame. Finally, we create a histogram of the "Age" column using matplotlib's hist method. Before we look at the third method, let's take a look at the differences between pandas and Spark DataFrames.

5. Pandas DataFrame vs PySpark DataFrame

Pandas won't work in every case. It is a single-machine tool, constrained by single-machine limits: the size of a pandas DataFrame is limited by your server's memory, and you process it with the power of a single server. In contrast, operations on PySpark DataFrames run in parallel on different nodes in the cluster. With pandas DataFrames, we get the result as soon as we apply an operation, whereas operations on PySpark DataFrames are lazy in nature. A pandas DataFrame can be changed using its methods, but a PySpark DataFrame can't be changed because of its immutable property. Finally, the pandas API supports more operations than PySpark DataFrames.

6. HandySpark method of visualization

The final method of creating charts uses the HandySpark library, a relatively new package. HandySpark is designed to improve the PySpark user experience, especially when it comes to exploratory data analysis, including visualization capabilities. It makes fetching data or computing statistics for columns really easy, returning pandas objects straight away. It brings the long-missing capability of plotting data while retaining the advantage of distributed computation. Here is an example of the HandySpark method for creating a histogram. Just like before, we load the CSV into a PySpark DataFrame using the SparkSession's read-dot-csv method. After creating the DataFrame, we convert it to a HandySpark DataFrame using the toHandy method. Finally, we create a histogram of the Age column using the hist function of the HandySpark library.

7. Let's visualize DataFrames

We have learned three exciting methods of visualizing PySpark DataFrames. Now let's practice creating some charts with them on real-world datasets.