1. Spark SQL
Hello and welcome to this lesson about Spark SQL.
Spark SQL provides a programming abstraction called DataFrames and can also act as a distributed SQL query engine.
2. Create SQL table and query it
In this lesson we will create a SQL table from a dataframe
and then query it.
3. Load a dataframe from file
To load a comma-separated file into a dataframe called df, set df = spark.read.csv, where the first argument gives the name of the file.
If the first line of the data gives the column names, set the header argument to True.
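A minimal sketch of this step (the file name here is just an example):

    # Read a CSV file, using the first row for column names.
    df = spark.read.csv("people.csv", header=True)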
4. Create SQL table and query it
Now, to create a Spark table from this dataframe, do df.createOrReplaceTempView, with the first argument giving the desired name of the table.
Next, to run a query, do spark.sql, with the first argument giving the query.
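For example (the table name "table" and the query are illustrative):

    # Register the dataframe as a SQL temporary view named "table".
    df.createOrReplaceTempView("table")
    # Run a SQL query against it; the result is itself a dataframe.
    result = spark.sql("SELECT * FROM table LIMIT 5")
    result.show()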
5. Inspecting table schema
To inspect the column names of this table, set result = spark.sql, with the first argument set to the string “SHOW COLUMNS FROM table”.
Another way is to use the query “SELECT * FROM table LIMIT 0”.
Yet another way is the “DESCRIBE table” query.
To see the column names of result = spark.sql(query), we can either call result.show() and visually inspect the output, or
call print(result.columns).
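All three approaches look roughly like this (assuming the view was registered under the name "table"):

    # Each of these returns a dataframe describing the table's columns.
    spark.sql("SHOW COLUMNS FROM table").show()
    spark.sql("SELECT * FROM table LIMIT 0").show()
    spark.sql("DESCRIBE table").show()
    # Or inspect the columns of a query result programmatically:
    result = spark.sql("SELECT * FROM table LIMIT 0")
    print(result.columns)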
6. Dataframe
The dataframe is a fundamental data abstraction in Spark.
7. Dataframe
A Spark DataFrame is a distributed collection of data organized into named columns.
8. Tabular data
It is conceptually equivalent to a table in a relational database, also called, simply, “tabular” data.
9. Two dataframes
We could have two dataframes having the same types of columns, and containing different data.
10. Two dataframes
11. Two dataframes concatenated
We could then concatenate the rows of data in these two dataframes into a single dataframe. Recall that a Spark DataFrame is a distributed collection of data organized into named columns. What do we mean by “distributed”?
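One way to do this, assuming two dataframes df1 and df2 that share the same schema, is union (a sketch, not the only option):

    # Stack the rows of df2 underneath the rows of df1.
    df3 = df1.union(df2)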
12. Two dataframes (1)
Spark can split this dataset
13. Two dataframes (2)
into parts
14. Two dataframes (3)
then store
15. Two dataframes (4)
each part
16. Two dataframes (5)
on a
17. Two dataframes (6)
different server.
18. Two dataframes - distributed
In this case, Spark is partitioning the data and distributing it automatically, on our behalf. This is one technique that Spark uses to handle large datasets, even though each server may not have enough storage to hold the entire dataset on its own. What’s more, Spark allows us to treat a dataframe like a table, and query it using SQL.
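We can ask Spark how many partitions it is using for a dataframe (the count depends on the data and the cluster):

    # Number of partitions Spark chose for this dataframe.
    print(df.rdd.getNumPartitions())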
19. SQL
SQL
20. Structured Query Language
stands for “Structured Query Language”
21. Querying
A query tells the computer what to fetch.
22. Distributed data
What’s useful about the Spark SQL table is that it allows us to take the data that is in a dataframe, namely, a distributed collection of rows having named columns, and treat it as a single table,
23. Distributed data + query
and fetch data from it using an SQL query.
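For instance, a hypothetical aggregation over a column called name (both the column and the view name are illustrative):

    # Count rows per value of the (assumed) name column.
    spark.sql("SELECT name, COUNT(*) AS n FROM table GROUP BY name").show()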
24. Loading delimited text
We often use an instance of a SparkSession object. By convention this is provided in a variable called "spark". Some environments, such as the PySpark shell, automatically provide an instance of a SparkSession.
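When a session is not provided automatically, one can be created (or retrieved) like this:

    from pyspark.sql import SparkSession

    # Reuse an existing SparkSession if there is one, otherwise create it.
    spark = SparkSession.builder.getOrCreate()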
25. Loading delimited text
The following Spark command reads delimited text data into a dataframe from a file. One of its options allows it to use the first row to define the names of the columns.
It automatically splits each row into columns using the delimiter, which by default is a comma but which can be changed.
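A sketch of such a command, here with a hypothetical tab-delimited file (sep overrides the default comma):

    # header=True takes column names from the first row;
    # sep changes the delimiter from the default comma.
    df = spark.read.csv("data.tsv", header=True, sep="\t")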
26. Pyspark
Let’s load
27. Pyspark cursor
some data
28. Pyspark
into
29. Pyspark cursor
a dataframe
30. Pyspark
convert it
31. Pyspark cursor
into
32. Pyspark
a SQL
33. Pyspark cursor
table
34. Pyspark
and
35. Pyspark cursor
query
36. Pyspark
it.
37. Pyspark cursor
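Putting those steps together, a minimal end-to-end sketch (file name, table name, and query are all illustrative):

    # Load a CSV file into a dataframe.
    df = spark.read.csv("people.csv", header=True)
    # Register it as a SQL temporary view.
    df.createOrReplaceTempView("table")
    # Query the view with SQL and display the result.
    spark.sql("SELECT * FROM table LIMIT 10").show()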
38. Let's practice