Get Started

Creating and querying a SQL table in Spark

1. Spark SQL

Hello and welcome to this lesson about Spark SQL. Spark provides a programming abstraction called DataFrames and can also act as a distributed SQL query engine.

2. Create SQL table and query it

In this lesson we will create a SQL table from a dataframe and then query it.

3. Load a dataframe from file

To load a comma-separated file into a dataframe called df, set df = spark.read.csv, with the first argument giving the name of the file. If the first line of the data gives the column names, set the header argument to True.
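
For example, a minimal sketch, assuming a file named "schedule.csv" (the name is a placeholder):

df = spark.read.csv("schedule.csv", header=True)  # header=True takes column names from the first row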

4. Create SQL table and query it

Now, to create a Spark table from this dataframe, do df.createOrReplaceTempView, with the first argument giving the desired name of the table. Next, to run a query, do spark.sql, with the first argument giving the query.
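
A short sketch of these two steps, assuming the table is named "schedule" (a placeholder name):

df.createOrReplaceTempView("schedule")        # register the dataframe as a SQL table
result = spark.sql("SELECT * FROM schedule")  # run a query against that table
result.show()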

5. Inspecting table schema

To inspect the column names of this table, set result = spark.sql, with the first argument set to the string “SHOW COLUMNS FROM table”. Another way is to use the query “SELECT * FROM table LIMIT 0”. Yet another way is the “DESCRIBE table” query. To see the column names of result = spark.sql(query), we can either do result.show() and visually inspect the output, or do print(result.columns).
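
The three approaches might look like this, again assuming a table named "schedule":

spark.sql("SHOW COLUMNS FROM schedule").show()
spark.sql("SELECT * FROM schedule LIMIT 0").show()
spark.sql("DESCRIBE schedule").show()

result = spark.sql("SELECT * FROM schedule LIMIT 0")
print(result.columns)  # the table's column names as a Python list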

6. Dataframe

The dataframe is a fundamental data abstraction in Spark.

7. Dataframe

A Spark DataFrame is a distributed collection of data organized into named columns.

8. Tabular data

It is conceptually equivalent to a table in a relational database; such data is often simply called “tabular” data.

9. Two dataframes

We could have two dataframes with the same columns but containing different data.

10. Two dataframes

11. Two dataframes concatenated

We could then concatenate the rows of these two dataframes into a single dataframe. Recall that a Spark DataFrame is a distributed collection of data organized into named columns. What do we mean by “distributed”?
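
A sketch of this, assuming two dataframes df1 and df2 with the same columns (the names are placeholders):

df_concat = df1.union(df2)  # append the rows of df2 to the rows of df1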

12. Two dataframes (1)

Spark can split this dataset

13. Two dataframes (2)

into parts

14. Two dataframes (3)

then store

15. Two dataframes (4)

each part

16. Two dataframes (5)

on a

17. Two dataframes (6)

different server.

18. Two dataframes - distributed

In this case, Spark is partitioning the data and distributing it automatically, on our behalf. This is one technique that Spark uses to handle large datasets, even though each server may not have enough storage to hold the entire dataset on its own. What’s more, Spark allows us to treat a dataframe like a table, and query it using SQL.
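
As a small, optional illustration, the partitioning can be inspected and changed from PySpark:

print(df.rdd.getNumPartitions())  # how many partitions the data is currently split into
df = df.repartition(4)            # ask Spark to redistribute the data into 4 partitions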

19. SQL

SQL

20. Structured Query Language

stands for “Structured Query Language”

21. Querying

A query tells the computer what to fetch.

22. Distributed data

What’s useful about the Spark SQL table is that it allows us to take the data that is in a dataframe, namely, a distributed collection of rows having named columns, and treat it as a single table,

23. Distributed data + query

and fetch data from it using an SQL query.

24. Loading delimited text

We often use an instance of a SparkSession object. By convention this is provided in a variable called "spark". Some environments, such as the PySpark shell, automatically provide a SparkSession instance.
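
If one is not provided automatically, a SparkSession can be created explicitly; the application name below is arbitrary:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("lesson").getOrCreate()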

25. Loading delimited text

The following Spark command reads delimited text data from a file into a dataframe. One of its options tells it to use the first row to define the names of the columns. It automatically splits each row into columns using the delimiter, which is a comma by default but can be changed.
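
A sketch of both options, assuming a tab-delimited file named "schedule.tsv" (a placeholder):

df = spark.read.csv("schedule.tsv", header=True, sep="\t")  # column names from first row, tab as delimiter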

26. Pyspark

Let’s load

27. Pyspark cursor

some data

28. Pyspark

into

29. Pyspark cursor

a dataframe

30. Pyspark

convert it

31. Pyspark cursor

into

32. Pyspark

a sql

33. Pyspark cursor

table

34. Pyspark

and

35. Pyspark cursor

query

36. Pyspark

it.

37. Pyspark cursor
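
Putting the steps from the preceding slides together, a minimal end-to-end sketch (the file name "schedule.csv" and the table name "schedule" are placeholders):

df = spark.read.csv("schedule.csv", header=True)  # load some data into a dataframe
df.createOrReplaceTempView("schedule")            # convert it into a SQL table
spark.sql("SELECT * FROM schedule").show()        # and query it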

38. Let's practice