Loading natural language text
1. Loading natural language text
Hello and welcome to chapter two. The first step in working with natural language processing is to load your text data. You will load natural language text into a dataframe while discarding unwanted data.
2. The dataset
We're using a public domain text called "The Project Gutenberg eBook of The Adventures of Sherlock Holmes". Project Gutenberg offers thousands of free eBooks. These are a great source of natural language text. To learn more, go to gutenberg.org.
3. Loading text
To load text, call spark.read.text(). The first argument gives the file path. df.first() gets the first row, and df.count() counts the rows.
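A minimal sketch of these calls, assuming a SparkSession named spark and an example file name:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # "sherlock.txt" is an example path; any plain-text file works the same way.
    df = spark.read.text("sherlock.txt")
    print(df.first())   # first row, a Row with a single 'value' column
    print(df.count())   # number of lines in the file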
4. Loading parquet
The spark.read operation supports multiple formats. For example, use spark.read.load to load a parquet file. Parquet is a Hadoop file format for storing data structures.
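For example (the file name is illustrative; load() reads parquet by default):

    df_parquet = spark.read.load("sherlock.parquet")
    df_parquet.show(5)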
5. Loaded text
The following show command prints the first 15 rows. Setting truncate=False lets it print longer rows.
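In code, using the dataframe loaded above:

    df.show(15, truncate=False)   # print the first 15 rows without truncating long lines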
6. Lower case operation
The lower operation converts a column to lower case. It calls the result "lower(value)".
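For example:

    from pyspark.sql.functions import lower, col

    df.select(lower(col('value'))).show(5)   # result column is named "lower(value)"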
7. Alias operation
The alias operation allows us to give a new column a simpler name.
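Here, aliasing the lowercased column to 'v' (the name 'v' is just the convention used in the sketches below):

    df = df.select(lower(col('value')).alias('v'))
    df.show(5)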
8. Replacing text
We plan on eventually removing all punctuation that separates sentences. But first we'll handle punctuation that is embedded in contractions. The regexp_replace operation replaces values that match a pattern. The first argument is the column name, the second argument is the pattern to be replaced, and the third argument is the replacement text: every occurrence of the pattern is replaced with the third argument. To prevent the period from being interpreted as a special character in the pattern, we put a backslash in front of it. This is called "escaping" it. We must also escape other special characters, such as a single quote.
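A sketch of the pattern; the specific substitutions ("mr." and the quote in "don't") are illustrative, not the exact replacements used on the slide:

    from pyspark.sql.functions import regexp_replace

    # Escape the period so the regex treats it literally.
    df = df.select(regexp_replace('v', r'mr\.', 'mr').alias('v'))
    # Escape the single quote embedded in contractions.
    df = df.select(regexp_replace('v', r"don\'t", 'do not').alias('v'))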
9. Tokenizing text
The split operation separates a string into individual tokens. The second argument gives the list of characters on which to split. Here it is a space.
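For example, splitting the 'v' column on a single space:

    from pyspark.sql.functions import split

    df_words = df.select(split('v', ' ').alias('words'))
    df_words.show(5, truncate=False)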
10. Tokenizing text – output
It returns an array of strings. Notice how punctuation is messing up some words, such as "welcome" and "texts".
11. Split characters are discarded
Splitting on unwanted symbols in addition to spaces discards the unwanted symbols.
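A sketch using a regex character class; the exact set of symbols is illustrative:

    # Split on a space or any of the listed punctuation characters.
    df_words = df.select(split('v', "[ ,.!?*]").alias('words'))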
"welcome" and "texts" no longer have asterisks.13. Exploding an array
13. Exploding an array
explode() takes an array of things and puts each thing on its own row, preserving the order.
14. Exploding an array – output
The following puts every word into its own row.
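Roughly, with the 'words' column from the previous step:

    from pyspark.sql.functions import explode

    df_word = df_words.select(explode('words').alias('word'))
    df_word.show(10)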
15. Explode increases row count
The previous command increased the number of rows from 5,500 to over 131 thousand.
16. Removing empty rows
To remove empty rows, use the length operation as the condition for a where operation. Notice how we originally count 131,404 rows. After removing all blank rows, we count only 107,320 rows.
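A sketch of the filter, with the counts quoted above as comments:

    from pyspark.sql.functions import length

    df_word.count()                              # 131,404
    df_word = df_word.where(length('word') > 0)
    df_word.count()                              # 107,320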
17. Adding a row id column
The monotonically_increasing_id() operation efficiently creates a column of integers that are always increasing.
18. Adding a row id column – output
Here we are using it to create a column of unique IDs for each row.
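For example (the column name 'id' is a choice, not a requirement):

    from pyspark.sql.functions import monotonically_increasing_id

    df_word = df_word.withColumn('id', monotonically_increasing_id())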
19. Partitioning the data
Partitioning allows Spark to parallelize operations. We will organize the data to allow window functions to use the partition clause. The when/otherwise operation is a case statement. The first argument gives the condition, and the second argument gives the desired value for the column. You can chain multiple when() operations. The last when() operation is followed by an otherwise() clause that gives the column value used if none of the previous conditions applies. When combined with the withColumn operation, when/otherwise groups the data into chapters. Repeating this adds a part id column.
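A minimal sketch of the chained when/otherwise pattern; the conditions and boundary values below are placeholders, not the actual chapter boundaries:

    from pyspark.sql.functions import when, col

    # Assign a part id from the row id; real code would key off chapter headings.
    df = df_word.withColumn('part',
                            when(col('id') < 50000, 0)
                            .when(col('id') < 100000, 1)
                            .otherwise(2))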
20. Partitioning the data – output
We can now tell Spark to split this data into parts.
21. Repartitioning on a column
The first line repartitions the data in df, creating a new dataframe, df2. The first argument gives the desired number of partitions, here 4. The second argument is a column, saying, "put rows having the same part column value into the same partition." rdd.getNumPartitions() gives the number of partitions.
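In code:

    df2 = df.repartition(4, 'part')
    print(df2.rdd.getNumPartitions())   # 4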
22. Reading pre-partitioned text
Suppose you had a folder named sherlock_parts containing 14 files.
23. Reading pre-partitioned text
spark.read.text() tells Spark to load all of the text files in the folder into a dataframe. If the available parallelism is greater than one and the folder contains more than one file, Spark reads the files in parallel and distributes them over multiple partitions.
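Roughly:

    # Passing a folder path loads every text file it contains.
    df_parts = spark.read.text('sherlock_parts')
    print(df_parts.rdd.getNumPartitions())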
24. Let's practice!
Now it's your turn to load and modify text data!