Turning Text into Tables
1. Turning Text into Tables
It's said that 80% of Machine Learning is data preparation. As we'll see in this lesson, this is particularly true for text data. Before you can use Machine Learning algorithms you need to take unstructured text data and create structure, ultimately transforming the data into a table.
2. One record per document
We start with a collection of documents. These documents might be anything from a short snippet of text, like an SMS or email, to a lengthy report or book. Each document will become a record in the table.
3. One document, many columns
The text in each document will be mapped to columns in the table. First the text is split into words or tokens. You then remove short or common words that do not convey much information. The table then records the number of times that each of the remaining words occurs in the text. This table is also known as a "term-document matrix". There are some nuances to the process, but that's the central idea.
4. A selection of children's books
Suppose that your documents are the names of children's books. The raw data might look like this. Your job will be to transform these data into a table with one row per document and a column for each of the words.
5. Removing punctuation
You're interested in words, not punctuation. You'll use regular expressions (regex), a mini-language for pattern matching, to remove the punctuation symbols. Regular expressions are a big topic in their own right and outside the scope of this course; in essence you provide a pattern of symbols or text to match. The hyphen is escaped with a backslash because it has another meaning in the context of regular expressions: escaping it tells Spark to interpret the hyphen literally. You need to specify a column name, books.text, a pattern to be matched (stored in the variable REGEX), and the replacement text, which is simply a space. You now have some double spaces, but you can use a regex to clean those up too.
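Here is a minimal sketch of this step, assuming a DataFrame named books with a text column; the exact pattern is illustrative:

```python
from pyspark.sql.functions import regexp_replace

# Match commas and hyphens; the hyphen is escaped so it is read literally
# rather than as part of a character range.
REGEX = '[,\\-]'

# Replace punctuation with a space, then collapse repeated spaces.
books = books.withColumn('text', regexp_replace(books.text, REGEX, ' '))
books = books.withColumn('text', regexp_replace(books.text, ' +', ' '))
```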
6. Text to tokens
Next you split the text into words or tokens. You create a tokenizer object, giving it the name of the input column containing the text and the name of the output column which will contain the tokens. The tokenizer is then applied to the text using the transform() method. In the results you see a new column in which each document has been transformed into a list of words. As a side effect the words have all been reduced to lower case.
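A sketch of the tokenization step, again assuming the books DataFrame; the column names 'text' and 'tokens' are illustrative:

```python
from pyspark.ml.feature import Tokenizer

# Split the cleaned text into lower-case tokens.
books = Tokenizer(inputCol='text', outputCol='tokens').transform(books)
```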
7. What are stop words?
Some words occur frequently in all of the documents. These common or "stop" words convey very little information, so you will also remove them using an instance of the StopWordsRemover class. This class contains a default list of stop words which can be customized if necessary.
8. Removing stop words
Since you didn't give the input and output column names earlier, you specify them now and then apply the transform() method. You could also have given these names when you created the remover.
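One possible sketch, setting the column names after the remover is created; the names 'tokens' and 'words' are illustrative:

```python
from pyspark.ml.feature import StopWordsRemover

# Create the remover, then supply the input and output column names.
remover = StopWordsRemover()
remover = remover.setInputCol('tokens').setOutputCol('words')
books = remover.transform(books)
```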
9. Feature hashing
Your documents might contain a large variety of words, so in principle the table could end up with an enormous number of columns, many of which would be only sparsely populated. It would also be handy to convert the words into numbers. Enter the hashing trick, which in simple terms converts words into numbers. You create an instance of the HashingTF class, providing the names of the input and output columns. You also give the number of features, which is effectively the largest number that will be produced by the hashing trick. This needs to be sufficiently big to capture the diversity in the words. The output in the hash column is presented in sparse format, which we will talk about more later on. For the moment it's enough to note that there are two lists: the first contains the hashed values and the second indicates how many times each of those values occurs. For example, in the first document the word "long" has a hash of 8 and occurs twice. Similarly, the word "five" has a hash of 6 and occurs once in each of the last two documents.
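A sketch of the hashing step; the numFeatures value and column names are illustrative choices, not taken from the lesson:

```python
from pyspark.ml.feature import HashingTF

# Map each word to a numeric index between 0 and numFeatures - 1 and
# count how often each index occurs in a document.
hasher = HashingTF(inputCol='words', outputCol='hash', numFeatures=32)
books = hasher.transform(books)
```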
10. Dealing with common words
The final step is to account for some words occurring frequently across many documents. If a word appears in many documents then it's probably going to be less useful for building a classifier. We want to weigh the count for a word in a particular document against how frequently that word occurs across all documents. To do this you reduce the effective count for more common words, giving what is known as the "inverse document frequency". The inverse document frequency is generated by the IDF class, which is first fit to the hashed data and then used to generate the weighted counts. The word "five", for example, occurs in multiple documents, so its effective frequency is reduced. Conversely, the word "long" only occurs in one document, so its effective frequency is increased.
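A sketch of the IDF weighting, assuming the hashed column from the previous step; the column names are illustrative:

```python
from pyspark.ml.feature import IDF

# Fit the IDF model to the hashed counts, then produce weighted counts
# that down-weight words appearing in many documents.
books = IDF(inputCol='hash', outputCol='features').fit(books).transform(books)
```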
11. Text ready for Machine Learning!
The inverse document frequencies are precisely what we need for building a Machine Learning model. Let's do that with the SMS data.