
Text Classification

1. Text Classification

Now that you know how to create feature data, let's use it to train a machine learning model. We're going to pose the question: how hard is it to finish someone else's sentence?

2. Endword Prediction

We have been working with sequences and with text. Now we're going to combine these. We will predict the last word of a sentence.

3. Sequence arrow

Suppose you have a sequence of things and you want to guess its last item.

4. Endword

This could be a sequence of words.

5. Endword bracket

Machine learning can look at previous words and guess the next. Note that the technique we are going to use is not sensitive to the order of the previous words.

6. Shuffle 1

The previous words are now in a different order.

7. Shuffle 2

Here again, the previous words are in another different order. The technique we are going to learn now would make the same guess for each of these three cases. This can be beneficial for applications where the exact order may not matter, such as predicting the next song in a shuffled playlist.

8. Songs

The tokens need not be words. They could be identifiers, such as song IDs in a person's listening history.

9. Videos

Or video IDs. This can be used to recommend a video based on a person's viewing history.

10. Selecting the data

Let's return to our dataset. Previously we created a dataframe containing feature data. It contains a column called "endword", which holds the last word of the sentence. A logistic classifier requires a binary label column indicating whether each row is a positive or negative example. Here's how to add one. Suppose we want to predict whether the endword is a gendered pronoun. The first statement assigns a 1 to all rows where the endword column is in she, he, hers, his, her, and him. The next statement assigns a 0 to all other rows, namely the ones where the endword is NOT in this group of words. We have just created training data that poses the question: is the endword in this group of words, or not? We could also do this one word at a time. Repeating that for every word gives data to train a set of models, one per word, each guessing whether its respective word is the one that ends the sentence.
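
Here's a minimal sketch of those two statements, assuming the feature dataframe is named df; the names df, df_pos, and df_neg are illustrative, not necessarily the ones used in the course:

    from pyspark.sql.functions import col, lit

    pronouns = ['she', 'he', 'hers', 'his', 'her', 'him']

    # Positive examples: rows where the endword is in the group, labeled 1
    df_pos = df.where(col('endword').isin(pronouns)).withColumn('label', lit(1))

    # Negative examples: rows where the endword is NOT in the group, labeled 0
    df_neg = df.where(~col('endword').isin(pronouns)).withColumn('label', lit(0))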

11. Combining the positive and negative data

Previously we created two dataframes: one containing positively labeled examples, and one containing negatively labeled examples. Now we just need to combine these to get our training data. We do that using union().
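
Continuing the sketch, with df_pos and df_neg as above:

    # Stack the positive and negative examples into one labeled dataframe
    df_labeled = df_pos.union(df_neg)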

12. Splitting the data into training and evaluation sets

Next, we split this into a training set and an evaluation set. The dataframe operation randomSplit() splits a dataframe into two subsets. The first argument is a 2-tuple giving the proportions desired in the first and second results, respectively. The second argument is a seed that makes the split reproducible, in case we need to replicate the result. This puts 60% of the data into df_train and 40% into df_eval.
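
For example (the seed value 42 is an arbitrary choice):

    # 60% of rows go to df_train, 40% to df_eval; the seed makes the split reproducible
    df_train, df_eval = df_labeled.randomSplit((0.60, 0.40), 42)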

13. Training

Logistic regression is suitable for this task. To use it, first import LogisticRegression from the ML classification library. Then instantiate the model with LogisticRegression(). It provides arguments for key hyperparameters, including the maximum number of training iterations, here set to 50. regParam and elasticNetParam are regularization parameters associated with elastic net regularization; they are outside the scope of this course. Train the model by calling fit(), with the training dataframe as the first argument. You can inspect the fitted model's summary for details such as the total number of training iterations it actually ran.
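
A sketch of those steps, assuming df_train has the Spark ML default "features" and "label" columns; the regularization values shown are illustrative placeholders, not the course's settings:

    from pyspark.ml.classification import LogisticRegression

    # Cap training at 50 iterations; regParam and elasticNetParam
    # control elastic net regularization (not covered in this course)
    logistic = LogisticRegression(maxIter=50, regParam=0.0, elasticNetParam=0.0)

    # Fit the model on the training dataframe
    model = logistic.fit(df_train)

    # The summary reports fit details, e.g. how many iterations actually ran
    print(model.summary.totalIterations)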

14. Let's practice!

We now have training data we can use to predict whether a particular word ends a sentence. Let's do it!
