
Predicting and evaluating

1. Predicting and evaluating

We've trained a classification model to predict the end word of a sentence. We will now address how to use it to predict, and how to evaluate its accuracy.

2. Applying a model to evaluation data

To apply a trained model to the test data, use the transform() operation. Like many operations in Apache Spark, transform() returns a dataframe. It adds two columns to the dataset: a prediction column and a probability column. The prediction column is a double, even though in our case it only takes the values 0 or 1. The probability column is a vector containing two numbers, each between zero and one, and the two sum to 1. The first number is the estimated probability that the label is 0 -- that the row is not in the class the model was trained to detect. The second number is the estimated probability that the label is 1. The default means of converting the probability into a prediction uses a threshold of 0.5; keeping the probability vector lets us apply a threshold of our own choosing instead. To check whether a prediction was correct, we cast the prediction to an int and compare it to the label, as in the sketch below.
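Here is a minimal sketch of those two steps, assuming a trained model called model and a test dataframe called df_test that has a label column (both names are placeholders, not from the course code).

# Assumed: `model` is a fitted classification model and `df_test` has a `label` column.
from pyspark.sql.functions import col

# Apply the model; transform() returns a new dataframe with added
# `prediction` (double) and `probability` (vector) columns.
predicted = model.transform(df_test)

# Check correctness: cast the double prediction to an int and compare it to the label.
predicted = predicted.withColumn(
    "correct", (col("prediction").cast("int") == col("label")).cast("int"))

predicted.select("label", "prediction", "probability", "correct").show(5)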

3. Evaluating classification accuracy

Suppose you have a trained classification model called model, and some evaluation data in a dataframe called df_eval. To measure the performance of this classification model, we use a metric called Area Under the ROC Curve, or AUC for short. First, we apply the model to the dataframe df_eval using model.evaluate(). This returns a BinaryLogisticRegressionSummary object, which lives in the pyspark.ml.classification module; call it model_stats. Then, we read the areaUnderROC property of model_stats.
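As a short sketch, using the names from the narration (model and df_eval):

# Assumed: `model` is a fitted binary LogisticRegressionModel and
# `df_eval` is a labeled evaluation dataframe.
# evaluate() returns a BinaryLogisticRegressionSummary.
model_stats = model.evaluate(df_eval)

# areaUnderROC is a property of the summary object.
print(model_stats.areaUnderROC)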

4. Example of classifying text

To give you a taste of how this technique performs, here are the results for a classification task based on the Sherlock Holmes text corpus that we have been using. The task was to predict whether or not the last word of a sentence was in the set 'her', 'him', 'he', 'she', 'them', 'us', 'they', 'himself', 'herself', or 'we'. We obtained 5746 examples, comprising an equal number of positive and negative examples. The data was split into 4607 training and 1139 test examples. Training took 21 iterations. The resulting AUC was 0.87.

5. Predicting the end word

We redid the task, this time asking: did the sentence end with the word "it"? As might be expected, there are fewer positive examples available for this task, because fewer sentences end in the word "it". On a test set of 98 examples, the classifier correctly guessed whether the end word was "it" 85% of the time. You can readily imagine extending this idea to predict each word for which there is an adequate supply of data to yield an accurate and reliable prediction.

6. Let's practice!

Let's go try out what we just learned.
