1. Creating feature data for classification
Welcome back.
Now that we know how to create UDFs and know the difference between sparse vector and dense array formats, we are ready to put data into a format that can be fed into a machine learning algorithm. PySpark is similar to Python in many respects. However, there are some notable differences.
2. Transforming a dense array
Spark UDFs don't automatically cast objects the way you might expect Python to. Suppose we have a dataframe with a column containing a dense array. We want to extract the first item from this array and convert it to an integer.
Here is an attempt at creating such a UDF. It accepts an argument that is an array and indexes into it to return the first item. It does several things correctly. However, it still breaks, because it is missing one crucial step. First, let's walk through what it is doing right. Note the if and else clauses. UDFs typically require checks to ensure that the argument is of the correct type. Here, the if clause ensures that x is an instantiated object. Next, it ensures that it is an array, by checking whether the object has an attribute 'toArray'. Finally, it confirms that the array is nonempty by calling the numNonzeros() operation. This is a fast and reliable way of determining that an array contains at least one nonzero item. If any of these checks fails, it returns the integer 0. Otherwise, it returns the first element of the array. A UDF also requires that we specify the type of the result. Here, that is given by IntegerType().
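Here is a minimal sketch of what such a UDF might look like, assuming the column holds pyspark.ml.linalg vectors. The helper name first_item, and the name k_udf used later in this video, are illustrative, and the cast discussed on the next slide is deliberately left out.

```python
from pyspark.sql.functions import udf
from pyspark.sql.types import IntegerType

def first_item(x):
    # Guard: x exists, looks like a vector, and has at least one nonzero entry
    if x and hasattr(x, "toArray") and x.numNonzeros():
        return x.toArray()[0]  # a numpy.float64, not a Python int -- this is the missing step
    else:
        return 0

# Declare the result type as IntegerType()
k_udf = udf(first_item, IntegerType())
```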
3. Transforming a dense array
Let's wrap a call to this in a try-except exception handler.
This will return the following cryptic error.
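As a sketch, the wrapped call might look like the following; df and the 'outvec' column come from the example shown shortly, and the exact error text depends on your Spark version.

```python
try:
    df.select(k_udf('outvec')).first()
except Exception as e:
    # Depending on the Spark version, this surfaces a cryptic Py4J error
    print(e)
```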
The way to fix this is to cast the returned item to an int.
4. UDF return type must be properly cast
This UDF functions properly. You must typically cast the result to the proper type; Spark is generally less forgiving about this than Python. The type of error we saw was generated by Py4J, a library that is installed with Spark. Py4J enables Python programs to access Java objects in a Java Virtual Machine, and it also enables Java programs to call back Python objects. An error message like this often results when data is not handed off properly between Python and the JVM.
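The corrected sketch simply wraps the returned value in int() so it matches the declared IntegerType().

```python
def first_item(x):
    if x and hasattr(x, "toArray") and x.numNonzeros():
        return int(x.toArray()[0])  # cast to a plain Python int
    else:
        return 0

k_udf = udf(first_item, IntegerType())
```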
5. The UDF in action
Here's the previous UDF in action.
Here are the first few rows of an example dataframe. The outvec column is a dense array containing one item. We want to replace the outvec column with a column containing an integer corresponding to the first item in the array. We'll call the new column 'label'.
This statement achieves the desired result. It applies the k_udf UDF to the 'outvec' column, calling the resulting column 'label'. Then, it drops the 'outvec' column.
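A sketch of that statement, assuming it uses withColumn rather than an equivalent select:

```python
# Add an integer 'label' column computed from 'outvec', then drop 'outvec'
df = df.withColumn('label', k_udf('outvec')).drop('outvec')
df.show(4)
```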
Here is what that looks like.
6. CountVectorizer
Previously we covered ETS, which means Extract, Transform, and Select. CountVectorizer is typically used as a Feature Extractor. It converts an array of strings into a sparse vector.
7. Fitting the CountVectorizer
First, import CountVectorizer from pyspark.ml.feature.
To instantiate it, specify the names of the input and output columns using the inputCol and outputCol arguments.
We fit it to the dataframe by calling fit(), providing the dataframe as the argument.
Here is what that might look like.
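A sketch of those steps, using 'words' and 'features' as example column names; the transform step is included because it produces the output shown on the next slide.

```python
from pyspark.ml.feature import CountVectorizer

# Instantiate, naming the input (array of strings) and output (sparse vector) columns
count_vectorizer = CountVectorizer(inputCol='words', outputCol='features')

# Fit to the dataframe to learn the vocabulary
model = count_vectorizer.fit(df)

# Apply the fitted model, adding the sparse 'features' column
result = model.transform(df)
```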
Recall that a sparse vector is a compact representation of an array having many zeroes.
In the first row, the dense word array is 'hello world'. On the right is its sparse vector form. This vector has length 10 because there are 10 tokens in the vocabulary. The middle array of integers says that hello is token number 7 in the vocabulary, and that world is token number 9. The last array of decimal values tells us that each word appeared once in the original array.
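In code terms, that sparse vector would read roughly like this:

```python
from pyspark.ml.linalg import SparseVector

# ['hello', 'world'] against a 10-token vocabulary:
# size 10, nonzero counts at token indices 7 and 9, each word appearing once
SparseVector(10, [7, 9], [1.0, 1.0])
```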
8. Let's practice!
Let's go try what we just learned.