Transforming text to vector format

You learned how to split sentences and transform an array of words into a numerical vector using a CountVectorizer.

A dataframe df is provided having the following columns: sentence, in, and out. Each column is an array of strings. sentence is a list of words representing a sentence from a text book. The out column gives the last word of sentence. The in column is obtained by removing the last word from sentence.

The CountVectorizer model expects a dataframe having a column words and creates a column vec.

You will first perform a transform that adds an invec column, which looks like the following:

+----------------------+-------+------------------------------------+
|in                    |out    |invec                               |
+----------------------+-------+------------------------------------+
|[then, how, many, are]|[there]|(126,[3,18,28,30],[1.0,1.0,1.0,1.0])|
|[how]                 |[many] |(126,[28],[1.0])                    |
|[i, donot]            |[know] |(126,[15,78],[1.0,1.0])             |
+----------------------+-------+------------------------------------+
only showing top 3 rows

Then you will perform a second transform, which looks like the following:

+------------------------------------+----------------+
|invec                               |outvec          |
+------------------------------------+----------------+
|(126,[3,18,28,30],[1.0,1.0,1.0,1.0])|(126,[11],[1.0])|
|(126,[28],[1.0])                    |(126,[18],[1.0])|
|(126,[15,78],[1.0,1.0])             |(126,[21],[1.0])|
+------------------------------------+----------------+
only showing top 3 rows

Create a dataframe called result by using model to transform() df. result has the columns sentence, in, out, and invec. invec is the vector transformation of the in column.
Add a column to result called outvec. result now has the columns sentence, in, out, invec, and outvec.

PySpark SQL

Using Window Function SQL for Natural Language Processing

Caching, Logging, and the Spark UI

Text Classification

Exercise

Transforming text to vector format

Instructions