Transforming text to vector format
You learned how to split sentences and transform an array of words into a numerical vector using a CountVectorizer.
A dataframe df is provided with the following columns: sentence, in, and out. Each column is an array of strings. sentence is a list of words representing a sentence from a textbook. The out column gives the last word of sentence. The in column is obtained by removing the last word from sentence.
The CountVectorizer model expects a dataframe having a column words and creates a column vec.
You will first perform a transform that adds an invec column, which looks like the following:
+----------------------+-------+------------------------------------+
|in                    |out    |invec                               |
+----------------------+-------+------------------------------------+
|[then, how, many, are]|[there]|(126,[3,18,28,30],[1.0,1.0,1.0,1.0])|
|[how]                 |[many] |(126,[28],[1.0])                    |
|[i, donot]            |[know] |(126,[15,78],[1.0,1.0])             |
+----------------------+-------+------------------------------------+
only showing top 3 rows
Then you will perform a second transform, which looks like the following:
+------------------------------------+----------------+
|invec                               |outvec          |
+------------------------------------+----------------+
|(126,[3,18,28,30],[1.0,1.0,1.0,1.0])|(126,[11],[1.0])|
|(126,[28],[1.0])                    |(126,[18],[1.0])|
|(126,[15,78],[1.0,1.0])             |(126,[21],[1.0])|
+------------------------------------+----------------+
only showing top 3 rows
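Each vector in the tables above is printed in sparse form: the vector size (126, the vocabulary size), then the indices of the non-zero entries, then the counts at those indices. In the sample rows, "how" maps to index 28 and "many" to index 18, which is why both indices also appear in the first row's invec. The following is a minimal pure-Python sketch of that encoding, not Spark's implementation. The to_sparse helper and the four-word toy vocabulary are hypothetical and chosen only for illustration.

```python
from collections import Counter

def to_sparse(words, vocab):
    """Return (size, indices, values) for the words found in vocab.

    vocab maps each known word to its position in the vector.
    Words missing from vocab are ignored in this sketch.
    """
    counts = Counter(vocab[w] for w in words if w in vocab)
    indices = sorted(counts)
    values = [float(counts[i]) for i in indices]
    return len(vocab), indices, values

# Hypothetical toy vocabulary (the exercise's real vocabulary has 126 entries)
vocab = {'how': 0, 'many': 1, 'are': 2, 'there': 3}

print(to_sparse(['how', 'many', 'are'], vocab))  # (4, [0, 1, 2], [1.0, 1.0, 1.0])
print(to_sparse(['how', 'how'], vocab))          # (4, [0], [2.0])
```

A one-word array such as the out column therefore becomes a vector with a single non-zero entry, as in the outvec column above.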
This exercise is part of the course Introduction to Spark SQL in Python.
Exercise instructions
- Create a dataframe called result by using model to transform() df. result has the columns sentence, in, out, and invec. invec is the vector transformation of the in column.
- Add a column to result called outvec. result now has the columns sentence, in, out, invec, and outvec.
Hands-on interactive exercise
Have a go at this exercise by completing this sample code.
# Transform df using model
result = model.____(df.withColumnRenamed('in', 'words'))\
    .withColumnRenamed('words', 'in')\
    .withColumnRenamed('vec', 'invec')
result.drop('sentence').show(3, False)

# Add a column based on the out column called outvec
result = model.transform(result.withColumnRenamed('out', 'words'))\
    .withColumnRenamed('words', 'out')\
    .withColumnRenamed('vec', '____')
result.select('invec', 'outvec').show(3, False)
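Both steps above follow the same pattern: temporarily rename the source column to words, apply the model, then restore the original name and rename vec to the target name. One way to capture that pattern is a small helper, sketched below. The name vectorize_column is hypothetical and not part of the course code, and the sketch assumes, as stated in this exercise, that the model reads a column named words and writes a column named vec.

```python
def vectorize_column(model, df, source_col, target_col):
    """Apply model to source_col of df and store the result in target_col.

    Assumes the model reads a column named 'words' and writes a column
    named 'vec', as described in this exercise.
    """
    # Give the model the column name it expects
    renamed = df.withColumnRenamed(source_col, 'words')
    return (model.transform(renamed)
            .withColumnRenamed('words', source_col)  # restore the original name
            .withColumnRenamed('vec', target_col))   # name the new vector column
```

With such a helper, the first step would read vectorize_column(model, df, 'in', 'invec'), and the second step would apply the same call to the out column.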