Assemble a vector
The last step in the Pipeline is to combine all of the columns containing our features into a single column. This has to be done before modeling can take place because every Spark modeling routine expects the data to be in this form. You can do this by storing each of the values from a column as an entry in a vector. Then, from the model's point of view, every observation is a vector that contains all of the information about it and a label that tells the modeler what value that observation corresponds to.
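As a rough illustration of the idea (not of how Spark implements it), the sketch below uses plain Python to turn each row into one feature vector plus a label. The row values, the `label` field, and the `assemble` helper are made up for this example; only the column names come from the exercise.

```python
# Conceptual sketch only: plain Python, no Spark involved.
# The row values and the "label" field are invented for illustration.
rows = [
    {"month": 1, "air_time": 132.0, "plane_age": 9.0, "label": 0},
    {"month": 7, "air_time": 360.0, "plane_age": 4.0, "label": 1},
]

# The columns whose values should end up in the vector, in this order.
feature_cols = ["month", "air_time", "plane_age"]


def assemble(row, cols):
    """Collect the values of the chosen columns, in order, into one list."""
    return [float(row[c]) for c in cols]


# Each observation becomes a single "features" vector plus its label.
assembled = [
    {"features": assemble(row, feature_cols), "label": row["label"]}
    for row in rows
]

for obs in assembled:
    print(obs)
```

The order of the column list determines the order of the entries in each vector, so the same list has to be used for every row.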
Because of this, the pyspark.ml.feature submodule contains a class called VectorAssembler. This Transformer takes all of the columns you specify and combines them into a new vector column.
This exercise is part of the course
Foundations of PySpark
Exercise instructions
- Create a VectorAssembler by calling VectorAssembler() with the inputCols names as a list and the outputCol name "features".
  - The list of columns should be ["month", "air_time", "carrier_fact", "dest_fact", "plane_age"].
Hands-on interactive exercise
Have a go at this exercise by completing this sample code.
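# VectorAssembler lives in the pyspark.ml.feature submodule
from pyspark.ml.feature import VectorAssembler

# Make a VectorAssembler
vec_assembler = VectorAssembler(inputCols=____, outputCol=____)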