
Assemble a vector

The last step in the Pipeline is to combine all of the columns containing our features into a single column. This has to be done before modeling can take place because Spark's modeling routines expect the features in this form. You do this by storing each of the values from a feature column as an entry in a vector. Then, from the model's point of view, every observation is a vector that contains all of its feature values, paired with a label that tells the model what outcome that observation corresponds to.
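To make the idea concrete before bringing in Spark, here is a minimal plain-Python sketch of turning separate feature columns into one vector per observation. The column values and labels are made up for illustration, and the variable names (`columns`, `labels`, `observations`) are hypothetical; this is not how Spark stores the data internally.

```python
# Hypothetical toy data: each key is a feature column, each list holds one value per row.
columns = {
    "month": [1, 7],
    "air_time": [120.0, 95.0],
    "plane_age": [10.0, 3.0],
}
# Hypothetical labels, one per row (for example, 1 = late, 0 = on time).
labels = [0, 1]

# The order of this list fixes the order of the entries in every vector.
feature_names = ["month", "air_time", "plane_age"]

# Build one (feature vector, label) pair per observation.
observations = [
    ([float(columns[name][i]) for name in feature_names], labels[i])
    for i in range(len(labels))
]

for features, label in observations:
    print(features, label)
```

Each row ends up as a single list of numbers plus its label, which is the shape the modeling step works with.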

To build this vector column, the pyspark.ml.feature submodule provides a class called VectorAssembler. This Transformer takes all of the columns you specify and combines them into a single new vector column.

This exercise is part of the course

Foundations of PySpark


Exercise instructions

  • Create a VectorAssembler by calling VectorAssembler() with inputCols set to a list of column names and outputCol set to "features".
    • The list of columns should be ["month", "air_time", "carrier_fact", "dest_fact", "plane_age"].

Hands-on interactive exercise

Have a go at this exercise by completing this sample code.

from pyspark.ml.feature import VectorAssembler

# Make a VectorAssembler (fill in each ____ blank)
vec_assembler = VectorAssembler(inputCols=____, outputCol=____)