Shaping Data with Pipelines
Load in required libraries
from pyspark.ml import Pipeline
from pyspark.ml.feature import OneHotEncoder, OneHotEncoderEstimator, StringIndexer, VectorAssembler

Define which columns are numerical versus categorical (and which is the label column)
label = "dependentvar"
categoricalColumns = ["col1",
"col2"]
numericalColumns = ["num1",
"num2"]
#categoricalColumnsclassVec = ["col1classVec",
# "col2classVec"]
categoricalColumnsclassVec = [c + "classVec" for c in categoricalColumns]

Set up stages
stages = []

Index the categorical columns and perform One Hot Encoding
Index the label column and perform One Hot Encoding
Assemble the data together as a vector
Scale features using Normalization
Set up the transformation pipeline using the stages you've created along the way
Pipeline Saving and Loading
Save the transformation pipeline
Load in the transformation pipeline