Model Saving and Loading

Model Saving

Save model(s) to mounted storage

```python
lrcvModel.save("/mnt/trainedmodels/lr")
rfcvModel.save("/mnt/trainedmodels/rf")
dtcvModel.save("/mnt/trainedmodels/dt")
display(dbutils.fs.ls("/mnt/trainedmodels/"))
```

Remove a model

A saved Spark MLlib model is actually a directory of files, so deleting one means recursively removing the files in the model's directory and then the directory itself. Passing True as the second argument to dbutils.fs.rm makes the delete recursive:

```python
dbutils.fs.rm("/mnt/trainedmodels/dt", True)
```
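Outside Databricks, the same recursive delete can be sketched with Python's standard library; the directory layout below is a hypothetical stand-in for an MLlib save directory, not a real model:

```python
import os
import shutil
import tempfile

# Hypothetical stand-in for a saved MLlib model directory
model_dir = os.path.join(tempfile.mkdtemp(), "dt")
os.makedirs(os.path.join(model_dir, "metadata"))
with open(os.path.join(model_dir, "metadata", "part-00000"), "w") as f:
    f.write("{}")

# Recursively delete the model's files and the directory itself
shutil.rmtree(model_dir)
print(os.path.exists(model_dir))  # False
```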

Score new data using a trained model

Load in required libraries

```python
from pyspark.ml.tuning import CrossValidatorModel
from pyspark.ml import PipelineModel
from pyspark.sql.functions import col, round, udf
from pyspark.sql.types import IntegerType, FloatType
```

Load in the transformation pipeline

```python
pipeline = PipelineModel.load("/mnt/trainedmodels/pipeline/")

## Apply the already-fitted pipeline to the new data
transformeddataset = pipeline.transform(dataset)
```

Load in the trained model

```python
model = CrossValidatorModel.load("/mnt/trainedmodels/lr/")

## Score the data using the best model found by cross-validation
scoreddataset = model.bestModel.transform(transformeddataset)
```

Remove unnecessary columns from the scored data

```python
## UDF to extract the positive-class probability from the probability vector
getprob = udf(lambda v: float(v[1]), FloatType())

## Select only the necessary columns
output = scoreddataset.select(col("ID"),
                              col("label"),
                              col("rawPrediction"),
                              getprob(col("probability")).alias("probability"),
                              col("prediction"))
```
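The extraction that getprob performs can be checked in plain Python: for a binary classifier, the probability column holds a two-element vector of [P(label=0), P(label=1)], and v[1] is the positive-class probability. The vector value below is made up for illustration:

```python
# Same logic as the getprob UDF, applied to a plain list
extract_prob = lambda v: float(v[1])

row_probability = [0.3, 0.7]  # hypothetical probability vector for one row
print(extract_prob(row_probability))  # 0.7
```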