Model Saving and Loading

Model Saving

Save model(s) to mounted storage

## Save each trained model to its own directory on mounted storage
lrcvModel.save("/mnt/trainedmodels/lr")
rfcvModel.save("/mnt/trainedmodels/rf")
dtcvModel.save("/mnt/trainedmodels/dt")

## List the saved models
display(dbutils.fs.ls("/mnt/trainedmodels/"))
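Note that save will fail if the target directory already exists. If you retrain and re-save models on a schedule, one option is to overwrite in place using MLlib's writer API; a minimal sketch:

## Overwrite an existing saved model instead of failing
lrcvModel.write().overwrite().save("/mnt/trainedmodels/lr")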

Remove a model

Spark MLlib models are saved as a directory of files rather than a single file, so you will need to delete the model's directory recursively. Passing True as the second argument to dbutils.fs.rm enables the recursive delete.

dbutils.fs.rm("/mnt/trainedmodels/dt", True)
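To confirm the deletion, you can list the parent directory again, the same way as above:

## Confirm the model directory is gone
display(dbutils.fs.ls("/mnt/trainedmodels/"))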

Score new data using a trained model

Load in required libraries

from pyspark.ml.tuning import CrossValidatorModel
from pyspark.ml import PipelineModel
from pyspark.sql.functions import col, udf
from pyspark.sql.types import FloatType

Load in the transformation pipeline

pipeline = PipelineModel.load("/mnt/trainedmodels/pipeline/")
## Apply the fitted transformation pipeline to the new data
transformeddataset = pipeline.transform(dataset)
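The snippet above assumes a DataFrame named dataset containing the new data, with the same input columns the pipeline was originally fit on. As a hypothetical example, it might be loaded like this (the path is illustrative):

## Hypothetical example: load new data to score (illustrative path)
dataset = spark.read.csv("/mnt/data/newdata.csv", header=True, inferSchema=True)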

Load in the trained model

model = CrossValidatorModel.load("/mnt/trainedmodels/lr/")
## Score the data using the best model found during cross-validation
scoreddataset = model.bestModel.transform(transformeddataset)
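Because the loaded object is a CrossValidatorModel, bestModel is the model refit on the best parameter combination found during tuning. If you want to sanity-check what was loaded, a quick sketch:

## Average evaluator metric for each parameter combination tried during tuning
print(model.avgMetrics)
## Parameter settings of the winning model
print(model.bestModel.extractParamMap())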

Remove unnecessary columns from the scored data

## Function to extract the positive-class probability from the probability vector
## (element 1, which assumes a binary classification model)
getprob = udf(lambda v: float(v[1]), FloatType())

## Select only the necessary columns
output = scoreddataset.select(col("ID"),
                              col("label"),
                              col("rawPrediction"),
                              getprob(col("probability")).alias("probability"),
                              col("prediction"))
