**Note:** Make sure your training and test data are already vectorized and ready to go before you begin; the model can't be fit to unprepped data.

```python
from pyspark.ml.regression import LinearRegression
from pyspark.ml.tuning import ParamGridBuilder, CrossValidator
from pyspark.ml.evaluation import RegressionEvaluator
```

```python
lr = LinearRegression(labelCol="label", featuresCol="features")
```

```python
lrparamGrid = (ParamGridBuilder()
               .addGrid(lr.regParam, [0.001, 0.01, 0.1, 0.5, 1.0, 2.0])
               # .addGrid(lr.regParam, [0.01, 0.1, 0.5])
               .addGrid(lr.elasticNetParam, [0.0, 0.25, 0.5, 0.75, 1.0])
               # .addGrid(lr.elasticNetParam, [0.0, 0.5, 1.0])
               .addGrid(lr.maxIter, [1, 5, 10, 20, 50])
               # .addGrid(lr.maxIter, [1, 5, 10])
               .build())
```
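It's worth being aware of how quickly a grid like this multiplies out, since every combination is trained in every fold. A quick plain-Python sanity check (the lists below just mirror the grid values above):

```python
from itertools import product

# Stand-in lists mirroring the three hyperparameter grids above
reg_params = [0.001, 0.01, 0.1, 0.5, 1.0, 2.0]
elastic_net_params = [0.0, 0.25, 0.5, 0.75, 1.0]
max_iters = [1, 5, 10, 20, 50]

combos = list(product(reg_params, elastic_net_params, max_iters))
num_folds = 5

print(len(combos))              # 150 parameter combinations
print(len(combos) * num_folds)  # 750 model fits during cross-validation
```

This is why the commented-out smaller grids are handy: the reduced lists cut the combination count substantially for quicker iteration.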

```python
lrevaluator = RegressionEvaluator(predictionCol="prediction", labelCol="label",
                                  metricName="rmse")
```

```python
# Create 5-fold CrossValidator
lrcv = CrossValidator(estimator=lr,
                      estimatorParamMaps=lrparamGrid,
                      evaluator=lrevaluator,
                      numFolds=5)
```

```python
lrcvModel = lrcv.fit(train)
print(lrcvModel)
```

```python
lrcvSummary = lrcvModel.bestModel.summary
print("Coefficient Standard Errors: " + str(lrcvSummary.coefficientStandardErrors))
print("P Values: " + str(lrcvSummary.pValues))  # Last element is the intercept
```

```python
lrpredictions = lrcvModel.transform(test)

print('RMSE:', lrevaluator.evaluate(lrpredictions))
```

**Note:** When you use the `CrossValidator` function to set up cross-validation of your models, the resulting model object records the results of every run, but only the best model is used when you interact with the model object through functions like `evaluate` or `transform`.
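Under the hood, best-model selection is straightforward: the evaluator metric is averaged across the folds for each parameter map, and the combination with the best average wins (for RMSE, lower is better; PySpark exposes these averages as `lrcvModel.avgMetrics`). A minimal plain-Python sketch of that selection logic, using hypothetical fold-level RMSE scores for three parameter combinations:

```python
# Hypothetical per-fold RMSE scores for three parameter combinations
fold_scores = {
    "regParam=0.01": [0.92, 0.88, 0.90, 0.91, 0.89],
    "regParam=0.1":  [0.85, 0.84, 0.86, 0.83, 0.87],
    "regParam=0.5":  [0.95, 0.97, 0.96, 0.94, 0.98],
}

# Average each combination's metric across the five folds
avg_metrics = {params: sum(s) / len(s) for params, s in fold_scores.items()}

# For RMSE, lower is better, so the arg-min is the "best model"
best_params = min(avg_metrics, key=avg_metrics.get)
print(best_params)  # regParam=0.1
```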