Random Forest
Setting Up Random Forest Regression
Load the required libraries
from pyspark.ml.regression import RandomForestRegressor
from pyspark.ml.tuning import ParamGridBuilder, CrossValidator
from pyspark.ml.evaluation import RegressionEvaluator
Initialize Random Forest object
rf = RandomForestRegressor(labelCol="label", featuresCol="features")
Create a parameter grid for tuning the model
rfparamGrid = (ParamGridBuilder()
               #.addGrid(rf.maxDepth, [2, 5, 10, 20, 30])
               .addGrid(rf.maxDepth, [2, 5, 10])
               #.addGrid(rf.maxBins, [10, 20, 40, 80, 100])
               .addGrid(rf.maxBins, [5, 10, 20])
               #.addGrid(rf.numTrees, [5, 20, 50, 100, 500])
               .addGrid(rf.numTrees, [5, 20, 50])
               .build())
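To get a feel for the size of this search space, the grid can be enumerated in plain Python. The sketch below only mirrors what `ParamGridBuilder` produces conceptually, a Cartesian product of the listed values; the `grid` dictionary and variable names are hypothetical and not part of the PySpark API.

```python
from itertools import product

# Hypothetical plain-Python mirror of the active (uncommented) grid above
grid = {
    "maxDepth": [2, 5, 10],
    "maxBins": [5, 10, 20],
    "numTrees": [5, 20, 50],
}

# One dictionary per combination: one value chosen for each parameter
combinations = [dict(zip(grid, values)) for values in product(*grid.values())]

print(len(combinations))  # 27
print(combinations[0])    # {'maxDepth': 2, 'maxBins': 5, 'numTrees': 5}
```

Three values for each of three parameters gives 3 x 3 x 3 = 27 candidate settings, which is also what `len(rfparamGrid)` should report.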
Define how you want the model to be evaluated
rfevaluator = RegressionEvaluator(predictionCol="prediction", labelCol="label", metricName="rmse")
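For reference, RMSE (root mean squared error) is the square root of the average squared difference between labels and predictions, so lower values are better. The following is a minimal plain-Python sketch of that formula, not the evaluator's actual implementation; the `rmse` function and the example values are made up for illustration.

```python
import math

def rmse(labels, predictions):
    """Root mean squared error between two equal-length sequences."""
    if len(labels) != len(predictions):
        raise ValueError("labels and predictions must have the same length")
    squared_errors = [(y - p) ** 2 for y, p in zip(labels, predictions)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Made-up values for illustration only
example_labels = [3.0, 5.0, 2.0, 7.0]
example_predictions = [2.0, 5.0, 4.0, 7.0]

# Squared errors are 1, 0, 4, 0, so the result is sqrt(5 / 4), roughly 1.118
print(rmse(example_labels, example_predictions))
```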
Define the type of cross-validation you want to perform
# Create 5-fold CrossValidator
rfcv = CrossValidator(estimator=rf,
                      estimatorParamMaps=rfparamGrid,
                      evaluator=rfevaluator,
                      numFolds=5)
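Conceptually, 5-fold cross-validation splits the training rows into five parts and, for each parameter combination, trains on four parts and validates on the fifth, rotating until every part has served as the validation set. The plain-Python sketch below illustrates that rotation on row indices. It is not how Spark performs the split (Spark assigns rows to folds randomly by default), and `k_fold_indices` is a hypothetical helper.

```python
def k_fold_indices(n_rows, n_folds):
    """Yield (train_indices, validation_indices) for each fold."""
    indices = list(range(n_rows))
    for fold in range(n_folds):
        # Every n_folds-th row, starting at offset `fold`, is held out
        validation = indices[fold::n_folds]
        held_out = set(validation)
        train = [i for i in indices if i not in held_out]
        yield train, validation

folds = list(k_fold_indices(n_rows=10, n_folds=5))
for train_idx, val_idx in folds:
    print("train:", train_idx, "validation:", val_idx)

# The grid above has 27 parameter combinations. With 5 folds that is
# 27 * 5 = 135 model fits, plus one final refit of the best combination
# on the full training set.
total_fits = 27 * 5 + 1
print(total_fits)  # 136
```

This is why the commented-out, wider grids are costly: every extra grid value multiplies the number of random forests that have to be trained.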
Fit the model to the data
rfcvModel = rfcv.fit(train)
print(rfcvModel)
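Printing the fitted model says little about which grid point won. In PySpark the cross-validated averages are exposed as `rfcvModel.avgMetrics`, in the same order as the parameter maps returned by `rfcv.getEstimatorParamMaps()`. The helper below is a hypothetical plain-Python sketch of pairing the two and picking the best entry; because RMSE is an error measure, the lowest value wins. The example maps and scores are made up.

```python
def best_param_map(param_maps, avg_metrics, larger_is_better=False):
    """Return the (param_map, metric) pair with the best average metric."""
    if len(param_maps) != len(avg_metrics):
        raise ValueError("param_maps and avg_metrics must have the same length")
    pick = max if larger_is_better else min
    best_index = pick(range(len(avg_metrics)), key=avg_metrics.__getitem__)
    return param_maps[best_index], avg_metrics[best_index]

# Made-up stand-ins for the parameter maps and their average RMSE values
example_maps = [{"maxDepth": 2}, {"maxDepth": 5}, {"maxDepth": 10}]
example_rmse = [4.2, 3.1, 3.4]

best_params, best_score = best_param_map(example_maps, example_rmse)
print(best_params, best_score)  # {'maxDepth': 5} 3.1
```

With the objects defined above, a call might look like `best_param_map(rfcv.getEstimatorParamMaps(), rfcvModel.avgMetrics)`. The model refit with the winning settings is available as `rfcvModel.bestModel`.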
Score the test dataset with the fitted model so it can be evaluated
rfpredictions = rfcvModel.transform(test)
Evaluate the model
print('RMSE:', rfevaluator.evaluate(rfpredictions))