Gradient-Boosted Trees
Setting Up a Gradient-Boosted Tree Classifier
Load in required libraries
from pyspark.ml.classification import GBTClassifier
from pyspark.ml.tuning import ParamGridBuilder, CrossValidator
from pyspark.ml.evaluation import BinaryClassificationEvaluator
Initialize Gradient-Boosted Tree object
gb = GBTClassifier(labelCol="label", featuresCol="features")
Create a parameter grid for tuning the model
gbparamGrid = (ParamGridBuilder()
               .addGrid(gb.maxDepth, [2, 5, 10])
               .addGrid(gb.maxBins, [10, 20, 40])
               .addGrid(gb.maxIter, [5, 10, 20])
               .build())
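ParamGridBuilder.build() returns one parameter map per combination of the listed values, so the grid above is the Cartesian product of three lists of three values each. The sketch below is plain Python, not Spark: it reproduces that expansion with itertools.product to show how quickly the grid grows. The dictionary keys are plain strings chosen for illustration, not Spark Param objects.

```python
from itertools import product

# Candidate values copied from the grid above; the keys are plain strings
# used for illustration, not Spark Param objects.
candidates = {
    "maxDepth": [2, 5, 10],
    "maxBins": [10, 20, 40],
    "maxIter": [5, 10, 20],
}

# Cartesian product: one dict per combination of values.
names = list(candidates)
grid = [dict(zip(names, values)) for values in product(*candidates.values())]

print(len(grid))  # 27 combinations
print(grid[0])    # {'maxDepth': 2, 'maxBins': 10, 'maxIter': 5}
```

Adding a fourth value to any one list would raise the count from 27 to 36, so it is worth keeping each list short when training is slow.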
Define how you want the model to be evaluated
gbevaluator = BinaryClassificationEvaluator(rawPredictionCol="rawPrediction")
Define the type of cross-validation you want to perform
# Create 5-fold CrossValidator
gbcv = CrossValidator(estimator=gb,
                      estimatorParamMaps=gbparamGrid,
                      evaluator=gbevaluator,
                      numFolds=5)
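With k folds, every parameter combination is trained k times, each time holding out one fold for validation, and CrossValidator then refits the best combination on the full training set. The sketch below is plain Python, not Spark: it assigns rows to folds by bucketing a uniform random draw (an assumption about the splitting strategy, made for illustration) and counts the model fits for a grid of 27 combinations. The helper names assign_folds and count_fits are hypothetical.

```python
import random


def assign_folds(n_rows, num_folds, seed=42):
    """Give each row a fold index by bucketing a uniform random draw."""
    rng = random.Random(seed)
    return [int(rng.random() * num_folds) for _ in range(n_rows)]


def count_fits(num_param_maps, num_folds):
    """One fit per (fold, parameter map) pair, plus a final refit of the best map."""
    return num_param_maps * num_folds + 1


folds = assign_folds(n_rows=1000, num_folds=5)
for k in range(5):
    validation = sum(1 for f in folds if f == k)
    training = len(folds) - validation
    print(f"fold {k}: train={training} validation={validation}")

print("total fits:", count_fits(num_param_maps=27, num_folds=5))  # total fits: 136
```

Because the folds come from random draws, their sizes are only roughly equal. The fit count is the main cost driver: 136 boosted-tree fits can take a long time on a large training set.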
Fit the model to the data
gbcvModel = gbcv.fit(train)
print(gbcvModel)
Score the testing dataset using your fitted model for evaluation purposes
gbpredictions = gbcvModel.transform(test)
Evaluate the model
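# BinaryClassificationEvaluator reports areaUnderROC by default, not accuracy
print('AUC:', gbevaluator.evaluate(gbpredictions))
print('PR:', gbevaluator.evaluate(gbpredictions, {gbevaluator.metricName: 'areaUnderPR'}))
# Accuracy is the share of test rows whose predicted class matches the label
print('Accuracy:', gbpredictions.filter('label = prediction').count() / gbpredictions.count())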
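Accuracy and area under the ROC curve answer different questions. Accuracy counts correct hard class predictions, while AUC measures how well the model's scores rank positives above negatives, which is why it is normally computed from rawPrediction or probability rather than from the 0/1 prediction column. The toy example below is plain Python with made-up labels and scores (not output from the model above) and hypothetical helper names. It computes AUC as the fraction of positive-negative pairs ranked correctly, counting ties as half.

```python
def accuracy(labels, predictions):
    """Share of rows whose hard prediction equals the label."""
    return sum(1 for y, p in zip(labels, predictions) if y == p) / len(labels)


def auc(labels, scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count as half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Made-up toy data for illustration only.
labels = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.6, 0.3, 0.2]
hard = [1 if s >= 0.5 else 0 for s in scores]  # thresholded 0/1 predictions

print("accuracy:", round(accuracy(labels, hard), 3))           # 0.667
print("AUC from scores:", round(auc(labels, scores), 3))       # 0.889
print("AUC from hard predictions:", round(auc(labels, hard), 3))  # 0.667
```

Thresholding throws away the ranking information inside each class, so the AUC computed from hard predictions is lower here than the AUC computed from the continuous scores.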