About Spark MLlib

MLlib is Apache Spark's scalable machine learning library.

MLlib fits into Spark's APIs and interoperates with NumPy in Python and with R libraries. Because Spark excels at iterative computation, MLlib's algorithms, many of which make repeated passes over the data, run fast and scale well.
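To make "iterative computation" concrete, here is a minimal single-machine sketch in plain Python. It is not MLlib's implementation; the function name `fit_logistic`, the toy data, and the learning rate are assumptions chosen for illustration. It fits a one-feature logistic regression by batch gradient descent, and every pass re-reads the whole dataset. Spark can keep that dataset cached in memory between passes, which is why this access pattern suits it.

```python
import math

def fit_logistic(points, labels, passes=200, learning_rate=0.5):
    """Batch gradient descent for a one-feature logistic regression (illustrative only)."""
    weight, bias = 0.0, 0.0
    n = len(points)
    for _ in range(passes):  # each pass scans the full dataset again
        grad_w = 0.0
        grad_b = 0.0
        for x, y in zip(points, labels):
            p = 1.0 / (1.0 + math.exp(-(weight * x + bias)))  # predicted probability
            grad_w += (p - y) * x
            grad_b += (p - y)
        # Step against the average gradient of the log loss.
        weight -= learning_rate * grad_w / n
        bias -= learning_rate * grad_b / n
    return weight, bias

def predict(weight, bias, x):
    """Probability of the positive class for input x."""
    return 1.0 / (1.0 + math.exp(-(weight * x + bias)))

# Made-up toy data: small inputs are class 0, large inputs are class 1.
points = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
labels = [0, 0, 0, 1, 1, 1]
weight, bias = fit_logistic(points, labels)
```

In MLlib the same idea is distributed: each pass computes partial gradients on the partitions of a cached dataset and combines them, rather than looping over a Python list.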

Included Functionality:

ML algorithms include:

  • Classification: logistic regression, naive Bayes, ...

  • Regression: generalized linear regression, survival regression, ...

  • Decision trees, random forests, and gradient-boosted trees

  • Recommendation: alternating least squares (ALS)

  • Clustering: K-means, Gaussian mixture models (GMMs), ...

  • Topic modeling: latent Dirichlet allocation (LDA)

  • Frequent itemsets, association rules, and sequential pattern mining

ML workflow utilities include:

  • Feature transformations: standardization, normalization, hashing, ...

  • ML Pipeline construction

  • Model evaluation and hyperparameter tuning

  • ML persistence: saving and loading models and Pipelines

Other utilities include:

  • Distributed linear algebra: SVD, PCA, ...

  • Statistics: summary statistics, hypothesis testing, ...

Resources

  • Spark MLlib Website

  • Getting Started Guide

Last updated 5 years ago