Sparkitecture

Welcome to Sparkitecture!

Created by: Colby T. Ford, Ph.D.

PySpark Edition | A work in progress... | Created using GitBook.com

About

Sparkitecture is a collection of “cookbook-style” scripts for simplifying data engineering and machine learning in Apache Spark.

This is an open source project (GPL v3.0) for the Spark community. If you have ideas or contributions you'd like to add, submit a feature request, or write your code/tutorial/page and create a pull request in the GitHub repo.

How to Cite

Cloud Service Integration

Azure Storage

Azure Storage is a managed service in Azure that provides highly available, secure, durable, scalable, and redundant storage for your data. It includes Blobs, Data Lake Store, and other services.

Databricks-Specific Functionality

Mounting Blob Storage

Once you create your blob storage account in Azure, you will need to grab a couple bits of information from the Azure Portal before you mount your storage.

You can find your Storage Account Name and your Key under Access Keys in your Storage Account resource in Azure; both go into the mount code.

Go into your Storage Account resource in Azure and click on Blobs. Here, you will find all of your containers. Pick the one you want to mount and copy its name for use in the mount code.

As for the mount point (/mnt/<FOLDERNAME> below), you can name this whatever you'd like, but it will help you in the long run to name it something useful along the lines of storageaccount_container.

Once you have the required bits of information, you can mount the storage location inside the Databricks environment with dbutils.fs.mount.
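The original mount snippet did not survive in this copy, so the following is a minimal sketch rather than the author's code. It assumes account-key authentication over `wasbs://`; the helper name `build_blob_mount_args` and its parameter names are hypothetical, while `dbutils.fs.mount` with `source`, `mount_point`, and `extra_configs` is the Databricks utility call.

```python
def build_blob_mount_args(storage_account, container, key, folder):
    """Assemble keyword arguments for dbutils.fs.mount (Blob Storage, account-key auth)."""
    host = f"{storage_account}.blob.core.windows.net"
    return {
        "source": f"wasbs://{container}@{host}",
        "mount_point": f"/mnt/{folder}",
        # The account key is passed as a Hadoop configuration entry for this account.
        "extra_configs": {f"fs.azure.account.key.{host}": key},
    }


# In a Databricks notebook, where `dbutils` is predefined, the mount would then be:
# dbutils.fs.mount(**build_blob_mount_args(
#     "<STORAGE_ACCOUNT_NAME>", "<CONTAINER_NAME>", "<KEY>", "<FOLDERNAME>"))
```

Keeping the argument assembly in a plain function makes it easy to check the resulting URI and mount point before anything is actually mounted.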

You can then confirm that the mount worked by listing the files in your mounted location.
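A small sketch of that check, assuming the mount point follows the `/mnt/<FOLDERNAME>` pattern used above. The function name `list_mounted_files` is hypothetical; `dbutils.fs.ls` is the Databricks call that returns file entries with a `path` attribute.

```python
def list_mounted_files(dbutils, folder):
    """Return the paths found directly under /mnt/<folder>."""
    return [info.path for info in dbutils.fs.ls(f"/mnt/{folder}")]


# In a Databricks notebook:
# list_mounted_files(dbutils, "<FOLDERNAME>")
# or, for a rendered table: display(dbutils.fs.ls("/mnt/<FOLDERNAME>"))
```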

Resources:

To learn how to create an Azure Storage service, visit

Mounting Data Lake Storage

For finer-grained access controls on your data, you may opt to use Azure Data Lake Storage. In Databricks, you can connect to your data lake in a similar manner to blob storage. Instead of an access key, your user credentials are passed through, so you only see data that you specifically have access to.

Pass-through Azure Active Directory Credentials

To pass in your Azure Active Directory credentials from Databricks to Azure Data Lake Store, you will need to enable this feature in Databricks under New Cluster > Advanced Options.

Note: If you create a High Concurrency cluster, multiple users can use the same cluster. The Standard cluster mode will only allow a single user's credential at a time.
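With passthrough enabled on the cluster, one way to use it is to read from the lake directly by URI, with no key in the notebook. The sketch below only builds that URI. The source does not say whether Gen1 or Gen2 storage is meant, so both URI schemes are covered as an assumption, and the helper name `adls_uri` and its parameters are hypothetical.

```python
def adls_uri(storage_account, path, container=None):
    """Build a Data Lake URI: abfss:// (Gen2) when a container is given, else adl:// (Gen1)."""
    path = path.lstrip("/")
    if container is not None:
        return f"abfss://{container}@{storage_account}.dfs.core.windows.net/{path}"
    return f"adl://{storage_account}.azuredatalakestore.net/{path}"


# On a passthrough-enabled cluster, a read would then look like:
# df = spark.read.format("csv").option("header", "true") \
#           .load(adls_uri("<STORAGE_ACCOUNT_NAME>", "<PATH>", "<CONTAINER_NAME>"))
```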

Azure SQL Data Warehouse / Synapse

Set up Azure SQL DW connection parameters

Define a query
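The two steps above can be sketched as follows. This is not the original snippet: it assumes the Databricks Synapse connector (`com.databricks.spark.sqldw`) with a Blob Storage staging directory, and the function names, parameter names, and placeholder values are hypothetical.

```python
def sqldw_jdbc_url(server, database, user, password):
    """Build the JDBC connection string for an Azure SQL DW / Synapse SQL pool."""
    return (
        f"jdbc:sqlserver://{server}.database.windows.net:1433;"
        f"database={database};user={user}@{server};password={password};"
        "encrypt=true;trustServerCertificate=false;"
        "hostNameInCertificate=*.database.windows.net;loginTimeout=30;"
    )


def read_sqldw_query(spark, jdbc_url, temp_dir, query):
    """Run `query` in the warehouse and return the result as a Spark DataFrame."""
    return (
        spark.read.format("com.databricks.spark.sqldw")
        .option("url", jdbc_url)
        # Staging location in Blob Storage used to move data between the two services.
        .option("tempDir", temp_dir)
        .option("forwardSparkAzureStorageCredentials", "true")
        .option("query", query)
        .load()
    )


# Example call in a notebook (all values are placeholders):
# url = sqldw_jdbc_url("<SERVER>", "<DATABASE>", "<USER>", "<PASSWORD>")
# df = read_sqldw_query(spark, url,
#                       "wasbs://<CONTAINER_NAME>@<STORAGE_ACCOUNT_NAME>.blob.core.windows.net/<TEMP_DIR>",
#                       "SELECT * FROM <TABLE>")
```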

Azure Data Factory

Transformation with Azure Databricks

Using Azure Databricks with Azure Data Factory, notebooks can be run from an end-to-end pipeline that contains the Validation, Copy data, and Notebook activities in Azure Data Factory.

Data Preparation

Reading and Writing Data

Reading in Data

...from Mounted Storage

# Read a CSV file from mounted storage, using the first row as the header
# and letting Spark infer the column types.
dataset = spark.read.format('csv') \
                    .options(header='true', inferSchema='true', delimiter=',') \
                    .load('/mnt/<FOLDERNAME>/<FILENAME>.csv')

## Other formats: json, parquet, jdbc, orc, libsvm, text, avro

...when Schema Inference Fails
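When inference picks the wrong types, the schema can be supplied explicitly instead. The sketch below passes a DDL-formatted string to `DataFrameReader.schema`, which also accepts a `StructType`. The function name and the example column names are hypothetical and only illustrate the shape.

```python
# Hypothetical columns; replace with the real layout of your file.
EXAMPLE_SCHEMA = "id INT, name STRING, amount DOUBLE, created TIMESTAMP"


def read_csv_with_schema(spark, path, schema_ddl):
    """Read a CSV with an explicit schema instead of inferSchema."""
    return (
        spark.read.format("csv")
        .options(header="true", delimiter=",")
        .schema(schema_ddl)
        .load(path)
    )


# dataset = read_csv_with_schema(spark, "/mnt/<FOLDERNAME>/<FILENAME>.csv", EXAMPLE_SCHEMA)
```

An explicit schema also avoids the extra pass over the data that inference needs.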

Writing out Data
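The original write snippet is missing here, so this is a minimal sketch of writing a DataFrame back to mounted storage. The function name, the default format, and the default save mode are assumptions for illustration; `format`, `mode`, `options`, and `save` are standard `DataFrameWriter` methods.

```python
def write_dataset(df, path, fmt="parquet", mode="overwrite", **options):
    """Write `df` to `path` in the given format, replacing existing output by default."""
    df.write.format(fmt).mode(mode).options(**options).save(path)


# Parquet (keeps the schema with the data):
# write_dataset(dataset, "/mnt/<FOLDERNAME>/<FILENAME>.parquet")
# CSV with a header row:
# write_dataset(dataset, "/mnt/<FOLDERNAME>/<FILENAME>.csv", fmt="csv", header="true")
```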

Other Resources

Apache Spark Data Sources Documentation:

Machine Learning

About Spark MLlib

MLlib is Apache Spark's scalable machine learning library.

MLlib fits into Spark's APIs and interoperates with NumPy in Python and with R libraries. Because Spark excels at iterative computation, MLlib's algorithms, many of which rely on iteration, run quickly and scale well.

Included Functionality:

ML algorithms include:

Classification: logistic regression, naive Bayes,...
Regression: generalized linear regression, survival regression,...
Decision trees, random forests, and gradient-boosted trees

ML workflow utilities include:

Feature transformations: standardization, normalization, hashing,...
ML Pipeline construction
Model evaluation and hyper-parameter tuning

Other utilities include:

Distributed linear algebra: SVD, PCA,...
Statistics: summary statistics, hypothesis testing,...

Resources

Classification

Description:

Classification algorithms identify which class each observation should fall into. The problem can be seen as part of pattern recognition: we use training data (historical information) to learn patterns that predict how new data should be categorized.

Common Use Cases:

Fraudulent activity detection
Loan default prediction
Spam vs. ham

Classification Algorithms included in MLlib:

Logistic regression (both binomial and multiclass)

Logistic Regression

Setting Up a Logistic Regression Classifier

Note: Make sure your training and test data are already vectorized before you fit the machine learning model.
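As a rough illustration of what "vectorized" means here, the sketch below assembles numeric columns into a single `features` vector with `VectorAssembler` and then splits the result. The function name, the 70/30 split, and the seed are assumptions; the input is assumed to already carry a numeric `label` column.

```python
def vectorize_and_split(df, feature_cols, train_fraction=0.7, seed=42):
    """Assemble `feature_cols` into a 'features' vector column and split into train/test."""
    # Imported here so the helper only needs pyspark when it is called.
    from pyspark.ml.feature import VectorAssembler

    assembler = VectorAssembler(inputCols=feature_cols, outputCol="features")
    vectorized = assembler.transform(df)
    train, test = vectorized.randomSplit([train_fraction, 1.0 - train_fraction], seed=seed)
    return train, test


# train, test = vectorize_and_split(dataset, ["<FEATURE_COL_1>", "<FEATURE_COL_2>"])
```

Categorical string columns would need indexing or encoding before assembly; that step is outside this sketch.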

Naïve Bayes

Setting Up a Naïve Bayes Classifier

Note: Make sure your training and test data are already vectorized before you fit the machine learning model.

Load in required libraries

Initialize Naïve Bayes object

Create a parameter grid for tuning the model

Define how you want the model to be evaluated

Define the type of cross-validation you want to perform

Fit the model to the data

Score the testing dataset using your fitted model for evaluation purposes

Evaluate the model

Note: When you use the CrossValidator function to set up cross-validation of your models, the resulting model object will have all the runs included, but will only use the best model when you interact with the model object using other functions like evaluate or transform.
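The steps above can be sketched as one function. This is an illustrative reconstruction, not the original code: the column names `label` and `features`, the smoothing grid, the accuracy metric, and the function name are assumptions. Note that Spark's default (multinomial) Naïve Bayes expects non-negative feature values.

```python
SMOOTHING_VALUES = [0.0, 0.5, 1.0, 2.0]  # illustrative grid, not from the original


def fit_naive_bayes_cv(train, test, num_folds=5):
    """Tune Naive Bayes with cross-validation; return (model, predictions, accuracy)."""
    # Load in required libraries (imported here so pyspark is only needed when called)
    from pyspark.ml.classification import NaiveBayes
    from pyspark.ml.evaluation import MulticlassClassificationEvaluator
    from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

    # Initialize Naive Bayes object
    nb = NaiveBayes(labelCol="label", featuresCol="features")

    # Create a parameter grid for tuning the model
    param_grid = ParamGridBuilder().addGrid(nb.smoothing, SMOOTHING_VALUES).build()

    # Define how you want the model to be evaluated
    evaluator = MulticlassClassificationEvaluator(
        labelCol="label", predictionCol="prediction", metricName="accuracy")

    # Define the type of cross-validation you want to perform
    cv = CrossValidator(estimator=nb, estimatorParamMaps=param_grid,
                        evaluator=evaluator, numFolds=num_folds)

    # Fit the model to the data
    cv_model = cv.fit(train)

    # Score the testing dataset, then evaluate the predictions
    predictions = cv_model.transform(test)
    return cv_model, predictions, evaluator.evaluate(predictions)
```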

Decision Tree

Setting Up a Decision Tree Classifier

Note: Make sure your training and test data are already vectorized before you fit the machine learning model.

Random Forest

Setting Up a Random Forest Classifier

Note: Make sure your training and test data are already vectorized before you fit the machine learning model.

Load in required libraries

Initialize Random Forest object

Create a parameter grid for tuning the model

Define how you want the model to be evaluated

Define the type of cross-validation you want to perform

Fit the model to the data

Score the testing dataset using your fitted model for evaluation purposes

Evaluate the model

Note: When you use the CrossValidator function to set up cross-validation of your models, the resulting model object will have all the runs included, but will only use the best model when you interact with the model object using other functions like evaluate or transform.
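A minimal sketch of the Random Forest classifier workflow described above, under stated assumptions: a binary `label` column, a `features` vector column, and an illustrative grid over tree count and depth. The function name and grid values are hypothetical.

```python
NUM_TREES = [20, 50]   # illustrative grid values
MAX_DEPTH = [5, 10]


def fit_random_forest_classifier_cv(train, test, num_folds=5):
    """Tune a Random Forest classifier; return (model, predictions, area under ROC)."""
    # Load in required libraries (imported here so pyspark is only needed when called)
    from pyspark.ml.classification import RandomForestClassifier
    from pyspark.ml.evaluation import BinaryClassificationEvaluator
    from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

    # Initialize Random Forest object
    rf = RandomForestClassifier(labelCol="label", featuresCol="features")

    # Create a parameter grid for tuning the model
    param_grid = (ParamGridBuilder()
                  .addGrid(rf.numTrees, NUM_TREES)
                  .addGrid(rf.maxDepth, MAX_DEPTH)
                  .build())

    # Define how you want the model to be evaluated (binary labels assumed)
    evaluator = BinaryClassificationEvaluator(
        labelCol="label", rawPredictionCol="rawPrediction", metricName="areaUnderROC")

    # Define the type of cross-validation you want to perform
    cv = CrossValidator(estimator=rf, estimatorParamMaps=param_grid,
                        evaluator=evaluator, numFolds=num_folds)

    # Fit, score the testing dataset, and evaluate
    cv_model = cv.fit(train)
    predictions = cv_model.transform(test)
    return cv_model, predictions, evaluator.evaluate(predictions)
```

For more than two classes, `MulticlassClassificationEvaluator` would replace the binary evaluator.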

Gradient-Boosted Trees

Setting Up Gradient-Boosted Tree Classifier

Note: Make sure your training and test data are already vectorized before you fit the machine learning model.

Regression

Linear Regression

Setting Up Linear Regression

Note: Make sure your training and test data are already vectorized before you fit the machine learning model.

Load in required libraries

Initialize Linear Regression object

Create a parameter grid for tuning the model

Define how you want the model to be evaluated

Define the type of cross-validation you want to perform

Fit the model to the data

Get model information

Score the testing dataset using your fitted model for evaluation purposes

Evaluate the model

Note: When you use the CrossValidator function to set up cross-validation of your models, the resulting model object will have all the runs included, but will only use the best model when you interact with the model object using other functions like evaluate or transform.
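The linear regression steps above, including the "get model information" step, might look like the sketch below. The column names, grid values, RMSE metric, and function name are assumptions made for illustration.

```python
REG_PARAMS = [0.0, 0.1, 1.0]          # illustrative regularization strengths
ELASTIC_NET_PARAMS = [0.0, 0.5, 1.0]  # 0 = ridge (L2), 1 = lasso (L1)


def fit_linear_regression_cv(train, test, num_folds=5):
    """Tune linear regression with cross-validation; return a dict of results."""
    # Load in required libraries (imported here so pyspark is only needed when called)
    from pyspark.ml.evaluation import RegressionEvaluator
    from pyspark.ml.regression import LinearRegression
    from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

    # Initialize Linear Regression object
    lr = LinearRegression(labelCol="label", featuresCol="features")

    # Create a parameter grid for tuning the model
    param_grid = (ParamGridBuilder()
                  .addGrid(lr.regParam, REG_PARAMS)
                  .addGrid(lr.elasticNetParam, ELASTIC_NET_PARAMS)
                  .build())

    # Define how you want the model to be evaluated
    evaluator = RegressionEvaluator(
        labelCol="label", predictionCol="prediction", metricName="rmse")

    # Define the type of cross-validation you want to perform
    cv = CrossValidator(estimator=lr, estimatorParamMaps=param_grid,
                        evaluator=evaluator, numFolds=num_folds)

    # Fit the model to the data
    cv_model = cv.fit(train)

    # Get model information from the best model found
    best = cv_model.bestModel

    # Score the testing dataset and evaluate
    predictions = cv_model.transform(test)
    return {
        "model": cv_model,
        "coefficients": best.coefficients,
        "intercept": best.intercept,
        "predictions": predictions,
        "rmse": evaluator.evaluate(predictions),
    }
```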

Decision Tree

Setting Up Decision Tree Regression

Note: Make sure your training and test data are already vectorized before you fit the machine learning model.

Load in required libraries

Initialize Decision Tree object

Create a parameter grid for tuning the model

Define how you want the model to be evaluated

Define the type of cross-validation you want to perform

Fit the model to the data

Score the testing dataset using your fitted model for evaluation purposes

Evaluate the model

Note: When you use the CrossValidator function to set up cross-validation of your models, the resulting model object will have all the runs included, but will only use the best model when you interact with the model object using other functions like evaluate or transform.
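A compact sketch of the decision tree regression steps above. The column names, the grid over `maxDepth` and `maxBins`, the RMSE metric, and the function name are assumptions for illustration.

```python
TREE_MAX_DEPTH = [2, 5, 10]  # illustrative grid values
TREE_MAX_BINS = [16, 32]


def fit_decision_tree_regressor_cv(train, test, num_folds=5):
    """Tune a decision tree regressor; return (model, predictions, rmse)."""
    # Load in required libraries (imported here so pyspark is only needed when called)
    from pyspark.ml.evaluation import RegressionEvaluator
    from pyspark.ml.regression import DecisionTreeRegressor
    from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

    # Initialize Decision Tree object
    dt = DecisionTreeRegressor(labelCol="label", featuresCol="features")

    # Create a parameter grid for tuning the model
    param_grid = (ParamGridBuilder()
                  .addGrid(dt.maxDepth, TREE_MAX_DEPTH)
                  .addGrid(dt.maxBins, TREE_MAX_BINS)
                  .build())

    # Define how you want the model to be evaluated
    evaluator = RegressionEvaluator(
        labelCol="label", predictionCol="prediction", metricName="rmse")

    # Define the type of cross-validation you want to perform
    cv = CrossValidator(estimator=dt, estimatorParamMaps=param_grid,
                        evaluator=evaluator, numFolds=num_folds)

    # Fit, score the testing dataset, and evaluate
    cv_model = cv.fit(train)
    predictions = cv_model.transform(test)
    return cv_model, predictions, evaluator.evaluate(predictions)
```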

Random Forest

Setting Up Random Forest Regression

Note: Make sure your training and test data are already vectorized before you fit the machine learning model.

Load in required libraries

Initialize Random Forest object

Create a parameter grid for tuning the model

Define how you want the model to be evaluated

Define the type of cross-validation you want to perform

Fit the model to the data

Score the testing dataset using your fitted model for evaluation purposes

Evaluate the model

Note: When you use the CrossValidator function to set up cross-validation of your models, the resulting model object will have all the runs included, but will only use the best model when you interact with the model object using other functions like evaluate or transform.
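The same pattern applies to Random Forest regression. In this sketch the column names, grid values, the R² metric, and the function name are assumptions; it also returns the best model's feature importances as one example of inspecting the tuned model.

```python
FOREST_NUM_TREES = [20, 50, 100]  # illustrative grid values
FOREST_MAX_DEPTH = [5, 10]


def fit_random_forest_regressor_cv(train, test, num_folds=5):
    """Tune a Random Forest regressor; return (model, importances, predictions, r2)."""
    # Load in required libraries (imported here so pyspark is only needed when called)
    from pyspark.ml.evaluation import RegressionEvaluator
    from pyspark.ml.regression import RandomForestRegressor
    from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

    # Initialize Random Forest object
    rf = RandomForestRegressor(labelCol="label", featuresCol="features")

    # Create a parameter grid for tuning the model
    param_grid = (ParamGridBuilder()
                  .addGrid(rf.numTrees, FOREST_NUM_TREES)
                  .addGrid(rf.maxDepth, FOREST_MAX_DEPTH)
                  .build())

    # Define how you want the model to be evaluated
    evaluator = RegressionEvaluator(
        labelCol="label", predictionCol="prediction", metricName="r2")

    # Define the type of cross-validation you want to perform
    cv = CrossValidator(estimator=rf, estimatorParamMaps=param_grid,
                        evaluator=evaluator, numFolds=num_folds)

    # Fit, score the testing dataset, and evaluate
    cv_model = cv.fit(train)
    predictions = cv_model.transform(test)
    importances = cv_model.bestModel.featureImportances
    return cv_model, importances, predictions, evaluator.evaluate(predictions)
```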

Gradient-Boosted Trees

Setting Up Gradient-Boosted Tree Regression

Note: Make sure your training and test data are already vectorized before you fit the machine learning model.

MLflow

MLflow is an open source library by the Databricks team designed for managing the machine learning lifecycle. It allows for the creation of projects, tracking of metrics, and model versioning.

Install mlflow using pip

MLflow can be used in any Spark environment, but the automated tracking and UI of MLflow are Databricks-specific functionality.
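As a minimal sketch of metric tracking, the helper below records parameters and metrics for one run using MLflow's fluent API (`start_run`, `log_params`, `log_metrics`). The function name and the example keys are hypothetical, and where the run is stored depends on how tracking is configured in your environment.

```python
def log_run(params, metrics, run_name=None):
    """Record one training run's parameters and metrics; return the MLflow run ID."""
    # Requires the mlflow package (installed with: pip install mlflow).
    import mlflow

    with mlflow.start_run(run_name=run_name) as run:
        mlflow.log_params(params)    # e.g. {"numTrees": 50, "maxDepth": 10}
        mlflow.log_metrics(metrics)  # e.g. {"rmse": 1.23}
        return run.info.run_id


# run_id = log_run({"numTrees": 50}, {"rmse": 1.23}, run_name="<RUN_NAME>")
```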

Model Saving and Loading

Model Saving

Save model(s) to mounted storage
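A short sketch of saving the best model from a cross-validated fit to mounted storage, assuming `cv_model` is a fitted `CrossValidatorModel`. The function name and the path layout are hypothetical; `write().overwrite().save(path)` is the standard Spark ML persistence call.

```python
def save_best_model(cv_model, path):
    """Persist the best model found by cross-validation, replacing any earlier save."""
    cv_model.bestModel.write().overwrite().save(path)


# save_best_model(cv_model, "/mnt/<FOLDERNAME>/models/<MODELNAME>")
#
# Loading uses the model class that matches what was saved, for example:
# from pyspark.ml.classification import RandomForestClassificationModel
# model = RandomForestClassificationModel.load("/mnt/<FOLDERNAME>/models/<MODELNAME>")
```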

Streaming Data

Structured Streaming

Read in Streaming Data

Reading JSON files from storage

from pyspark.sql.types import StructType, StructField, TimestampType, IntegerType, StringType

inputPath = "/mnt/data/jsonfiles/"

# Define your schema if it's known (rather than relying on Spark to infer the schema)
jsonSchema = StructType([StructField("time", TimestampType(), True),
                         StructField("id", IntegerType(), True),
                         StructField("value", StringType(), True)])

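# Treat a sequence of files as a stream by picking one file at a time (maxFilesPerTrigger = 1).
# Note: a comment cannot follow a backslash line continuation, so it is placed here instead.
streamingInputDF = spark.readStream \
                        .schema(jsonSchema) \
                        .option("maxFilesPerTrigger", 1) \
                        .json(inputPath)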
