Bioinformatics and Genomics

Glow

About Glow

Glow is an open-source, independent Spark library that brings additional flexibility and functionality to Azure Databricks. The toolkit is built natively on Apache Spark, so genomics workflows can take advantage of cloud-scale compute.

Glow makes genomic data work with Spark SQL, so you can interact with common genetic data formats as easily as you would with a .csv file.
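
For example, reading a VCF with Glow looks just like reading a CSV with vanilla Spark. A minimal sketch (the paths are placeholders; the VCF reader is covered in more detail under Read in Data below):

# Plain Spark: read tabular data from a CSV file
pheno_df = spark.read.format("csv").option("header", True).load("/data/phenotypes.csv")

# Glow: read variant data from a VCF file into an ordinary DataFrame
vcf_df = spark.read.format("vcf").load("/data/variants.vcf.gz")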

Learn more about Project Glow at projectglow.io.

Read the full documentation at glow.readthedocs.io.

Features:

  • Genomic datasources: Read datasets in common file formats such as VCF, BGEN, and Plink into Spark DataFrames.

  • Genomic functions: Common operations such as computing quality control statistics, running regression tests, and performing simple transformations are provided as Spark functions that can be called from Python, SQL, Scala, or R (see the sketch after this list).

  • Data preparation building blocks: Glow includes transformations such as variant normalization and lift over to help produce analysis-ready datasets.

  • Integration with existing tools: With Spark, you can write user-defined functions (UDFs) in Python, R, SQL, or Scala. Glow also makes it easy to run DataFrames through command line tools.

  • Integration with other data types: Genomic data can generate additional insights when joined with data sets such as electronic health records, real-world evidence, and medical images. Since Glow returns native Spark SQL DataFrames, it's simple to join multiple data sets together.
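
As a minimal sketch of the points above, the same Glow function can be called from Python or from SQL once a DataFrame is registered as a temporary view. Here df is the VCF DataFrame built under Read in Data below, and call_summary_stats is one of Glow's built-in quality-control functions:

from pyspark.sql.functions import expr

# Call the Glow function through a SQL expression from Python ...
stats_py = df.select(expr("call_summary_stats(genotypes) AS stats"))

# ... or register the DataFrame and call the same function from SQL
df.createOrReplaceTempView("variants")
stats_sql = spark.sql("SELECT call_summary_stats(genotypes) AS stats FROM variants")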

How To Install

pip Installation

pip install glow.py
./bin/pyspark --packages io.projectglow:glow_2.11:0.5.0 \
  --conf spark.hadoop.io.compression.codecs=io.projectglow.sql.util.BGZFCodec

Maven Installation

Install the Maven package io.projectglow:glow_2.11:${version} and, optionally, the Python frontend glow.py. Set the Spark configuration spark.hadoop.io.compression.codecs to io.projectglow.sql.util.BGZFCodec in order to read and write BGZF-compressed files.
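
If you build the session yourself instead of passing flags to pyspark, the same package coordinate and codec setting can be supplied on the SparkSession builder. A minimal sketch, using the same version as the pip example above (adjust the coordinate to match your Spark and Scala versions):

from pyspark.sql import SparkSession

# Pull in the Glow package and enable the BGZF codec for compressed VCFs
spark = (
    SparkSession.builder
    .config("spark.jars.packages", "io.projectglow:glow_2.11:0.5.0")
    .config("spark.hadoop.io.compression.codecs", "io.projectglow.sql.util.BGZFCodec")
    .getOrCreate()
)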

Load in Glow

import glow

# Register Glow's functions and transformers with the active Spark session
glow.register(spark)

Read in Data

from pyspark.sql.functions import expr

vcf_path = "/databricks-datasets/genomics/1kg-vcfs/ALL.chr22.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz"

# Read a block-gzipped VCF into a DataFrame and pull out the first genotype
# struct for convenient inspection
df = spark.read.format("vcf") \
          .option("includeSampleIds", False) \
          .option("flattenInfoFields", False) \
          .load(vcf_path) \
          .withColumn("first_genotype", expr("genotypes[0]"))

# The BGEN reader works the same way:
# bgen_path = "/databricks-datasets/genomics/1kg-bgens/1kg_chr22.bgen"
# df = spark.read.format("bgen") \
#           .load(bgen_path)

Summary Statistics and Quality Control

df = df.withColumn("hardyweinberg", expr("hardy_weinberg(genotypes)")) \
       .withColumn("summarystats", expr("call_summary_stats(genotypes)")) \
       .withColumn("depthstats", expr("dp_summary_stats(genotypes)")) \
       .withColumn("genotypequalitystats", expr("gq_summary_stats(genotypes)")) \
       .filter(col("qual") >= 98) \
       .filter((col("start") >= 16000000) & (col("end") >= 16050000)) \
       .where((col("alleleFrequencies").getItem(0) >= allele_freq_cutoff) & 
              (col("alleleFrequencies").getItem(0) <= (1.0 - allele_freq_cutoff))) \
       .withColumn("log10pValueHwe", when(col("pValueHwe") == 0, 26).otherwise(-log10(col("pValueHwe"))))

Split Multiallelic Variants to Biallelic

# Expand each multiallelic variant into one biallelic row per alternate allele
split_df = glow.transform("split_multiallelics", df)

Write out Data

# Coalesce to one partition so the output directory contains a single VCF part file
df.coalesce(1) \
  .write \
  .mode("overwrite") \
  .format("vcf") \
  .save("/tmp/vcf_output")

If you're using Databricks, make sure you enable the Databricks Runtime for Genomics. Glow is already included and configured in this runtime.

Using pip, install Glow by running pip install glow.py and then start the Spark shell with the Glow Maven package, as shown under pip Installation above.
