Glow
About Glow
Glow is an open-source, independent Spark library that adds genomics functionality to Azure Databricks. The toolkit is built natively on Apache Spark, so genomics workflows can scale with the cloud.
Glow makes genomic data work with Spark SQL, so you can query common genetic data types as easily as you would a .csv file.
Learn more about Project Glow at projectglow.io.
Read the full documentation: glow.readthedocs.io
Features:
Genomic datasources: Read datasets in common file formats such as VCF, BGEN, and PLINK into Spark DataFrames.
Genomic functions: Common operations such as computing quality control statistics, running regression tests, and performing simple transformations are provided as Spark functions that can be called from Python, SQL, Scala, or R.
Data preparation building blocks: Glow includes transformations such as variant normalization and liftover to help produce analysis-ready datasets.
Integration with existing tools: With Spark, you can write user-defined functions (UDFs) in Python, R, SQL, or Scala. Glow also makes it easy to run DataFrames through command-line tools.
Integration with other data types: Genomic data can generate additional insights when joined with data sets such as electronic health records, real-world evidence, and medical images. Since Glow returns native Spark SQL DataFrames, it's simple to join multiple data sets together.
How To Install
If you're using Databricks, make sure you enable the Databricks Runtime for Genomics. Glow is already included and configured in this runtime.
pip Installation
Using pip, install by running pip install glow.py and then start the Spark shell with the Glow Maven package.
Maven Installation
Install the Maven package io.projectglow:glow_2.11:${version} and optionally the Python frontend glow.py. Set the Spark configuration spark.hadoop.io.compression.codecs to io.projectglow.sql.util.BGZFCodec in order to read and write BGZF-compressed files.
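As a minimal sketch, the two settings described above can be collected into a dictionary and applied to a Spark session builder. The helper name glow_spark_conf is hypothetical, and the group ID io.projectglow is assumed to match the package name of the BGZF codec. The version is left as a parameter because this page does not state one.

```python
def glow_spark_conf(glow_version, scala_version="2.11"):
    """Return the Spark settings needed to pull in Glow and enable BGZF.

    glow_spark_conf is a hypothetical helper, not part of Glow itself.
    """
    return {
        # Maven coordinate in group:artifact:version form (group ID assumed).
        "spark.jars.packages": f"io.projectglow:glow_{scala_version}:{glow_version}",
        # Codec that lets Spark read and write BGZF-compressed files.
        "spark.hadoop.io.compression.codecs": "io.projectglow.sql.util.BGZFCodec",
    }


# Usage outside Databricks (requires pyspark; replace the placeholder version):
# from pyspark.sql import SparkSession
# builder = SparkSession.builder
# for key, value in glow_spark_conf("<version>").items():
#     builder = builder.config(key, value)
# spark = builder.getOrCreate()
```

On the Databricks Runtime for Genomics this step should not be needed, since Glow is already configured there.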
Load in Glow
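A minimal sketch of this step, assuming the glow.py frontend is installed and a Spark session named spark already exists (as it does in a Databricks notebook). The wrapper register_glow is a hypothetical name. It allows for glow.register returning either nothing or a session, since the return value appears to differ between Glow releases.

```python
def register_glow(spark):
    """Register Glow's functions and transformers with a Spark session.

    register_glow is a hypothetical wrapper around glow.register.
    """
    import glow  # imported lazily so the helper can be defined without glow.py

    registered = glow.register(spark)
    # Some releases return the registered session, others return None.
    return registered if registered is not None else spark


# Usage in a notebook where `spark` is predefined:
# spark = register_glow(spark)
```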
Read in Data
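Reading a VCF file could look like the sketch below, which uses the vcf datasource named in the feature list. The helper name read_vcf and the example path are hypothetical, and spark is assumed to be a session with Glow available.

```python
def read_vcf(spark, path):
    """Load a VCF file into a Spark DataFrame through the Glow datasource.

    read_vcf is a hypothetical helper; `path` may point to DBFS or cloud storage.
    """
    return spark.read.format("vcf").load(path)


# Usage (hypothetical path):
# df = read_vcf(spark, "/path/to/variants.vcf")
# df.printSchema()
```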
Summary Statistics and Quality Control
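One way to compute per-variant quality control statistics is through Glow's SQL functions, sketched below. It assumes the DataFrame came from the VCF reader (so it has contigName, start, and genotypes columns) and that Glow has been registered with the session. The helper name add_variant_qc is hypothetical.

```python
def add_variant_qc(df):
    """Append call summary statistics and Hardy-Weinberg results per variant.

    add_variant_qc is a hypothetical helper. The SQL functions it references
    are only resolvable after Glow has been registered with the session.
    """
    return df.selectExpr(
        "contigName",
        "start",
        # expand_struct turns each field of the returned struct into a column.
        "expand_struct(call_summary_stats(genotypes))",
        "expand_struct(hardy_weinberg(genotypes))",
    )


# Usage:
# qc = add_variant_qc(df)
# qc.show(5)
```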
Split Multiallelic Variants to Biallelic
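A hedged sketch of this step, assuming a Glow release that ships the split_multiallelics transformer and that glow.py is installed. The wrapper name split_to_biallelic is hypothetical.

```python
def split_to_biallelic(df):
    """Split multiallelic variant rows into biallelic rows.

    split_to_biallelic is a hypothetical wrapper around glow.transform.
    """
    import glow  # imported lazily so the helper can be defined without glow.py

    return glow.transform("split_multiallelics", df)


# Usage:
# biallelic_df = split_to_biallelic(df)
```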
Write out Data
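Writing results could be sketched as follows. The helper name write_variants and the example path are hypothetical. The default format bigvcf is assumed here to produce a single VCF file. Because Glow returns native Spark DataFrames, any other Spark format available in your environment can be passed instead.

```python
def write_variants(df, path, fmt="bigvcf"):
    """Write a variant DataFrame to `path` using the given Spark format.

    write_variants is a hypothetical helper around the DataFrame writer.
    """
    df.write.format(fmt).save(path)


# Usage (hypothetical path):
# write_variants(biallelic_df, "/path/to/output.vcf")
```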