Reading and Writing Data

Reading in Data

...from Mounted Storage

dataset = sqlContext.read.format('csv') \
    .options(header='true', inferSchema='true', delimiter=',') \
    .load('/mnt/<FOLDERNAME>/<FILENAME>.csv')

## or spark.read.format('csv')...
## Formats: json, parquet, jdbc, orc, libsvm, csv, text, avro

...when Schema Inference Fails

from pyspark.sql.types import *

schema = StructType([StructField('ID', IntegerType(), True),
                     StructField('Value', DoubleType(), True),
                     StructField('Category', StringType(), True),
                     StructField('Date', DateType(), True)])

dataset = sqlContext.read.format('csv') \
    .schema(schema) \
    .options(header='true', delimiter=',') \
    .load('/mnt/<FOLDERNAME>/<FILENAME>.csv')

Writing out Data

## coalesce(1) collapses the data to a single partition, so save() produces
## one part file -- but note that "file.csv" is created as a DIRECTORY
## containing that part file, not as a single flat file
df.coalesce(1) \
    .write.format("com.databricks.spark.csv") \
    .option("header", "true") \
    .save("file.csv")

## or, on Spark 2.x+, the built-in writer: df.write.format("csv")...
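Because `save()` writes a directory of part files, a common follow-up is to pull the single part file out under the desired name. A sketch of that step; the `mkdir`/`echo` lines below only simulate what Spark would have written, and all paths are hypothetical:

```shell
OUT_DIR=file.csv   # the "file" Spark created is actually a directory

# Simulated Spark output (stand-in for df.coalesce(1).write...save("file.csv"))
mkdir -p "$OUT_DIR"
echo "id,val" > "$OUT_DIR/part-00000-abc.csv"

# coalesce(1) guarantees exactly one part-*.csv; move it out and clean up
PART=$(ls "$OUT_DIR"/part-*.csv | head -n 1)
mv "$PART" single_file.csv
rm -r "$OUT_DIR"
```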

Other Resources

Apache Spark Data Sources Documentation: https://spark.apache.org/docs/latest/sql-data-sources.html