PySpark - reading data from files
Context
This note is scoped to files in Azure Data Lake Storage Gen2, accessed from Synapse Analytics.
The storage account must be linked to the Synapse Analytics workspace, with the relevant permissions set (see docs).
In Synapse Analytics Spark notebooks a SparkSession is created automatically and stored in a variable called spark. There is also a variable for the SparkContext, called sc. Users can access these variables directly but should not change their values (docs).
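As a quick check that the session is available, you can inspect these variables directly; a minimal sketch:

# spark and sc are pre-created in Synapse notebooks; no SparkSession.builder call is needed.
print(spark.version)         # Spark version of the pool
print(sc.applicationId)      # ID of the underlying Spark application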
Reading different formats into a DataFrame
Parquet
df = spark.read.parquet('abfss://container@storageAccount.dfs.core.windows.net/path/file.parquet')
To read all the Parquet files in a folder into one DataFrame, use a wildcard like this:
df = spark.read.parquet('abfss://container@storageAccount.dfs.core.windows.net/path/*.parquet')
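Pointing the reader at the folder itself works too, since Spark treats a directory as a dataset and reads every Parquet file in it; a minimal sketch with the same hypothetical path:

# Equivalent to the wildcard form above.
df = spark.read.parquet('abfss://container@storageAccount.dfs.core.windows.net/path/')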
If the source data is partitioned, for example by createdYear, and you want a single partition, state it in the path, for example:
df = spark.read.parquet('abfss://container@storageAccount.dfs.core.windows.net/pathpart/createdYear=2021/*.parquet')
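Read this way, the partition column (createdYear) is not included in the DataFrame, because partition discovery only looks below the path you give. A minimal sketch, assuming the standard basePath option, that keeps the column:

# basePath tells partition discovery where the table root is,
# so createdYear is kept as a column. Paths are the hypothetical ones from above.
df = (spark.read
    .option("basePath", 'abfss://container@storageAccount.dfs.core.windows.net/pathpart/')
    .parquet('abfss://container@storageAccount.dfs.core.windows.net/pathpart/createdYear=2021/*.parquet'))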
If the source data is partitioned, for example by createdYear, and you want all of the partitions merged into one DataFrame, use wildcards in the path, for example:
df = spark.read.parquet('abfss://container@storageAccount.dfs.core.windows.net/pathpart/createdYear=*/*.parquet')
- Spark By Examples - PySpark Read and Write Parquet File
- pyspark.sql.DataFrameReader.parquet
- Spark SQL Guide - Parquet files
CSV
df = spark.read.csv('abfss://container@storageAccount.dfs.core.windows.net/path/file.csv')
Header row - if the data file has a header row, this needs to be specified:
df = spark.read.option("header", True) \
    .csv('abfss://container@storageAccount.dfs.core.windows.net/path/file.csv')
Delimiter - the default is , but it can be overridden, for example if the file is pipe-delimited:
df = spark.read.option("delimiter", '|') \
    .csv('abfss://container@storageAccount.dfs.core.windows.net/path/file.csv')
Options can be combined, either by chaining calls to .option(...) or by using .options(...):
df = spark.read.options(header='True', delimiter='|') \
    .csv('abfss://container@storageAccount.dfs.core.windows.net/path/file.csv')
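For completeness, the same read expressed with chained .option(...) calls; the path is the same hypothetical one as above:

# Chained .option(...) calls, equivalent to the .options(...) form above.
df = (spark.read
    .option("header", True)
    .option("delimiter", '|')
    .csv('abfss://container@storageAccount.dfs.core.windows.net/path/file.csv'))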
There are many other options; see the references:
- Spark By Examples - PySpark Read CSV file into DataFrame
- pyspark.sql.DataFrameReader.csv
- Spark SQL Guide - CSV files
JSON
If the file contains one JSON object per line (the default, JSON Lines format):
df = spark.read.json('abfss://container@storageAccount.dfs.core.windows.net/path/file.json')
If a JSON document is spread across multiple lines, set the multiline option:
df = spark.read.option("multiline", "true") \
    .json('abfss://container@storageAccount.dfs.core.windows.net/path/file.json')
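JSON data is often nested; after reading, the inferred schema can be inspected and nested fields selected with dot notation. A minimal sketch, where id and address.city are hypothetical field names:

from pyspark.sql.functions import col

df = (spark.read.option("multiline", "true")
    .json('abfss://container@storageAccount.dfs.core.windows.net/path/file.json'))

df.printSchema()    # shows the inferred (possibly nested) schema

# Select a nested field with dot notation (field names are hypothetical).
flat = df.select(col("id"), col("address.city").alias("city"))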
There are many other options; see the references:
- Spark By Examples - PySpark Read JSON file into DataFrame
- pyspark.sql.DataFrameReader.json
- Spark SQL Guide - JSON files