
Pyspark - reading data from files

Context

This note is scoped to files in Azure Data Lake Storage Gen2 accessed from Synapse Analytics.

The storage account must be linked to the Synapse Analytics workspace, and the relevant permissions set (see docs).

In Synapse Analytics Spark notebooks a SparkSession is created automatically and stored in a variable called spark; a SparkContext is likewise available in a variable called sc. Users can access these variables directly but should not change their values (see docs).
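
Both variables can be used immediately in a notebook cell, for example as a quick check (nothing beyond the built-in variables is assumed here):

print(spark.version)      # version of the Spark runtime on the pool
print(sc.applicationId)   # id of the current Spark application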

Reading different formats into a DataFrame

Parquet

df = spark.read.parquet('abfss://container@storageAccount.dfs.core.windows.net/path/file.parquet')
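
After loading, a quick sanity check with standard DataFrame calls confirms what was read:

df.printSchema()   # column names and types taken from the Parquet metadata
df.show(5)         # preview the first five rows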

To read all the Parquet files in a folder into one DataFrame use wildcard like this:

df = spark.read.parquet('abfss://container@storageAccount.dfs.core.windows.net/path/*.parquet')

If the source data is partitioned, for example by createdYear, and you want a single partition, state it in the path:

df = spark.read.parquet('abfss://container@storageAccount.dfs.core.windows.net/pathpart/createdYear=2021/*.parquet')

If you instead want all of the partitions merged into one DataFrame, use wildcards in the path:

df = spark.read.parquet('abfss://container@storageAccount.dfs.core.windows.net/pathpart/createdYear=*/*.parquet')
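
Note that when the path contains wildcards like this, Spark does not necessarily surface createdYear as a column in the resulting DataFrame. The standard file-source option basePath tells Spark where partition discovery should start, so the partition column is retained; a minimal sketch using the same hypothetical paths:

df = spark.read.option("basePath", 'abfss://container@storageAccount.dfs.core.windows.net/pathpart') \
    .parquet('abfss://container@storageAccount.dfs.core.windows.net/pathpart/createdYear=*/*.parquet')

# createdYear is now available as a regular column
df.select('createdYear').distinct().show()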

CSV

df = spark.read.csv('abfss://container@storageAccount.dfs.core.windows.net/path/file.csv')

Header row - if the data file has a header row, this needs to be specified:

df = spark.read.option("header", True) \
    .csv('abfss://container@storageAccount.dfs.core.windows.net/path/file.csv')

Delimiter - the default is a comma, but it can be overridden, for example for pipe-delimited files:

df = spark.read.option("delimiter", '|') \
    .csv('abfss://container@storageAccount.dfs.core.windows.net/path/file.csv')

Options can be combined either by chaining calls to .option(...) or by using .options(...):

df = spark.read.options(header='True', delimiter='|') \
    .csv('abfss://container@storageAccount.dfs.core.windows.net/path/file.csv')

There are many other options; see the references.
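
One worth calling out: by default every CSV column is read as a string. Either set inferSchema to let Spark sample the data, or supply an explicit schema; a sketch with hypothetical column names:

from pyspark.sql.types import StructType, StructField, IntegerType, StringType

# Hypothetical columns - replace with the actual file layout
schema = StructType([
    StructField("id", IntegerType(), True),
    StructField("name", StringType(), True),
])

df = spark.read.options(header='True', delimiter='|') \
    .schema(schema) \
    .csv('abfss://container@storageAccount.dfs.core.windows.net/path/file.csv')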

JSON

If the file contains JSON on a single line:

df = spark.read.json('abfss://container@storageAccount.dfs.core.windows.net/path/file.json')

If the file contains JSON spread across multiple lines, the multiline option is needed:

df = spark.read.option("multiline", "true") \
    .json('abfss://container@storageAccount.dfs.core.windows.net/path/file.json')

There are many other options; see the references.
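
Wildcards work here just as they do for Parquet, so a folder of JSON files can be read into a single DataFrame (same hypothetical paths as above):

df = spark.read.option("multiline", "true") \
    .json('abfss://container@storageAccount.dfs.core.windows.net/path/*.json')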

See also