Azure Data Lake Storage Gen2 (ADLS Gen2) is a highly scalable and secure data lake solution that lets organizations store and analyze vast amounts of data. A common requirement when working with ADLS Gen2 is authenticating applications or services so they can access the data stored within it. One supported authentication method is a Service Principal. In this post, I share the PySpark code for authenticating to Azure ADLS Gen2 using a Service Principal.

# Import required libraries
from pyspark.sql import SparkSession
from pyspark.conf import SparkConf

# Define the ADLS Gen 2 credentials
# Azure tenant ID
tenant_id = "<your_tenant_id>"

# Service principal (client) ID
client_id = "<your_service_principal_id>"

# Fetch the client secret from Azure Key Vault via a Databricks secret scope.
# You can also supply it directly or use whichever method your organization standardizes on.
client_secret = dbutils.secrets.get(scope='databricks-kv', key='sp-secrests')
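
# Alternative (illustrative only): read the secret from an environment variable that
# you have exported on the cluster. The variable name below is a placeholder.
# import os
# client_secret = os.environ["ADLS_SP_CLIENT_SECRET"]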

# ADLS Gen2 Parameters
storage_account_name = "<your-storage-account-name>"
container_name = "<your-container-name>"

# Create (or reuse) the Spark session; on Databricks, getOrCreate() returns the existing session
conf = SparkConf().setAppName("ADLSGen2ServicePrincipalAuth")
spark = SparkSession.builder.config(conf=conf).getOrCreate()

# Set the Spark configuration for OAuth (Service Principal) access to ADLS Gen2
spark.conf.set(
    f"fs.azure.account.auth.type.{storage_account_name}.dfs.core.windows.net",
    "OAuth",
)
spark.conf.set(
    f"fs.azure.account.oauth.provider.type.{storage_account_name}.dfs.core.windows.net",
    "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
)
spark.conf.set(
    f"fs.azure.account.oauth2.client.id.{storage_account_name}.dfs.core.windows.net",
    client_id,
)
spark.conf.set(
    f"fs.azure.account.oauth2.client.secret.{storage_account_name}.dfs.core.windows.net",
    client_secret,
)
spark.conf.set(
    f"fs.azure.account.oauth2.client.endpoint.{storage_account_name}.dfs.core.windows.net",
    f"https://login.microsoftonline.com/{tenant_id}/oauth2/token",
)
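
# Optional sanity check (assumes a Databricks environment and that the Service Principal
# has Storage Blob Data Reader, or equivalent, access): listing the container root
# should succeed once the configuration above is in place.
# dbutils.fs.ls(f"abfss://{container_name}@{storage_account_name}.dfs.core.windows.net/")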

# Define the base input path in ADLS Gen2
landing_file_path = f"abfss://{container_name}@{storage_account_name}.dfs.core.windows.net/POC/TEST/"
print(landing_file_path)

# Define the data slice (sub-folder) to load
slice_path = "F0005"

# Define delimiter
delimiter_input = ","

# Build the full landing location (all CSV files under the slice folder)
landing_location = landing_file_path + slice_path + "/*.csv"
print(landing_location)

# Load CSV files from the landing location
extracted_data = spark.read.csv(
    landing_location,
    header=True,
    inferSchema=True,
    ignoreLeadingWhiteSpace=True,
    ignoreTrailingWhiteSpace=True,
    sep=delimiter_input,
)

# Preview the first 10 rows
extracted_data.show(10)
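
Once the read succeeds, the same abfss:// scheme works for writes. Below is a minimal sketch that writes the loaded DataFrame back to the container as Parquet; the output sub-path (POC/TEST_OUTPUT/) is an illustrative placeholder, not part of the original pipeline.

# Hypothetical output path; adjust it to your own folder structure
output_path = f"abfss://{container_name}@{storage_account_name}.dfs.core.windows.net/POC/TEST_OUTPUT/" + slice_path

# Write the DataFrame back to ADLS Gen2 as Parquet, overwriting any previous run
extracted_data.write.mode("overwrite").parquet(output_path)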

I hope you found this blog post helpful and informative. If you enjoyed reading, please consider subscribing to our blog for more Azure-related content. Don’t hesitate to leave a comment if there’s a specific topic you’d like me to cover in the future. Thank you for your support!