Problem description
How do I write to an Azure file share from an Azure Databricks Spark job?
I configured the Hadoop storage key and value:
spark.sparkContext.hadoopConfiguration.set(
"fs.azure.account.key.STORAGEKEY.file.core.windows.net","SECRETVALUE"
)
val wasbFileShare =
s"wasbs://testfileshare@STORAGEKEY.file.core.windows.net/testPath"
df.coalesce(1).write.mode("overwrite").csv(wasbFileShare)
When I try to save the DataFrame to the Azure file share, I see the following resource-not-found error even though the URI exists:
Exception in thread "main" org.apache.hadoop.fs.azure.AzureException: com.microsoft.azure.storage.StorageException: The requested URI does not represent any resource on the server.
Workaround
Unfortunately, Azure Databricks does not support reading from or writing to Azure file shares.
Data sources supported by Azure Databricks: https://docs.microsoft.com/en-us/azure/databricks/data/data-sources/
I would suggest you provide feedback on:
https://feedback.azure.com/forums/909463-azure-databricks
All the feedback you share in these forums is monitored and reviewed by the Microsoft engineering teams responsible for building Azure.
You can check out the SO thread that addresses a similar issue: Databricks and Azure Files
Below is a code snippet for writing CSV data directly to an Azure blob storage container from an Azure Databricks notebook.
# Configure blob storage account access key globally
spark.conf.set("fs.azure.account.key.chepra.blob.core.windows.net","gv7nVIXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXdlOiA==")
output_container_path = "wasbs://sampledata@chepra.blob.core.windows.net"
output_blob_folder = "%s/wrangled_data_folder" % output_container_path
# write the dataframe as a single file to blob storage
(dataframe
.coalesce(1)
.write
.mode("overwrite")
.option("header","true")
.format("com.databricks.spark.csv")
.save(output_blob_folder))
# Get the name of the wrangled-data CSV file that was just saved to Azure blob storage (it starts with 'part-')
files = dbutils.fs.ls(output_blob_folder)
output_file = [x for x in files if x.name.startswith("part-")]
# Move the wrangled-data CSV file from a sub-folder (wrangled_data_folder) to the root of the blob container
# While simultaneously changing the file name
dbutils.fs.mv(output_file[0].path,"%s/predict-transform-output.csv" % output_container_path)
Steps to connect to an Azure file share from Databricks
First, install the Microsoft Azure Storage File Share client library for Python using pip install in Databricks: https://pypi.org/project/azure-storage-file-share/
After installation, create a storage account. Then you can create a file share from Databricks:
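The install step above can be run directly in a notebook cell; a minimal sketch (the package name is taken from the PyPI link above):

```shell
# In a Databricks notebook cell, install the client library into the session
%pip install azure-storage-file-share
```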
from azure.storage.fileshare import ShareClient

share = ShareClient.from_connection_string(
    conn_str="<connection_string consists of FileEndpoint=myFileEndpoint(https://storageaccountname.file.core.windows.net/);SharedAccessSignature=sasToken>",
    share_name="<file share name that you want to create>",
)
share.create_share()
This code uploads a file to the file share from Databricks:
from azure.storage.fileshare import ShareFileClient

file_client = ShareFileClient.from_connection_string(
    conn_str="<connection_string consists of FileEndpoint=myFileEndpoint(https://storageaccountname.file.core.windows.net/);SharedAccessSignature=sasToken>",
    share_name="<your_fileshare_name>",
    file_path="my_file",
)
with open("./SampleSource.txt", "rb") as source_file:
    file_client.upload_file(source_file)
Refer to this link for more information: https://pypi.org/project/azure-storage-file-share/
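For completeness, the same client library can read the uploaded file back; a hedged sketch assuming the same connection string, share name, and file path as above (`download_file` returns a streaming downloader whose contents you can write to a local file):

```python
from azure.storage.fileshare import ShareFileClient

# Hypothetical connection string and names, matching the upload example above
file_client = ShareFileClient.from_connection_string(
    conn_str="<connection_string consists of FileEndpoint=myFileEndpoint(https://storageaccountname.file.core.windows.net/);SharedAccessSignature=sasToken>",
    share_name="<your_fileshare_name>",
    file_path="my_file",
)

# Download the file share contents into a local file
with open("./SampleDownload.txt", "wb") as dest_file:
    downloader = file_client.download_file()
    downloader.readinto(dest_file)
```

This requires a live storage account and an existing file on the share, so it is a sketch rather than something runnable offline.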