pandas: How to save a DataFrame to a CSV folder in Azure Data Lake using Databricks utilities (dbutils)

y4ekin9u · posted 2022-12-09 · in: Other

I have a DataFrame named data that I am saving as a CSV file to my Data Lake using pandas.to_csv. However, saving the file as CSV takes a lot of time. Can someone tell me how to save a CSV file to the Data Lake using dbutils instead? Also, please confirm whether my code for creating the directory if it does not exist is correct.

d = data.groupby(['Col1', 'Col2'])
for k, Dates in d:
    if not Dates.empty:
        PATH = '/dbfs/mnt/data/../'
        try:
            # Check whether the target directory already exists
            dbutils.fs.ls(PATH)
        except Exception as e:
            # Create the directory only if it is missing
            if 'java.io.FileNotFoundException' in str(e):
                dbutils.fs.mkdirs(PATH)
        Dates.to_csv(PATH + f'{Day}.csv', index=False)

uemypmqf1#

With Spark/dbutils there are only coalesce- and partition-based methods for saving files as CSV, and they create part files with random names. To create files with the required names, we use the pandas to_csv method.
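For example, a minimal sketch of writing a named CSV with pandas through the /dbfs FUSE mount (the sample data is made up; the folder "df" and file name "Sample.csv" match Method 1 below):

import pandas as pd

# Hypothetical example data; replace with your own DataFrame
df = pd.DataFrame({'Col1': [1, 2], 'Col2': ['a', 'b']})

# Make sure the target folder exists (dbutils is available in Databricks notebooks)
dbutils.fs.mkdirs('dbfs:/FileStore/df')

# Writing through the /dbfs FUSE mount keeps the exact file name,
# with no randomly named part files
df.to_csv('/dbfs/FileStore/df/Sample.csv', index=False)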

Method 1

  1. From Azure Databricks home, go to “Upload Data” (under Common Tasks) → “DBFS” → “FileStore”.
  2. I created a folder “df” and saved a data frame “Sample” into CSV. It is important to use coalesce(1) since it saves the data frame as a whole:
Sample.coalesce(1).write.format("com.databricks.spark.csv").option("header","true").save("dbfs:/FileStore/df/Sample.csv")
  3. The “part-00000” file inside that folder is the CSV file.

  4. Download the file to local and rename it if required (or rename it in place with dbutils.fs, as shown in the sketch after this list).

  5. Upload the CSV file manually to the Data Lake storage account.
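As an alternative to the manual download-and-rename in steps 4 and 5, here is a minimal sketch that renames the part file in place with dbutils.fs, reusing the output path from step 2 (the target name "Sample_final.csv" is a hypothetical choice):

# Spark wrote a folder named Sample.csv containing one part file (coalesce(1))
out_dir = "dbfs:/FileStore/df/Sample.csv"

# Locate the part file inside the output folder
part_file = [f.path for f in dbutils.fs.ls(out_dir) if f.name.startswith("part-")][0]

# Copy it out under the desired name, then remove the Spark output folder
dbutils.fs.cp(part_file, "dbfs:/FileStore/df/Sample_final.csv")
dbutils.fs.rm(out_dir, recurse=True)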

Method 2

  1. Read the data from DBFS (the Databricks File System):
df = spark.read.format("csv").option("recursiveFileLookup", "true").option("inferSchema", "true").option("header", "true").load("dbfs:/myfolder/sample/")
df.show()

  2. Configure the storage account access key globally
spark.conf.set("fs.azure.account.key.<storage-account-name>.dfs.core.windows.net", "<<ACCESS KEY>>")
  3. Configure the storage account output folder
output_container_path = "abfss://<<filesystem>>@<<Storage_Name>>.dfs.core.windows.net/<<DirectoryName>>"
output_blob_folder = "%s/CSV_data_folder" % output_container_path
  4. Write the dataframe as a single CSV file to storage (a sketch for renaming the resulting part file follows this list):
(df
    .coalesce(1)
    .write
    .mode("overwrite")
    .option("header", "true")
    .format("com.databricks.spark.csv")
    .save(output_blob_folder))

  5. The uploaded file now appears in the storage account.
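As in Method 1, the write above produces a randomly named part file inside CSV_data_folder. Here is a minimal sketch for copying it out under a fixed name with dbutils.fs, reusing output_blob_folder and output_container_path from step 3 (the target name "CSV_data.csv" is a hypothetical choice):

# Locate the single part file that coalesce(1) produced
part_file = [f.path for f in dbutils.fs.ls(output_blob_folder) if f.name.startswith("part-")][0]

# Copy it out under a predictable name in the same container, then clean up
dbutils.fs.cp(part_file, output_container_path + "/CSV_data.csv")
dbutils.fs.rm(output_blob_folder, recurse=True)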
