PySpark: reading the schema of multiple Parquet files from Azure Blob

Posted by qrjkbowd on 2022-11-01 in Spark

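I want to read multiple Parquet files from Azure Blob Storage through Databricks, but the problem is the schema. If I set inferSchema to True, the schema is taken from the first file that gets read. Is there any way to infer the schema after reading multiple files, or after reading a certain amount of data? We do not want to set mergeSchema to True.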

pyspark

Source: https://stackoverflow.com/questions/73816454/reading-schema-multiple-parquet-files-from-azure-blob

1 answer

krugob8w1#

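"Is there any way to infer the schema after reading a large number of files, or after reading a certain amount of data?"

AFAIK, there may not be any such method.

You can try the approach below; I was able to get the desired result with it.
First, find the path of the Parquet file with the most columns in the file list. Then take the schema of that file and apply it to all the files.
Please go through the following sample demonstration:
These are my Parquet files in Blob storage, where **xyz.parquet** has one more column than the others.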

After mounting the storage, get the list of file paths and find the Parquet file with the most columns.
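If other jobs write to the same folder, the listing may also contain non-Parquet entries (for example marker files such as `_SUCCESS`), and reading those one by one would fail. A minimal sketch of filtering the listing first, assuming each entry exposes `path` and `name` attributes the way the `FileInfo` objects returned by `dbutils.fs.ls` do. The helper name `parquet_paths`, the `Entry` stand-in, and the sample entries other than xyz.parquet are hypothetical:

```python
from collections import namedtuple

# Stand-in for the entries returned by dbutils.fs.ls (assumption: only the
# `path` and `name` attributes are needed here).
Entry = namedtuple("Entry", ["path", "name"])


def parquet_paths(entries):
    """Return the paths of entries whose name looks like a Parquet file or folder."""
    # Folder names may end with "/", so strip it before matching the suffix.
    return [e.path for e in entries if e.name.rstrip("/").endswith(".parquet")]


# Hypothetical listing, for illustration only.
listing = [
    Entry("dbfs:/mnt/blob1/myfiles/abc.parquet", "abc.parquet"),
    Entry("dbfs:/mnt/blob1/myfiles/xyz.parquet", "xyz.parquet"),
    Entry("dbfs:/mnt/blob1/myfiles/_SUCCESS", "_SUCCESS"),
]
print(parquet_paths(listing))
```

On Databricks the same helper could be called as `parquet_paths(dbutils.fs.ls("dbfs:/mnt/blob1/myfiles/"))`.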

Now, get the schema of this Parquet file.

Enforce this custom schema on the multiple Parquet files, so that we get a DataFrame with the required schema.

You can see that we got the extra column; for files that do not have that column, its value will be null.
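One caveat worth checking: when a schema is enforced, columns that exist in a file but not in the schema are left out of the result, so the file with the most columns is only a safe choice if its columns cover every other file. A minimal sketch of that check, assuming the column names of each file have already been collected (for example with `spark.read.parquet(p).columns`). The function name `uncovered_columns` and the sample data are hypothetical:

```python
def uncovered_columns(chosen_columns, columns_by_file):
    """Map each file to the columns it has that the chosen schema lacks."""
    chosen = set(chosen_columns)
    result = {}
    for path, columns in columns_by_file.items():
        extra = [c for c in columns if c not in chosen]
        if extra:
            result[path] = extra
    return result


# Hypothetical column lists, for illustration only.
columns_by_file = {
    "abc.parquet": ["id", "name"],
    "xyz.parquet": ["id", "name", "age"],
    "pqr.parquet": ["id", "city"],
}
print(uncovered_columns(columns_by_file["xyz.parquet"], columns_by_file))
# → {'pqr.parquet': ['city']}
```

An empty result means the chosen file's schema covers all the others; anything else lists the columns that would be dropped.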

My source code:


# List the file paths in the mounted Blob container
fileinfo = dbutils.fs.ls("dbfs:/mnt/blob1/myfiles/")
paths = [f.path for f in fileinfo]
print(paths)

# Find the path of the file with the most columns
col_num = []
for file_path in paths:
    df = spark.read.parquet(file_path)
    col_num.append(len(df.columns))
i = col_num.index(max(col_num))
our_path = paths[i]
print(our_path)

# Take the schema of that file
myschema = spark.read.parquet(our_path).schema
print(myschema)

# Enforce that schema on all files
result_df = spark.read.format("parquet").schema(myschema).load("dbfs:/mnt/blob1/myfiles/")
display(result_df)
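If no single file covers every column, one alternative to `mergeSchema` is to build the union of the columns by hand from a sample of files and pass it to the reader as a DDL string, which `DataFrameReader.schema` accepts. A minimal sketch, assuming each file's columns are available as `(name, type)` pairs in the form returned by `DataFrame.dtypes`. The helper name `union_ddl`, the sample data, and the first-type-wins rule for conflicts are assumptions for illustration and not part of the original answer:

```python
def union_ddl(dtypes_per_file):
    """Build a DDL schema string covering every column seen in the inputs.

    Columns keep the order in which they first appear. If the same column
    shows up with different types, the first type seen is kept (an
    assumption; real data may need an explicit rule).
    """
    merged = {}
    for dtypes in dtypes_per_file:
        for name, dtype in dtypes:
            merged.setdefault(name, dtype)
    # Backticks protect column names containing spaces or other special characters.
    return ", ".join(f"`{name}` {dtype}" for name, dtype in merged.items())


# Hypothetical dtypes; on a cluster they could come from something like
# [spark.read.parquet(p).dtypes for p in paths[:10]]
sample = [
    [("id", "int"), ("name", "string")],
    [("id", "int"), ("name", "string"), ("age", "int")],
    [("id", "int"), ("city", "string")],
]
ddl = union_ddl(sample)
print(ddl)
# → `id` int, `name` string, `age` int, `city` string
```

The string could then be applied with `spark.read.schema(ddl).parquet("dbfs:/mnt/blob1/myfiles/")`. Limiting the sample (here the first 10 paths) bounds how many files are inspected, at the cost of missing columns that only appear in files outside the sample.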

Posted 2022-11-01
