Read multiple datasets from a single file - pyspark

Posted by bzzcjhmw on 2023-03-01 in Spark

I have a fixed-width file that looks like this:

H10001234567ABC
D123......
D124......
D125......
T10000003
H10001234567DEF
D234......
D235......
D236......
T10000003

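The first line, starting with H, is the header of the first dataset. It is followed by the detail records and then by a trailer record that carries the number of detail records. A file can contain several such groups. My goal is to load the header, detail, and trailer records into 3 separate DataFrames that share a common key, so they can be joined together, like this: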

Header DF:

Key | Attribute1 | Attribute2 | Attribute3
1   | H1000      | 1234567    | ABC
2   | H1000      | 1234567    | DEF

Detail DF:

Key | Attribute1 | ....
1   | 123        | ....
1   | 124        | ....
1   | 125        | ....
2   | 234        | ....
2   | 235        | ....
2   | 236        | ....

Trailer DF:

Key | Attribute1 | Count
1   | 1000       | 3
2   | 1000       | 3

What is the best way to do this? Thanks.

pyspark

Source: https://stackoverflow.com/questions/75547344/read-multiple-datasets-from-a-single-file-pyspark

1 Answer

yr9zkbsy1#

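With the following solution you can split the records into 3 DataFrames. You can then use the substring transformation (pyspark.sql.functions.substring) to split each data string into columns.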

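import pyspark.sql.functions as f

# Read each line of the file into a single string column named "value".
# `spark` is the active SparkSession (predefined in the pyspark shell).
input_df = spark.read.text('<path_to_input_file>')

# Header records: lines starting with "H".
header_df = (
    input_df
    .where(f.col('value').rlike('^H.*$'))
)

# Detail records: lines starting with "D".
detail_df = (
    input_df
    .where(f.col('value').rlike('^D.*$'))
)

# Trailer records: lines starting with "T".
trailer_df = (
    input_df
    .where(f.col('value').rlike('^T.*$'))
)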

Answered 2023-03-01
