PySpark: generate multiple files based on DataFrame groupBy

dojqjjoe posted on 2021-05-27 in Spark


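With a pandas DataFrame I can group a large dataset and write multiple CSV or Excel files. How can I do the same with a PySpark DataFrame: group 700k records into roughly 230 groups and produce 230 CSV files, one per country?
Using pandas: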

grouped = df.groupby("country_code")

# run this to generate separate Excel files

for country_code, group in grouped:
    group.to_excel(excel_writer=f"{country_code}.xlsx", sheet_name=country_code, index=False)

With a PySpark DataFrame, when I try something like this:

for country_code, df_country in df.groupBy('country_code'):
    print(country_code,df_country.show(1))

it fails with:
TypeError: 'GroupedData' object is not iterable

python apache-spark pyspark apache-spark-sql pandas

Source: https://stackoverflow.com/questions/63146486/pyspark-make-multiple-files-based-on-dataframe-groupby

2 Answers


j2datikz #1

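Use partitionBy when writing, so that the output is split into one partition per value of the specified column (country_code in your case).
There is more information here.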

2021-05-27

zpqajqem #2

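If your requirement is to save each country's data separately, you can achieve it by partitioning the data, but you will get a folder per country rather than a file, because Spark's DataFrame writer always writes to a directory of part files rather than to a single named file.
Whenever the DataFrame writer is called, Spark creates a folder.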

df.write.partitionBy('country_code').csv(path)

The output will be multiple folders, each holding the data for the corresponding country:

path/country_code=india/part-0000.csv
path/country_code=australia/part-0000.csv

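If you want a single file in each folder, repartition the data by the same column first, so that all rows for a country end up in the same partition: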

df.repartition('country_code').write.partitionBy('country_code').csv(path)
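If plain files named after each country are needed rather than folders, the part files can be copied out once the write has finished. The sketch below uses only the Python standard library and assumes the output sits on a local filesystem, is uncompressed CSV, and contains exactly one part file per folder (as produced by the repartition call above). The function name flatten_partitions and the destination layout are hypothetical; on HDFS or object storage the same idea would need that storage system's own file API.

```python
import shutil
from pathlib import Path


def flatten_partitions(src: str, dest: str, column: str = "country_code") -> list:
    """Copy each partition folder's single part file to dest/<value>.csv.

    Hypothetical helper: assumes folders named '<column>=<value>' under `src`.
    """
    out_dir = Path(dest)
    out_dir.mkdir(parents=True, exist_ok=True)
    prefix = f"{column}="
    written = []
    for folder in sorted(Path(src).glob(f"{prefix}*")):
        if not folder.is_dir():
            continue
        # 'part-*' skips Spark's hidden '.part-*.crc' checksum files.
        parts = sorted(folder.glob("part-*"))
        if len(parts) != 1:
            raise ValueError(f"expected one part file in {folder}, found {len(parts)}")
        value = folder.name[len(prefix):]
        target = out_dir / f"{value}.csv"
        shutil.copyfile(parts[0], target)
        written.append(target)
    return written
```

Calling flatten_partitions(path, "flat") after the write would leave files such as flat/india.csv and flat/australia.csv. The folder name is used as-is for the file name, so partition values containing characters that Spark escapes in directory names would carry that escaping over.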

2021-05-27
