Saving a Spark DataFrame before writing it to Snowflake

Asked by yquaqz18 on 2021-05-27, in Spark

I'm working in PySpark, and before I get to a final output table I run a series of transformations and apply user-defined functions. The last command, the write to Snowflake, takes about 25 minutes to run because it also triggers all of the computation: Spark is lazy and evaluates nothing until that final call. I'd like to force evaluation of the final table before the write step, so I can time how long all the transformations take and, separately, how long the write to Snowflake takes. How can I separate the two? I tried:

temp = final_df.show() 

temp.write.format(SNOWFLAKE_SOURCE_NAME).options(**sfOptions2) \
.option("dbtable","TEST_SPARK").save()

but I got this error:

'NoneType' object has no attribute 'write'

and with collect():

temp = final_df.collect() 

temp.write.format(SNOWFLAKE_SOURCE_NAME).options(**sfOptions2) \
.option("dbtable","TEST_SPARK").save()

but I got this error:

'list' object has no attribute 'write'
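
For reference, a minimal sketch of why both attempts raise those errors (using a throwaway DataFrame and an assumed SparkSession called `spark`, purely for illustration): `show()` is an action that prints rows and returns `None`, while `collect()` returns a plain Python list of `Row` objects, and neither of those has a `.write` attribute.

```
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()  # assumed session, for illustration only
demo_df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "val"])

print(type(demo_df.show()))     # <class 'NoneType'> -> no .write attribute
print(type(demo_df.collect()))  # <class 'list'>     -> no .write attribute
print(demo_df.write)            # a DataFrameWriter, available only on the DataFrame itself
```
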
Answer from vaqhlq81:

Your `temp` variable holds the result of `.show()`, which returns `None`, so `temp` is not a DataFrame at all; only a DataFrame exposes the `.write` method for saving to a source. Try the code below:

```
temp = final_df

# view records from the temp dataframe
temp.show()

temp.write.format(SNOWFLAKE_SOURCE_NAME).options(**sfOptions2) \
    .option("dbtable","TEST_SPARK").save()

# collect() gathers the data into a Python list and stores it in temp
temp = final_df.collect()

# a list has no .write method, so write from the dataframe itself
final_df.write.format(SNOWFLAKE_SOURCE_NAME).options(**sfOptions2) \
    .option("dbtable","TEST_SPARK").save()
```
Update:

```
import time

start_time = time.time()

# code up to the show()
temp = final_df

# view records from the temp dataframe
temp.show()

end_time = time.time()
print("Total execution time for action: {} seconds".format(end_time - start_time))

start_time_sfw = time.time()

# write to snowflake
final_df.write.format(SNOWFLAKE_SOURCE_NAME).options(**sfOptions2) \
    .option("dbtable","TEST_SPARK").save()

end_time_sfw = time.time()
print("Total execution time for writing to snowflake: {} seconds".format(end_time_sfw - start_time_sfw))
```
