我在pyspark中工作,在得到一个最终的输出表之前,我做了一系列的转换并应用了用户定义的函数。最后一个写入snowflake的命令需要大约25分钟才能运行,因为它也在执行所有的计算,因为spark的计算很慢,直到最后一个调用才进行计算。我想在步骤之前对最终的表进行求值,这样我就可以计算所有转换所需的时间,然后分别计算写入雪花步骤所需的时间。我怎么把两者分开?我试过做:
temp = final_df.show()
temp.write.format(SNOWFLAKE_SOURCE_NAME).options(**sfOptions2) \
.option("dbtable","TEST_SPARK").save()
但我有个错误:
'NoneType' object has no attribute 'write'
和collect()
temp = final_df.collect()
temp.write.format(SNOWFLAKE_SOURCE_NAME).options(**sfOptions2) \
.option("dbtable","TEST_SPARK").save()
但我有个错误:
'list' object has no attribute 'write'
1条答案
按热度按时间vaqhlq811#
你的
temp
Dataframe的结果为.show()
结果没有temp变量的类型,只有dataframe
有.write
方法到源。Try with below code:
```temp = final_df
view records from temp dataframe
temp.show()
temp.write.format(SNOWFLAKE_SOURCE_NAME).options(**sfOptions2)
.option("dbtable","TEST_SPARK").save()
collect collects the data as list and stores into temp variable
temp = final_df.collect()
list attributes doesn't have .write method
final_df.write.format(SNOWFLAKE_SOURCE_NAME).options(**sfOptions2)
.option("dbtable","TEST_SPARK").save()
`Update:`
import time
start_time = time.time()
code until show()
temp = final_df
view records from temp dataframe
temp.show()
end_time = time.time()
print("Total execution time for action: {} seconds".format(end_time - start_time))
start_time_sfw = time.time()
code until show()
final_df.write.format(SNOWFLAKE_SOURCE_NAME).options(**sfOptions2)
.option("dbtable","TEST_SPARK").save()
end_time_sfw = time.time()
print("Total execution time for writing to snowflake: {} seconds".format(end_time_sfw - start_time_sfw))