My questions are:
Why does Spark create multiple stages to scan the Hive table even though I have cached the DataFrame?
Why does the number of stages decrease when I repartition the DataFrame before caching it?
Scenario 1 (pseudocode):
large_source_df.cache()
small_source1_df.cache()
small_source2_df.cache()
small_source3_df.cache()
res1_df = large_source_df.join(broadcast(small_source1_df)).filter(...)
res2_df = large_source_df.join(broadcast(small_source2_df)).filter(...)
res3_df = large_source_df.join(broadcast(small_source3_df)).filter(...)
union_df = res1_df.union(res2_df).union(res3_df).count()
In this case, large_source_df is used three times, and even though it is cached, it looks like the Hive table is scanned three times.
Scenario 2 (pseudocode): if I change the code and add a repartition before caching,
large_source_df.repartition(200, $"userid").cache()
small_source1_df.cache()
small_source2_df.cache()
small_source3_df.cache()
res1_df = large_source_df.join(broadcast(small_source1_df)).filter(...)
res2_df = large_source_df.join(broadcast(small_source2_df)).filter(...)
res3_df = large_source_df.join(broadcast(small_source3_df)).filter(...)
union_df = res1_df.union(res2_df).union(res3_df).count()
In that case the table is only scanned once.
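One way to inspect which scans actually end up in the physical plan is `explain()`. Below is a minimal sketch of how the two scenarios can be compared; it assumes a running `SparkSession` named `spark`, and the table name `db.large_table`, column `userid`, and filter condition are hypothetical placeholders. Note that `cache()` is lazy, so the cached data is only materialized after the first action touches the DataFrame.

```scala
import org.apache.spark.sql.functions.broadcast
import spark.implicits._

// Hypothetical table and column names for illustration.
val large_source_df = spark.table("db.large_table")
  .repartition(200, $"userid")
  .cache()
large_source_df.count()  // cache() is lazy; an action materializes the cached blocks

val small_source1_df = spark.table("db.small_table1").cache()

val res1_df = large_source_df
  .join(broadcast(small_source1_df), Seq("userid"))
  .filter($"flag" === 1)

// If the cache is reused, the physical plan shows an InMemoryTableScan
// node instead of another HiveTableScan/FileScan of the source table.
res1_df.explain()
```

Comparing the `explain()` output (or the SQL tab of the Spark UI) with and without the `repartition` step should show whether each join branch reads from `InMemoryTableScan` or re-scans the Hive table.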