PySpark: efficient way to compute the average of two tables on common columns

Posted by c9qzyr3d on 2021-06-25 in Hive


Suppose I have two PySpark DataFrames with a similar structure, T1 and T2.

T1
ID | balance | interest | T1_non_repeated_1 | T1_non_repeated_2
T2
ID | balance | interest | T2_non_repeated_1 | T2_non_repeated_2 | T2_not_repeated_3

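I want to create a table that holds, for every column the two tables have in common, the average of the T1 and T2 values, using the IDs in T2 as the base.
My idea so far in PySpark (pseudocode) is: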

T2.left_join(T1).withColumn("balance",(balance1+balance2)/2).withColumn("interest", (interest1+interest2)/2)....

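My question is:
This becomes a very long command in PySpark if the two tables have, say, 100 common columns. Is there another way to write it so that the expressions for all 100 common columns are generated dynamically?
Other suggestions are welcome.
Thank you.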
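One possible way to generate the per-column expressions dynamically is sketched below. It is only a sketch under stated assumptions: the helper names `average_exprs` and `average_common_columns` are hypothetical, the join key is assumed to be a single column named `ID`, every other column that appears in both tables is assumed to be numeric, and the result is assumed to keep T2's remaining columns unchanged. The expressions are built as SQL strings and passed to `selectExpr`, so the block itself needs no imports.

```python
def average_exprs(t1_cols, t2_cols, key="ID"):
    """Build one SQL expression string per T2 column (hypothetical helper).

    Columns present in both tables are averaged, the join key is passed
    through, and columns that exist only in T2 are kept unchanged.
    """
    t1_set = set(t1_cols)
    exprs = []
    for c in t2_cols:
        if c == key:
            # After a join on the key name there is a single key column.
            exprs.append(f"`{c}`")
        elif c in t1_set:
            # Common column: average the two sides, keep the original name.
            exprs.append(f"(t1.`{c}` + t2.`{c}`) / 2 AS `{c}`")
        else:
            # Column that only T2 has.
            exprs.append(f"t2.`{c}`")
    return exprs


def average_common_columns(t1, t2, key="ID"):
    """Left-join T1 onto T2 and average every common column (sketch)."""
    # Aliases let the generated expressions tell the two sides apart.
    joined = t2.alias("t2").join(t1.alias("t1"), on=key, how="left")
    return joined.selectExpr(*average_exprs(t1.columns, t2.columns, key))
```

With the tables above this would be called as `result = average_common_columns(T1, T2)`. Because it is a left join, an ID that exists only in T2 has NULL on the T1 side, so its averages come out as NULL; whether that or a fallback to the T2 value is wanted depends on the use case.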

Hive python pyspark apache-spark-sql pyspark-dataframes

Source: https://stackoverflow.com/questions/60379752/pyspark-efficient-way-to-compute-average-of-two-tables-on-common-columns
