Consider the arrays shown below. I have 3 sets of arrays:
Array 1:
C1 C2 C3
1 2 3
9 5 6
Array 2:
C2 C3 C4
11 12 13
10 15 16
Array 3:
C1 C4
111 112
110 115
The output I need is shown below. As input I may receive any of the columns C1, …, C4, but when the frames are combined I need the correct values, and wherever a value does not exist it should be zero.
Expected output:
C1 C2 C3 C4
1 2 3 0
9 5 6 0
0 11 12 13
0 10 15 16
111 0 0 112
110 0 0 115
I have already written PySpark code, but I have hard-coded the new columns and their fill values. I need to convert the code below into something like method overloading so that I can use this script as an automated one. I have to use only Python/PySpark, not pandas.
from pyspark.sql import SparkSession
from pyspark.sql.functions import lit

spark = SparkSession.builder.getOrCreate()

df01 = spark.createDataFrame([(1, 2, 3), (9, 5, 6)], ("C1", "C2", "C3"))
df02 = spark.createDataFrame([(11, 12, 13), (10, 15, 16)], ("C2", "C3", "C4"))
df03 = spark.createDataFrame([(111, 112), (110, 115)], ("C1", "C4"))

# Hard-coded: each frame gets its missing columns filled with 0,
# then the columns are reordered to the common schema C1..C4.
df01_add = df01.withColumn("C4", lit(0)).select("C1", "C2", "C3", "C4")
df02_add = df02.withColumn("C1", lit(0)).select("C1", "C2", "C3", "C4")
df03_add = df03.withColumn("C2", lit(0)).withColumn("C3", lit(0)).select("C1", "C2", "C3", "C4")

df_uni = df01_add.union(df02_add).union(df03_add)
df_uni.show()
Example of method overloading (Python simulates overloading with default arguments):

class Student:
    def __init__(self, m1, m2):
        self.m1 = m1
        self.m2 = m2

    def sum(self, c1=None, c2=None, c3=None, c4=None):
        # Behave differently depending on how many arguments were passed.
        s = 0
        if c1 is not None and c2 is not None and c3 is not None:
            s = c1 + c2 + c3
        elif c1 is not None and c2 is not None:
            s = c1 + c2
        else:
            s = c1
        return s

s1 = Student(1, 2)
print(s1.sum(55, 65, 23))  # 143
3 Answers
6qqygrtg1#
There may well be better ways to do this, but perhaps the following will be useful to someone in the future.
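One way to generalize the hard-coded `withColumn`/`select` calls from the question is to first compute the union of all column names, then add whichever columns each DataFrame is missing. A minimal sketch along those lines (the helper names `master_schema` and `missing_cols` are my own, and sorting the master schema alphabetically is an assumption; any fixed order works):

```python
def master_schema(schemas):
    """Union of all column names across the input schemas, sorted."""
    cols = set()
    for s in schemas:
        cols.update(s)
    return sorted(cols)

def missing_cols(schema, master):
    """Columns of the master schema absent from one DataFrame's schema."""
    return [c for c in master if c not in schema]

# Applying it with PySpark (df_list is assumed to be a list of DataFrames):
# from functools import reduce
# from pyspark.sql.functions import lit
# master = master_schema([df.columns for df in df_list])
# aligned = []
# for df in df_list:
#     for c in missing_cols(df.columns, master):
#         df = df.withColumn(c, lit(0))
#     aligned.append(df.select(master))
# result = reduce(lambda a, b: a.union(b), aligned)
```

Because the schema arithmetic is plain Python, this works unchanged for any number of DataFrames with any mix of C1..C4 columns.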
i5desfxk2#
Here is a Scala version of this:
https://stackoverflow.com/a/60702657/9445912
From the question:
Spark - Merge/union DataFrames with different schemas (column names and order) into a DataFrame with a master common schema
xzlaal3s3#
I would give this a try.