PySpark: sequential estimate function, in parallel

Posted by piah890a on 2023-11-16 in Spark


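I have a function like the one below and want to compute the estimate per groupBy key. Within each group, the data must be sorted by time.
This does not look easy (or even possible) with a Spark DataFrame, so I tried an RDD. But even with a custom partitioner (number of partitions = number of groups), the shuffle across partitions/records returns partly wrong results (not for all groups).
How can I avoid this and compute the estimate in parallel, with clean grouping and the given ordering? Thanks in advance, Christian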

func

def estimate(rows):
    estimated = float(0.0)
    result = []
    for row in rows:
        time, key, available, level, reduction, total = row
        if level >= 0.3:
            estimated += float(available - reduction)
            estimated = min(estimated, total)
        else:
            estimated = float(0.0)
        result.append((time, key, available, level, reduction, total, estimated))
    return iter(result)


Approach using mapPartitions

def partition_func(key):
    return hash(key)
rdd = df_input.rdd.map(lambda row: (row["key"], row))
partitioned_rdd = rdd.partitionBy(numPartitions=n, partitionFunc=partition_func)
new_df = (partitioned_rdd.map(lambda x: x[1])        
          .mapPartitions(estimate)
          .toDF()
         )

An approach using groupByKey().flatMapValues(estimate) does not give clean results when run in parallel either.
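
One possible explanation for the wrong results, sketched in plain Python rather than Spark: `partitionBy` places a record in partition `partitionFunc(key) % numPartitions`, so two keys can land in the same partition even when `n` equals the number of groups. `estimate` then carries its running value from one key into the next, and the rows inside a partition are not sorted by time. The sample rows and the helper `simulate_partitions` are made up for illustration.

```python
from collections import defaultdict

def estimate(rows):
    """Running estimate from the question, reduced to (time, key, estimated)."""
    estimated = 0.0
    result = []
    for time, key, available, level, reduction, total in rows:
        if level >= 0.3:
            estimated = min(estimated + float(available - reduction), total)
        else:
            estimated = 0.0
        result.append((time, key, estimated))
    return result

def simulate_partitions(rows, n):
    """Hypothetical helper: bucket rows by hash(key) % n, keeping arrival order."""
    partitions = defaultdict(list)
    for row in rows:
        partitions[hash(row[1]) % n].append(row)
    return partitions

# Made-up rows: (time, key, available, level, reduction, total).
rows = [
    (2, 1, 5.0, 0.5, 1.0, 100.0),
    (1, 1, 3.0, 0.5, 1.0, 100.0),
    (1, 5, 4.0, 0.5, 0.0, 100.0),
]

# Keys 1 and 5 collide: the hash of a small int is the int itself, and 5 % 4 == 1.
partitions = simulate_partitions(rows, n=4)

# Key 5 inherits the running value of key 1, and key 1 is processed out of time order.
mixed = estimate(partitions[1])
```

In this sketch the single row for key 5 ends up with an estimate of 10.0 instead of 4.0, because the accumulator is never reset between keys.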

pyspark

Source: https://stackoverflow.com/questions/77352924/pyspark-sequential-estimate-function-parallel

1 answer


kq4fsx7k1#

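I would group the DataFrame by key and then use applyInPandas to compute the estimate for each group. Spark distributes the groups, one per unique key, across the workers, so the groups are processed in parallel while each group is handled sequentially.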

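from pyspark.sql import types as T

def estimate(pdf):
    # pdf holds all rows of one key; process them in time order.
    acc, result = 0.0, []
    pdf = pdf.sort_values('time')
    for r in pdf.itertuples():
        if r.level >= 0.3:
            # Accumulate, capped at the row's total.
            acc = min(acc + float(r.available - r.reduction), float(r.total))
        else:
            # Reset below the 0.3 threshold, as in the question's function.
            acc = 0.0
        result.append(acc)
    return pdf.assign(estimated=result)

# Output schema: all input columns plus the new 'estimated' column.
schema = T.StructType([*df_input.schema.fields, T.StructField('estimated', T.DoubleType())])
df_result = df_input.groupBy('key').applyInPandas(estimate, schema=schema)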
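
To sanity-check the output on a small sample, the same per-key logic can be written without Spark. This is only a sketch: `expected_estimates` is a hypothetical helper, the sample rows are made up, and rows are assumed to be dicts with the question's column names.

```python
from itertools import groupby
from operator import itemgetter

def expected_estimates(rows):
    """Reference result: per key, in time order, with the level-based reset."""
    out = []
    ordered = sorted(rows, key=itemgetter("key", "time"))
    for key, group in groupby(ordered, key=itemgetter("key")):
        acc = 0.0  # the running value starts fresh for every key
        for r in group:
            if r["level"] >= 0.3:
                acc = min(acc + float(r["available"] - r["reduction"]), float(r["total"]))
            else:
                acc = 0.0
            out.append({**r, "estimated": acc})
    return out

# Made-up sample data, deliberately out of time order.
sample = [
    {"time": 2, "key": "a", "available": 5.0, "level": 0.5, "reduction": 1.0, "total": 5.0},
    {"time": 1, "key": "a", "available": 3.0, "level": 0.5, "reduction": 1.0, "total": 5.0},
    {"time": 3, "key": "a", "available": 9.0, "level": 0.1, "reduction": 0.0, "total": 5.0},
    {"time": 1, "key": "b", "available": 4.0, "level": 0.9, "reduction": 0.0, "total": 5.0},
]
reference = expected_estimates(sample)
```

With Spark available, a small sample could be fed in via something like `[r.asDict() for r in df_input.collect()]` and the `estimated` values compared against `df_result`.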


Answered on 2023-11-16
