PySpark: for each row, count the rows of another table that satisfy a condition

Posted by gajydyqb on 2021-05-27 in Spark


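For each row in table 1, I am trying to count the rows in table 2 that satisfy a condition based on the value in table 1.
The age in table 1 should lie between the start age and the end age in table 2, bounds included.
Is it possible to do this with a UDF and withColumn? I tried two approaches, a plain withColumn and a withColumn with a UDF, but both failed.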

def counter(a):
    return table2.where((table2.StartAge <= a) & (table2.EndAge >=a)).count()

counter_udf = udf(lambda age: counter(age), IntegerType())

table1 = table1.withColumn('Count', counter_udf('Age ID'))

Does this make sense? Thanks.
Example input and output:

python apache-spark pyspark apache-spark-sql databricks

Source: https://stackoverflow.com/questions/63119898/pyspark-for-each-row-count-another-table-based-on-condition

2 Answers


of1yzvn41#

If you want to use a UDF in your script, you must first register it with Spark.
This line of code should help fix the error:

_ = spark.udf.register("counter_udf", counter_udf)
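
Registration makes the UDF callable from SQL, but it may not be enough on its own here: the UDF in the question calls table2.where(...) inside the function, and a DataFrame cannot be used from code that runs on the executors. A minimal sketch of one workaround follows, assuming table2 is small enough to collect to the driver. The helper names count_matching_ranges and add_count_column are hypothetical and not from the original post; the column names StartAge, EndAge and 'Age ID' are taken from the question.

```python
from typing import List, Optional, Tuple


def count_matching_ranges(age: Optional[int], ranges: List[Tuple[int, int]]) -> Optional[int]:
    """Count the (start, end) ranges that contain age, both bounds inclusive."""
    if age is None:
        # Spark passes SQL NULL as None; propagate it instead of comparing.
        return None
    return sum(1 for start, end in ranges if start <= age <= end)


def add_count_column(table1, table2):
    """Return table1 with a 'Count' column (hypothetical helper)."""
    # Imported lazily so the pure-Python function above works without Spark.
    from pyspark.sql.functions import col, udf
    from pyspark.sql.types import IntegerType

    # Assumption: table2 is small, so its ranges fit in driver memory.
    ranges = [
        (row["StartAge"], row["EndAge"])
        for row in table2.select("StartAge", "EndAge").collect()
    ]

    # The lambda closes over a plain Python list, not a DataFrame,
    # so it can be serialized and shipped to the executors.
    counter_udf = udf(lambda age: count_matching_ranges(age, ranges), IntegerType())
    return table1.withColumn("Count", counter_udf(col("Age ID")))
```

With the question's DataFrames this would be called as table1 = add_count_column(table1, table2). For a large table2, a join is likely the better fit.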

2021-05-27

qaxu7uf22#

Take a look at this. You can do it with Spark SQL.

from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName('SO') \
    .getOrCreate()

# table1: one age per row
df = spark.createDataFrame([(3,), (4,), (5,)], ['age'])

# table2: inclusive (age_start, age_end) ranges
df1 = spark.createDataFrame([(0, 10), (7, 15), (5, 10), (3, 20), (5, 35), (4, 5)],
                            ['age_start', 'age_end'])

# Register both DataFrames as temporary views so Spark SQL can query them
df.createOrReplaceTempView("table1")
df1.createOrReplaceTempView("table2")

# Join each age to every range that contains it, then count the matches per age
spark.sql('select t1.age as age_id, count(*) as count from table1 t1 join table2 t2 on t1.age >= t2.age_start and t1.age <= t2.age_end group by t1.age order by count').show()

# +------+-----+
# |age_id|count|
# +------+-----+
# |     3|    2|
# |     4|    3|
# |     5|    5|
# +------+-----+
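
The same range join can also be written with the DataFrame API instead of a SQL string. The sketch below is one possible version, not the answerer's code: the function name count_ranges_per_age is hypothetical, and it assumes the column names age, age_start and age_end used in this answer. It uses a left join, which is a deliberate change from the inner join in the SQL query.

```python
def count_ranges_per_age(table1, table2):
    """Count, for each age in table1, the table2 ranges that contain it."""
    # Imported lazily so that defining the function does not require Spark.
    from pyspark.sql import functions as F

    t1 = table1.alias("t1")
    t2 = table2.alias("t2")

    # Inclusive condition: age_start <= age <= age_end
    in_range = F.col("t1.age").between(F.col("t2.age_start"), F.col("t2.age_end"))

    return (
        t1.join(t2, on=in_range, how="left")
        .groupBy(F.col("t1.age").alias("age_id"))
        # count() skips NULLs, so ages with no matching range get 0
        .agg(F.count(F.col("t2.age_start")).alias("count"))
        .orderBy("count")
    )
```

Calling count_ranges_per_age(df, df1).show() on the DataFrames above should give the same three rows. Because of the left join, an age that falls in no range is kept with a count of 0, whereas the inner join in the SQL query drops it.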

2021-05-27
