pyspark窗口函数中意外的帧结构

of1yzvn4 于 2021-05-24 发布在 Spark

关注(0)|答案(1)|浏览(464)

我试图理解在使用窗口函数和看到一些意外行为时框架是如何构造的。这是代码（可以按原样运行）：

from pyspark.sql import SparkSession
from pyspark.sql.types import StructField, StructType, FloatType, StringType, IntegerType

spark = SparkSession.builder.getOrCreate()

schema = StructType([StructField('product', StringType(), True),
                     StructField('category', StringType(), True),
                     StructField('revenue', IntegerType(), True)])

spark.createDataFrame(

    [
        ("Thin", "Cell Phone", 6000),
        ("Normal", "Tablet", 1500),
        ("Mini", "Tablet", 5500),
        ("Ultra thin", "Cell Phone", 5000),
        ("Very thin", "Cell Phone", 6000),
        ("Big", "Tablet", 2500),
        ("Bendable", "Cell Phone", 3000),
        ("Foldable", "Cell Phone", 3000),
        ("Pro", "Tablet", 4500),
        ("Pro2", "Tablet", 6500),
     ] ,schema=schema ).createOrReplaceTempView("revenues")

# using collect list to print the contents of the current frame

# using min to show the unexpected behavior

spark.sql("""
    select product,
           category,
           revenue,
           collect_list((product, revenue)) over (partition by category order by revenue desc) frame,
           min(revenue) over (partition by category order by revenue desc) as min
    from revenues
""").show(truncate=False)

这是输出：

+----------+----------+-------+-----------------------------------------------------------------------------------------+----+
|product   |category  |revenue|frame                                                                                    |min |
+----------+----------+-------+-----------------------------------------------------------------------------------------+----+
|Pro2      |Tablet    |6500   |[[Pro2, 6500]]                                                                           |6500|
|Mini      |Tablet    |5500   |[[Pro2, 6500], [Mini, 5500]]                                                             |5500|
|Pro       |Tablet    |4500   |[[Pro2, 6500], [Mini, 5500], [Pro, 4500]]                                                |4500|
|Big       |Tablet    |2500   |[[Pro2, 6500], [Mini, 5500], [Pro, 4500], [Big, 2500]]                                   |2500|
|Normal    |Tablet    |1500   |[[Pro2, 6500], [Mini, 5500], [Pro, 4500], [Big, 2500], [Normal, 1500]]                   |1500|
|Thin      |Cell Phone|6000   |[[Thin, 6000], [Very thin, 6000]]                                                        |6000|
|Very thin |Cell Phone|6000   |[[Thin, 6000], [Very thin, 6000]]                                                        |6000|
|Ultra thin|Cell Phone|5000   |[[Thin, 6000], [Very thin, 6000], [Ultra thin, 5000]]                                    |5000|
|Bendable  |Cell Phone|3000   |[[Thin, 6000], [Very thin, 6000], [Ultra thin, 5000], [Bendable, 3000], [Foldable, 3000]]|3000|
|Foldable  |Cell Phone|3000   |[[Thin, 6000], [Very thin, 6000], [Ultra thin, 5000], [Bendable, 3000], [Foldable, 3000]]|3000|
+----------+----------+-------+-----------------------------------------------------------------------------------------+----+

现在的问题是：
在这种情况下，window函数不指定对帧的任何限制。我希望框架在所有行中都是相似的（并且对应于完整的值组）。在实践中，我看到框架会扩展（不确定根据什么标准）。这会导致“min”函数返回不正确的值-对于“tablet”类别为6500，而不是1500。
如果在窗口规范中删除order by子句，则会看到预期的行为：

collect_list((product, revenue)) over (partition by category) frame, 
min(revenue) over (partition by category) as min

产量：

+----------+----------+-------+-----------------------------------------------------------------------------------------+----+
|product   |category  |revenue|frame                                                                                    |min |
+----------+----------+-------+-----------------------------------------------------------------------------------------+----+
|Normal    |Tablet    |1500   |[[Normal, 1500], [Mini, 5500], [Big, 2500], [Pro, 4500], [Pro2, 6500]]                   |1500|
|Mini      |Tablet    |5500   |[[Normal, 1500], [Mini, 5500], [Big, 2500], [Pro, 4500], [Pro2, 6500]]                   |1500|
|Big       |Tablet    |2500   |[[Normal, 1500], [Mini, 5500], [Big, 2500], [Pro, 4500], [Pro2, 6500]]                   |1500|
|Pro       |Tablet    |4500   |[[Normal, 1500], [Mini, 5500], [Big, 2500], [Pro, 4500], [Pro2, 6500]]                   |1500|
|Pro2      |Tablet    |6500   |[[Normal, 1500], [Mini, 5500], [Big, 2500], [Pro, 4500], [Pro2, 6500]]                   |1500|
|Thin      |Cell Phone|6000   |[[Thin, 6000], [Ultra thin, 5000], [Very thin, 6000], [Bendable, 3000], [Foldable, 3000]]|3000|
|Ultra thin|Cell Phone|5000   |[[Thin, 6000], [Ultra thin, 5000], [Very thin, 6000], [Bendable, 3000], [Foldable, 3000]]|3000|
|Very thin |Cell Phone|6000   |[[Thin, 6000], [Ultra thin, 5000], [Very thin, 6000], [Bendable, 3000], [Foldable, 3000]]|3000|
|Bendable  |Cell Phone|3000   |[[Thin, 6000], [Ultra thin, 5000], [Very thin, 6000], [Bendable, 3000], [Foldable, 3000]]|3000|
|Foldable  |Cell Phone|3000   |[[Thin, 6000], [Ultra thin, 5000], [Very thin, 6000], [Bendable, 3000], [Foldable, 3000]]|3000|
+----------+----------+-------+-----------------------------------------------------------------------------------------+----+

我的假设不正确吗？如果是的话-在这两种情况下，框架结构的力学原理是什么？
在spark 3.0.1上测试

sql apache-spark pyspark window-functions

来源：https://stackoverflow.com/questions/64123681/unexpected-frame-structure-in-pyspark-window-function