I am trying to use a PySpark stream to get the distinct count of another column per groupBy group, but the count I get is not correct.
Function created:
from pyspark.sql.functions import weekofyear, approx_count_distinct

def silverToGold(silverPath, goldPath, queryName):
    return (spark.readStream
        .format("delta")
        .load(silverPath)
        .withColumn("week", weekofyear("eventDate"))
        .groupBy("week")
        # approx_count_distinct, not approx_distinct.count (which does not exist)
        .agg(approx_count_distinct("device_id").alias("WAU"))
        .writeStream
        .format("delta")
        .option("checkpointLocation", goldPath + "/_checkpoint")
        .queryName(queryName)
        .outputMode("complete")
        .start(goldPath))
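Note that `approx_distinct.count` is not a PySpark name (as written it would raise a `NameError`); the aggregate function is `approx_count_distinct`. The actual numbers also look like total event counts rather than distinct device counts. A minimal pure-Python sketch (toy data, invented for illustration) of the difference between the two aggregations:

```python
from collections import defaultdict

# Toy event log: (week, device_id) pairs, invented for illustration.
events = [
    (1, "a"), (1, "a"), (1, "b"),             # week 1: 3 events, 2 devices
    (2, "a"), (2, "c"), (2, "c"), (2, "c"),   # week 2: 4 events, 2 devices
]

row_counts = defaultdict(int)
distinct = defaultdict(set)
for week, device in events:
    row_counts[week] += 1       # what a plain count() per group measures
    distinct[week].add(device)  # what (approx_)count_distinct estimates

print({w: (row_counts[w], len(distinct[w])) for w in sorted(row_counts)})
# → {1: (3, 2), 2: (4, 2)}
```

If the streaming job ever aggregated with a plain count, the output would be inflated in exactly the way the actual result shows.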
Expected Result:
week WAU
1 7
2 4
3 9
4 9
Actual Result:
week WAU
1 7259
2 7427
3 7739
4 7076
Sample input data (text format):
device_id,event_name,client_event_time,event_date,device_type
00007d948fbe4d239b45fe59bfbb7e64,scoreadjustment,2018-06-01t16:55:40.000+0000,2018-06-01,android
00007d948fbe4d239b45fe59bfbb7e64,scoreadjustment,2018-06-01t16:55:34.000+0000,2018-06-01,android
0000a99151154e4eb14c675e8b42db34,scoreadjustment,2019-08-18t13:39:36.000+0000,2019-08-18,ios
0000b1e931d947b197385ac1cbb25779,scoreadjustment,2018-07-16t09:13:45.000+0000,2018-07-16,android
0003939e705949e4a184e0a853b6e0af,scoreadjustment,2018-07-17t17:59:05.000+0000,2018-07-17,android
0003e14ca9ba4198b51cec7d2761d391,scoreadjustment,2018-06-10t09:09:12.000+0000,2018-06-10,ios
00056f7c73c9497180f2e0900a0626e3,scoreadjustment,2019-07-05t18:31:10.000+0000,2019-07-05,ios
0006ace2d1db46ba94b802d80a43c20f,scoreadjustment,2018-07-05t14:31:43.000+0000,2018-07-05,ios
000718c45e164fb2b017f146a6b66b7e,scoreadjustment,2019-03-26t08:25:08.000+0000,2019-03-26,android
000807F2EA524B7E27DF8D44AB930,purchaseevent,2019-03-26t22:28:17.000+0000,2019-03-26,android
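The sample spans 2018 and 2019, and `weekofyear` alone merges week N of 2018 with week N of 2019, which can distort weekly counts; grouping by (year, week) keeps them apart. A sketch sanity-checking a subset of the rows above with plain Python (the header names and the `scoreadjustment` event label are my reconstruction of the garbled paste, so treat them as assumptions):

```python
import csv
import io
from datetime import date

# Subset of the sample rows above (layout reconstructed from the paste).
sample = """\
device_id,event_name,client_event_time,event_date,device_type
00007d948fbe4d239b45fe59bfbb7e64,scoreadjustment,2018-06-01t16:55:40.000+0000,2018-06-01,android
00007d948fbe4d239b45fe59bfbb7e64,scoreadjustment,2018-06-01t16:55:34.000+0000,2018-06-01,android
000718c45e164fb2b017f146a6b66b7e,scoreadjustment,2019-03-26t08:25:08.000+0000,2019-03-26,android
000807F2EA524B7E27DF8D44AB930,purchaseevent,2019-03-26t22:28:17.000+0000,2019-03-26,android
"""

devices_per_week = {}
for row in csv.DictReader(io.StringIO(sample)):
    iso_year, iso_week, _ = date.fromisoformat(row["event_date"]).isocalendar()
    # Key by (year, week): week N of 2018 stays separate from week N of 2019.
    devices_per_week.setdefault((iso_year, iso_week), set()).add(row["device_id"])

for key in sorted(devices_per_week):
    print(key, len(devices_per_week[key]))
```

The same idea in the streaming job would be `groupBy(year("eventDate"), weekofyear("eventDate"))` instead of grouping by the week number alone.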
Any suggestions?
1 Answer