Distance to nearest holiday

Posted by lzfw57am on 2021-06-28 in Hive


In pandas, I have a function similar to:

indices = df.dateColumn.apply(holidays.index.searchsorted)
df['nextHolidays'] = holidays.index[indices]
df['previousHolidays'] = holidays.index[indices - 1]

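It computes the distance to the nearest holiday and stores it as a new column. searchsorted (http://pandas.pydata.org/pandas-docs/version/0.18.1/generated/pandas.series.searchsorted.html) was a great solution in pandas because it gives me the index of the next holiday without high algorithmic complexity. For example, this approach was much faster than a parallelized loop (a parallelized pandas apply).
How can I achieve this in Spark or Hive?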

Hive apache-spark apache-spark-sql datediff sorting

Source: https://stackoverflow.com/questions/40752378/spark-sql-distance-to-nearest-holiday

1 Answer


vc9ivgsu1#

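This can be done with aggregations, but that approach has higher complexity than the pandas one. You can get similar performance with a UDF. It won't be as elegant as pandas, but:
Assuming this dataset of holidays: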

holidays = ['2016-01-03', '2016-09-09', '2016-12-12', '2016-03-03']
index = spark.sparkContext.broadcast(sorted(holidays))

And a dataset of dates for 2016 in a DataFrame:

from datetime import datetime, timedelta
dates_array = [(datetime(2016, 1, 1) + timedelta(i)).strftime('%Y-%m-%d') for i in range(366)]
from pyspark.sql import Row
df = spark.createDataFrame([Row(date=d) for d in dates_array])

The UDF could use pandas searchsorted, but that would require installing pandas on the executors. Instead, you can use plain Python like this:

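def nearest_holiday(date):
    # index.value is the broadcast, sorted list of holiday strings.
    last_holiday = index.value[0]
    for next_holiday in index.value:
        # Stop at the first holiday on or after the given date.
        if next_holiday >= date:
            break
        last_holiday = next_holiday
    # The date falls before the first holiday.
    if last_holiday > date:
        last_holiday = None
    # The date falls after the last holiday.
    if next_holiday < date:
        next_holiday = None
    return (last_holiday, next_holiday)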

from pyspark.sql.types import StructType, StructField, StringType
return_type = StructType([StructField('last_holiday', StringType()), StructField('next_holiday', StringType())])

from pyspark.sql.functions import udf
nearest_holiday_udf = udf(nearest_holiday, return_type)
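The loop in nearest_holiday scans the holiday list linearly for every row. As a minimal sketch, the same lookup can be done with a binary search from the standard-library bisect module, which is closer to what searchsorted does and still needs no pandas on the executors. The names nearest_holiday_bisect and SORTED_HOLIDAYS are hypothetical; SORTED_HOLIDAYS stands in for index.value. Note one assumption that differs from the loop version: here a date that is itself a holiday is always returned as both its last and its next holiday, whereas the loop does that only for the first holiday in the list.

```python
from bisect import bisect_left, bisect_right

# Hypothetical stand-in for index.value (the broadcast, sorted holiday list).
SORTED_HOLIDAYS = sorted(['2016-01-03', '2016-09-09', '2016-12-12', '2016-03-03'])

def nearest_holiday_bisect(date, holidays=SORTED_HOLIDAYS):
    """Return (last_holiday, next_holiday) around date using binary search.

    last_holiday is the latest holiday <= date and next_holiday is the
    earliest holiday >= date; either is None when no such holiday exists.
    ISO-formatted date strings sort chronologically, so plain string
    comparison is enough.
    """
    # bisect_right returns the number of holidays <= date.
    i = bisect_right(holidays, date)
    last_holiday = holidays[i - 1] if i > 0 else None
    # bisect_left returns the index of the first holiday >= date.
    j = bisect_left(holidays, date)
    next_holiday = holidays[j] if j < len(holidays) else None
    return (last_holiday, next_holiday)
```

Inside Spark this could presumably be registered the same way as above, for example udf(lambda d: nearest_holiday_bisect(d, index.value), return_type).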

It can then be used with withColumn:

df.withColumn('holiday', nearest_holiday_udf('date')).show(5, False)

+----------+-----------------------+
|date      |holiday                |
+----------+-----------------------+
|2016-01-01|[null,2016-01-03]      |
|2016-01-02|[null,2016-01-03]      |
|2016-01-03|[2016-01-03,2016-01-03]|
|2016-01-04|[2016-01-03,2016-03-03]|
|2016-01-05|[2016-01-03,2016-03-03]|
+----------+-----------------------+
only showing top 5 rows
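
The holiday column above holds the surrounding holidays, not yet the distance the question asks for. A minimal sketch of that last step in plain Python, assuming the dates are strings in the %Y-%m-%d format used above; the helper name days_to_nearest_holiday and the DATE_FORMAT constant are hypothetical and not part of the original answer:

```python
from datetime import datetime

DATE_FORMAT = '%Y-%m-%d'  # assumed format, matching the date strings used above

def days_to_nearest_holiday(date, last_holiday, next_holiday):
    """Return the number of days from date to the closer of the two holidays.

    Either holiday may be None (no holiday on that side); returns None
    when both are missing.
    """
    day = datetime.strptime(date, DATE_FORMAT)
    gaps = [
        abs((datetime.strptime(h, DATE_FORMAT) - day).days)
        for h in (last_holiday, next_holiday)
        if h is not None
    ]
    return min(gaps) if gaps else None
```

Wrapped in a second UDF with an integer return type, this could be applied to the date column together with the two fields of the holiday struct.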

2021-06-28
