I have a dataset of NYC taxi data that I'm trying to filter. The dataset's schema is as follows:
root
|-- medallion: string (nullable = true)
|-- hack_license: string (nullable = true)
|-- vendor_id: string (nullable = true)
|-- rate_code: integer (nullable = true)
|-- store_and_fwd_flag: string (nullable = true)
|-- pickup_datetime: timestamp (nullable = true)
|-- dropoff_datetime: timestamp (nullable = true)
|-- passenger_count: integer (nullable = true)
|-- trip_time_in_secs: integer (nullable = true)
|-- trip_distance: double (nullable = true)
|-- pickup_longitude: double (nullable = true)
|-- pickup_latitude: double (nullable = true)
|-- dropoff_longitude: double (nullable = true)
|-- dropoff_latitude: double (nullable = true)
I want to filter the dataset so that it excludes any entry whose pickup or dropoff time is not between 9 AM and 5 PM. However, I'm having trouble writing the helper methods used in the withColumn calls. Here is what I have for the withColumn calls:
from pyspark.sql.window import Window
from pyspark.sql import functions as fun
taxi_raw.withColumn("pickup_datetime", remove_pickup_times(fun.col("pickup_datetime")))
taxi_raw.withColumn("dropoff_datetime", remove_dropoff_times(fun.col("dropoff_datetime")))
And here is what I have so far for the helper methods:
import datetime
def remove_pickup_times(pickup_datetime):
    time_start = datetime.time(9, 0, 0)
    time_end = datetime.time(17, 0, 0)
    if time_start <= pickup_datetime.time() <= time_end:
        # insert code to remove entry from dataset
        pass

def remove_dropoff_times(dropoff_datetime):
    time_start = datetime.time(9, 0, 0)
    time_end = datetime.time(17, 0, 0)
    if time_start <= dropoff_datetime.time() <= time_end:
        # insert code to remove entry from dataset
        pass
2 Answers
9lowa7mx1#
You can use the native Spark function date_format: given the "HH:mm:ss" pattern, it extracts just the time of day from a timestamp. Input:

Script for trips outside working hours:

Script for trips inside working hours:
ifmq2ha22#
Spark stores timestamps internally as an offset from the Unix epoch, so you can cast the timestamp to a long (epoch seconds) and check the time of day from that. Try this with spark.sql: