I'm using pyspark.
First I insert the data into an empty table; later I'll need to automate this process. Using pyspark, how can I invalidate or refresh the metadata so the data is read correctly in Impala?
Here is a sample of my code:
spark.sql("""
select
gps_data_adj.trip_duration
, gps_data_adj.geometry
, trip_summary.TRIP_HAVERSINE_DISTANCE
, trip_summary.TRIP_GPS_DURATION
, gps_data_adj.HAVERSINE_DISTANCE
, gps_data_adj.GPS_INTERVAL
, gps_data_adj.HAVERSINE_DISTANCE/trip_summary.TRIP_HAVERSINE_DISTANCE AS HAVERSINE_DISTANCE_FRACTION
, gps_data_adj.GPS_INTERVAL/trip_summary.TRIP_GPS_DURATION AS GPS_INTERVAL_FRACTION
, (gps_data_adj.HAVERSINE_DISTANCE/trip_summary.TRIP_HAVERSINE_DISTANCE)*gps_data_adj.trip_distance_travelled AS HAVERSINE_DISTANCE_ADJ
, (gps_data_adj.GPS_INTERVAL/trip_summary.TRIP_GPS_DURATION)*gps_data_adj.trip_duration AS GPS_INTERVAL_ADJ
FROM
gps_data_adj
INNER JOIN
(
SELECT
trip_id
, sum(COSINES_DISTANCE) as TRIP_COSINES_DISTANCE
, sum(HAVERSINE_DISTANCE) as TRIP_HAVERSINE_DISTANCE
, sum(GPS_INTERVAL) AS TRIP_GPS_DURATION
FROM
gps_data_adj
GROUP BY
trip_id
) trip_summary
on gps_data_adj.trip_id = trip_summary.trip_id
""").write.format('parquet').mode('append').insertInto('driving_data_TEST')
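One way to automate the Impala-side step after the pyspark insert is to issue a `REFRESH` statement (or `INVALIDATE METADATA` if the table was just created and Impala has never seen it) through Impala's DB-API. Below is a minimal sketch, assuming the `impyla` package and an Impala daemon reachable at `impala-host:21050` (the host and port are placeholders, not from the question):

```python
def refresh_statement(table: str) -> str:
    """Build the Impala statement that reloads a table's file metadata."""
    return f"REFRESH {table}"


def refresh_impala_table(table: str, host: str = "impala-host", port: int = 21050) -> None:
    # impyla's DB-API connection (pip install impyla). REFRESH is cheaper
    # than INVALIDATE METADATA because it only reloads the file/block
    # metadata of the one table, rather than discarding all cached metadata.
    from impala.dbapi import connect

    conn = connect(host=host, port=port)
    try:
        cursor = conn.cursor()
        cursor.execute(refresh_statement(table))
    finally:
        conn.close()
```

Calling `refresh_impala_table('driving_data_TEST')` right after the `insertInto` would then make the newly appended parquet files visible to Impala queries.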