The following error appears when the query is executed:
FileReadException: Error while reading file adl://lpdatalakepro.azuredatalakestore.net/pr-nn/squad7/cz/fact_table/date_key=2020-11-17/is_lidl_plus=1/source=tickets/subtype=trx/part-00260-tid-6476907462999503725-d1ead00d-a0c5-47da-bfa7-d1a1d0c5c87a-19515-20.c000.snappy.orc. It is possible the underlying files have been updated. You can explicitly invalidate the cache in Spark by running 'REFRESH TABLE tableName' command in SQL or by recreating the Dataset/DataFrame involved.
Caused by: FileNotFoundException: File/Folder does not exist: /pr-nn/squad7/cz/fact_table/date_key=2020-11-17/is_lidl_plus=1/source=tickets/subtype=trx/part-00260-tid-6476907462999503725-d1ead00d-a0c5-47da-bfa7-d1a1d0c5c87a-19515-20.c000.snappy.orc [af36ffd1-a779-47af-9770-087ece32e8e4][2021-02-18T07:21:18.9648845-08:00] [ServerRequestId:af36ffd1-a779-47af-9770-087ece32e8e4]
org.apache.spark.SparkException: Job aborted due to stage failure: Task 2 in stage 1015.0 failed 4 times, most recent failure: Lost task 2.3 in stage 1015.0 (TID 12611, 10.32.14.34, executor 4): com.databricks.sql.io.FileReadException: Error while reading file adl://lpdatalakepro.azuredatalakestore.net/pr-nn/squad7/cz/fact_table/date_key=2020-11-17/is_lidl_plus=1/source=tickets/subtype=trx/part-00260-tid-6476907462999503725-d1ead00d-a0c5-47da-bfa7-d1a1d0c5c87a-19515-20.c000.snappy.orc. It is possible the underlying files have been updated. You can explicitly invalidate the cache in Spark by running 'REFRESH TABLE tableName' command in SQL or by recreating the Dataset/DataFrame invo
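For reference, the cache invalidation that the error message itself suggests would look roughly like this in the notebook (a minimal sketch; the table name is taken from the query shown below, and clearCache() is just the broader alternative of recreating the DataFrame):

# Invalidate Spark's cached file listing for the table, as the error suggests
spark.sql(f"REFRESH TABLE nn_squad7_{country}.fact_table")

# Or clear all cached metadata/data and rebuild the DataFrame from scratch
spark.catalog.clearCache()
fact_table = spark.table(f'nn_squad7_{country}.fact_table')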
As you can imagine, the expected output is simply the result of the query. Here is the query being used:
from pyspark.sql import functions as f  # needed for f.col, f.concat, etc.

# Analysis window
last_30d_start = '2020-11-07'
last_30d_end = '2020-12-06'

# Generous timeout (in seconds) for broadcast joins
spark.conf.set("spark.sql.broadcastTimeout", 36000)

# `country` and `tickets_with_promo` are defined earlier in the notebook
last30d = (spark.table(f'nn_squad7_{country}.fact_table')
           .filter(f.col('date_key').between(last_30d_start, last_30d_end))
           .filter(f.col('source') == 'tickets')
           .filter(f.col('is_lidl_plus') == 1)
           .filter(f.col('subtype') == 'trx')
           .filter(f.col('is_trx_contable') == 1)
           .filter(f.col('trx_detail_total_amount') > 0)
           # net amount = gross amount minus VAT
           .withColumn('trx_amount_net', f.col('trx_detail_total_amount') - f.col('trx_detail_vat_amount'))
           # week label such as "2020-47"
           .withColumn('week', f.concat(f.year('date_key'), f.lit("-"), f.weekofyear('date_key')))
           .select('ticket_id', 'week', 'trx_amount_net', 'customer_id').distinct()
           .join(tickets_with_promo, 'ticket_id', 'left')
           # .join(segment, 'customer_id', 'inner')
           .groupby('promo')
           .agg(f.countDistinct('ticket_id').alias('tickets'),
                f.countDistinct('customer_id').alias('users'),
                f.sum('trx_amount_net').alias('trx_amount_net'))
           .withColumn('avg_basket', f.col('trx_amount_net') / f.col('tickets'))
           )
display(last30d)
I have checked, and I haven't changed anything in the data lake, nor has the notebook changed at all. I don't know why this error is appearing. Any ideas? Thanks!