The following error appears when the query is executed:
FileReadException: Error while reading file adl://lpdatalakepro.azuredatalakestore.net/pr-nn/squad7/cz/fact_table/date_key=2020-11-17/is_lidl_plus=1/source=tickets/subtype=trx/part-00260-tid-6476907462999503725-d1ead00d-a0c5-47da-bfa7-d1a1d0c5c87a-19515-20.c000.snappy.orc. It is possible the underlying files have been updated. You can explicitly invalidate the cache in Spark by running 'REFRESH TABLE tableName' command in SQL or by recreating the Dataset/DataFrame involved.
Caused by: FileNotFoundException: File/Folder does not exist: /pr-nn/squad7/cz/fact_table/date_key=2020-11-17/is_lidl_plus=1/source=tickets/subtype=trx/part-00260-tid-6476907462999503725-d1ead00d-a0c5-47da-bfa7-d1a1d0c5c87a-19515-20.c000.snappy.orc [af36ffd1-a779-47af-9770-087ece32e8e4][2021-02-18T07:21:18.9648845-08:00] [ServerRequestId:af36ffd1-a779-47af-9770-087ece32e8e4]
org.apache.spark.SparkException: Job aborted due to stage failure: Task 2 in stage 1015.0 failed 4 times, most recent failure: Lost task 2.3 in stage 1015.0 (TID 12611, 10.32.14.34, executor 4): com.databricks.sql.io.FileReadException: Error while reading file adl://lpdatalakepro.azuredatalakestore.net/pr-nn/squad7/cz/fact_table/date_key=2020-11-17/is_lidl_plus=1/source=tickets/subtype=trx/part-00260-tid-6476907462999503725-d1ead00d-a0c5-47da-bfa7-d1a1d0c5c87a-19515-20.c000.snappy.orc. It is possible the underlying files have been updated. You can explicitly invalidate the cache in Spark by running 'REFRESH TABLE tableName' command in SQL or by recreating the Dataset/DataFrame invo
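For reference, the cache invalidation that the error message itself suggests would look roughly like this in the notebook (a minimal sketch; the table name is taken from the query shown below, and clearCache() is just the broader alternative of recreating the DataFrame):

# Invalidate Spark's cached file listing for the table, as the error suggests
spark.sql(f"REFRESH TABLE nn_squad7_{country}.fact_table")

# Or clear all cached metadata/data and rebuild the DataFrame from scratch
spark.catalog.clearCache()
fact_table = spark.table(f'nn_squad7_{country}.fact_table')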
As you can imagine, the expected output is simply the result of the query. Here is the query being used:
from pyspark.sql import functions as f  # needed for f.col, f.concat, etc.

# Analysis window
last_30d_start = '2020-11-07'
last_30d_end = '2020-12-06'

# Generous timeout (in seconds) for broadcast joins
spark.conf.set("spark.sql.broadcastTimeout", 36000)

# `country` and `tickets_with_promo` are defined earlier in the notebook
last30d = (spark.table(f'nn_squad7_{country}.fact_table')
           .filter(f.col('date_key').between(last_30d_start, last_30d_end))
           .filter(f.col('source') == 'tickets')
           .filter(f.col('is_lidl_plus') == 1)
           .filter(f.col('subtype') == 'trx')
           .filter(f.col('is_trx_contable') == 1)
           .filter(f.col('trx_detail_total_amount') > 0)
           # net amount = gross amount minus VAT
           .withColumn('trx_amount_net', f.col('trx_detail_total_amount') - f.col('trx_detail_vat_amount'))
           # week label such as "2020-47"
           .withColumn('week', f.concat(f.year('date_key'), f.lit("-"), f.weekofyear('date_key')))
           .select('ticket_id', 'week', 'trx_amount_net', 'customer_id').distinct()
           .join(tickets_with_promo, 'ticket_id', 'left')
           # .join(segment, 'customer_id', 'inner')
           .groupby('promo')
           .agg(f.countDistinct('ticket_id').alias('tickets'),
                f.countDistinct('customer_id').alias('users'),
                f.sum('trx_amount_net').alias('trx_amount_net'))
           .withColumn('avg_basket', f.col('trx_amount_net') / f.col('tickets'))
           )
display(last30d)
I have checked, and I haven't changed anything in the data lake, nor has the notebook changed at all. I don't know why this error is appearing. Any ideas? Thanks!