What should be the minimum size of a valid snappy-compressed ORC file in HDFS?

uqcuzwp8 posted on 2022-12-09 in HDFS

The scenario I am dealing with: every hour, a Spark Streaming application generates about 10k ORC files in HDFS, and at the end of the hour a Spark merge job merges those small files into bigger chunks and writes them to the Hive landing path for an external table to pick up. Sometimes a corrupt ORC file makes the merge job fail. The task is to find the corrupt ORC files, move them to a badRecordsPath, and only then let the Spark merge job start. After going through the theory of the ORC format, it seems a valid ORC file ends with the string "ORC" followed by one more byte. How do I check that in an optimised way, so that validating those 10k ORC files doesn't take much time? I thought of writing a bash shell script, but it seems to take a good amount of time to validate ORC files on HDFS. My idea is to narrow down the validation if I know the minimum size of a valid ORC file, because most of our corrupt files are very tiny (mostly 3 bytes). Any suggestion would be very helpful.
PS: I can't use set spark.sql.files.ignoreCorruptFiles=true because I have to track the files and move them to the bad records path.
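For the footer check mentioned above, the tail of each file can be inspected with the Hadoop FileSystem API instead of a shell script, so only a few bytes are read per file. A minimal sketch (Scala; OrcQuickCheck and looksLikeValidOrc are illustrative names, and the tail layout is assumed to be the 3-byte "ORC" magic followed by the one-byte postscript length, as described above):

import org.apache.hadoop.fs.{FileSystem, Path}

object OrcQuickCheck {
  private val Magic = "ORC".getBytes("US-ASCII")

  // Cheap validity check: a seek-and-read of the last 4 bytes per file.
  def looksLikeValidOrc(fs: FileSystem, file: Path): Boolean = {
    val len = fs.getFileStatus(file).getLen
    // Anything this small cannot hold header + footer + postscript;
    // the corrupt files described above are ~3 bytes, so a length check filters most of them.
    if (len < 4) return false
    val in = fs.open(file)
    try {
      val tail = new Array[Byte](4)
      in.readFully(len - 4, tail)
      // Last byte is the postscript length; the 3 bytes before it should be the "ORC" magic.
      tail(0) == Magic(0) && tail(1) == Magic(1) && tail(2) == Magic(2)
    } finally {
      in.close()
    }
  }
}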


kiz8lqtg1#

Found a solution: we can use set spark.sql.files.ignoreCorruptFiles=true and then track the ignored files with the method below:

import org.apache.hadoop.fs.Path
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.input_file_name

// listOfCompleteFiles: the full list of input paths gathered before the read
// (e.g. via FileSystem.listStatus), so corrupt files are still present in it.
def trackIgnoreCorruptFiles(df: DataFrame,
                            listOfCompleteFiles: List[Path]): List[Path] = {
  // Files Spark actually managed to read; corrupt ones were silently skipped.
  val listOfFileAfterIgnore = df.withColumn("file_name", input_file_name)
    .select("file_name")
    .distinct()
    .collect()
    .map(row => new Path(row.getString(0)))
    .toList

  // Whatever is in the complete list but not in the read list was ignored as corrupt.
  listOfCompleteFiles.diff(listOfFileAfterIgnore)
}

input_file_name is a built-in Spark function that returns the full path of the file each row was read from; we capture it as a column of the DataFrame df. Collecting that column gives the list of files that remain after Spark has ignored the corrupt ones, and diffing it against the complete list of input files gives the actual list of files Spark ignored. You can then easily move those files to the badRecordsPath for future analysis.
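A hypothetical end-to-end usage of the method above (inputDir, badRecordsPath and the hour partition are made-up placeholders, not part of the original answer):

import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().getOrCreate()
spark.conf.set("spark.sql.files.ignoreCorruptFiles", "true")

val fs = FileSystem.get(spark.sparkContext.hadoopConfiguration)
val inputDir = new Path("/landing/orc/hour=2022120900")

// Complete list of files before the read; listStatus returns fully-qualified
// paths, which is what input_file_name also produces, so the diff can match.
val listOfCompleteFiles = fs.listStatus(inputDir).map(_.getPath).toList

val df = spark.read.orc(inputDir.toString)
val ignoredFiles = trackIgnoreCorruptFiles(df, listOfCompleteFiles)

// Move the ignored (corrupt) files aside for later analysis.
val badRecordsPath = new Path("/landing/badrecords/hour=2022120900")
fs.mkdirs(badRecordsPath)
ignoredFiles.foreach(f => fs.rename(f, new Path(badRecordsPath, f.getName)))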
