The scenario I am dealing with: every hour, a Spark Streaming application generates around 10k ORC files in HDFS. At the end of the hour, a Spark merge job combines those small files into bigger chunks and writes them to the Hive landing path for an external table to pick up. Sometimes a corrupt ORC file makes the merge job fail.

The task is to find the corrupt ORC files, move them to a bad-records path, and only then let the Spark merge job start. After going through the theory of the ORC format, it seems a valid ORC file ends with the string "ORC" followed by one more byte. How do I check that in an optimised way, so that validating those 10k ORC files doesn't take much time? I thought of writing a bash shell script, but it seems to take a good amount of time to validate ORC files in HDFS.

My idea is to narrow down the validation if I know the minimum size of a valid ORC file, because most of our corrupt files are very tiny (mostly 3 bytes). Any suggestions would be very helpful.
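Here is roughly the tail check I have in mind, sketched against the Hadoop FileSystem API (the directory path and the minimum-size cut-off below are just placeholders):

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

// A file only counts as "possibly valid ORC" if it is at least minSize bytes
// and its last 4 bytes are the magic "ORC" plus the 1-byte postscript length.
// minSize is a guess; tune it to the smallest valid file your job produces.
def looksLikeOrc(fs: FileSystem, path: Path, minSize: Long = 64L): Boolean = {
  val len = fs.getFileStatus(path).getLen
  if (len < math.max(minSize, 4L)) return false   // catches the 3-byte corrupt files
  val in = fs.open(path)
  try {
    val tail = new Array[Byte](4)
    in.readFully(len - 4, tail)                   // positioned read of the tail only
    tail(0) == 'O'.toByte && tail(1) == 'R'.toByte && tail(2) == 'C'.toByte
  } finally in.close()
}

// Scan one hour's directory and collect the suspects.
val fs = FileSystem.get(new Configuration())
val hourDir  = new Path("hdfs:///data/streaming/orc/2023-01-01-00")   // placeholder path
val suspects = fs.listStatus(hourDir).map(_.getPath).filterNot(looksLikeOrc(fs, _))
```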
PS: I can't simply set spark.sql.files.ignoreCorruptFiles=true, because I have to track the corrupt files and move them to the bad records path.
1 Answer
kiz8lqtg1#
Found a solution. We can use set spark.sql.files.ignoreCorruptFiles=true and then track the files that were ignored with the approach below:
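A minimal sketch of that approach, assuming all of one hour's files sit in a single HDFS source directory (the paths and the names df and diff below are illustrative, not from the original job):

```scala
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.input_file_name

val spark = SparkSession.builder().getOrCreate()
spark.conf.set("spark.sql.files.ignoreCorruptFiles", "true")

// Assumed layout for one hour of data; adjust to your own paths.
val sourceDir      = "hdfs:///data/streaming/orc/2023-01-01-00"
val badRecordsPath = new Path("hdfs:///data/badrecords/2023-01-01-00")

val fs = FileSystem.get(spark.sparkContext.hadoopConfiguration)

// Every file the streaming job wrote for this hour.
val allFiles = fs.listStatus(new Path(sourceDir)).map(_.getPath.toString).toSet

// Files Spark could actually read; corrupt ones are silently skipped
// because ignoreCorruptFiles is enabled.
val df = spark.read.orc(sourceDir)
val readFiles = df.select(input_file_name().as("path"))
  .distinct()
  .collect()
  .map(_.getString(0))
  .toSet

// diff = the files Spark ignored, i.e. the corrupt ones.
// (Both sides are fully qualified URIs; normalise them if your cluster
// reports them in different forms.)
val diff = allFiles -- readFiles

// Move the corrupt files aside before kicking off the merge job.
fs.mkdirs(badRecordsPath)
diff.foreach { f =>
  val src = new Path(f)
  fs.rename(src, new Path(badRecordsPath, src.getName))
}
```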
***input_file_name*** is a built-in Spark UDF that returns the full path of the file, which we get as a column in the DataFrame df. This gives us the list of paths of the files that were left after Spark ignored the corrupt ones; the list diff then contains the actual files that Spark ignored. You can then easily move that list of files to the badRecordsPath for later analysis.