The use case is to load a local file into HDFS. Below are two approaches to doing this; please suggest which one is more efficient.
Approach 1: Using the HDFS put command
hadoop fs -put /local/filepath/file.parquet /user/table_nm/
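For reference, -put accepts several local sources in one invocation when the destination is a directory; a sketch with illustrative extra file names:

# Copy two local Parquet files into the table directory in one command,
# then list the directory to confirm they arrived.
hadoop fs -put /local/filepath/file1.parquet /local/filepath/file2.parquet /user/table_nm/
hadoop fs -ls /user/table_nm/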
Approach 2: Using Spark
spark.read.parquet("/local/filepath/file.parquet").createOrReplaceTempView("temp")
spark.sql("insert into table table_nm select * from temp")
Note:
- The source file can be in any format.
- No transformations are needed when loading the file.
- table_nm is a Hive external table pointing to /user/table_nm/.
1 Answer
ffx8fchx1:
Assuming they are already locally built .parquet files, using -put will be faster, since there is no overhead of starting a Spark application.
If there are many files, there is simply less work to do via put.
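For the many-files case, a single put with a shell glob keeps it to one command and one client JVM startup (paths illustrative):

# One invocation copies every matching local Parquet file into the table directory.
hadoop fs -put /local/filepath/*.parquet /user/table_nm/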