Is it possible to write a Sqoop incremental import with filters on the new data before importing?

Posted by kyxcudwk on 2021-05-29 in Hadoop


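My question is this: say I have a file a1.csv with 2,000 records loaded into a SQL Server table, and I import that data into HDFS. Later the same day, 3,000 more records from the same file are added to the SQL Server table. I now want to run an incremental import to add this second chunk of data to HDFS, but I do not want to import all 3,000 records. I only need part of them, for example 1,000 records that meet a specific condition, imported as part of the incremental import.
Is there a way to achieve this with the sqoop incremental import command?
Please help, thanks.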

hadoop hdfs sqoop merge

Source: https://stackoverflow.com/questions/48556141/is-it-possible-to-write-a-sqoop-incremental-import-with-filters-on-the-new-file

1 answer


ttcibm8c1#

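You need a unique key or a timestamp column to identify the delta, which in your case is the 1,000 new records. With such a column in place, you have two options for bringing the data into Hadoop.
Option 1
Use Sqoop's incremental append. Here is an example: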

sqoop import \
--connect jdbc:oracle:thin:@enkx3-scan:1521:dbm2 \
--username wzhou \
--password wzhou \
--table STUDENT \
--incremental append \
--check-column student_id \
-m 4 \
--split-by major

Arguments:

--check-column (col)   # Specifies the column to be examined when determining which rows to import.

--incremental (mode)   # Specifies how Sqoop determines which rows are new. Legal values for mode include append and lastmodified.

--last-value (value)   # Specifies the maximum value of the check column from the previous import.
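The Option 1 example above shows neither --last-value nor a filter. A minimal sketch of how the two might be combined with Sqoop's --where argument follows. The connection string, user, password file path, table name A1, and the columns id and status are all hypothetical placeholders, and the run helper is only there so the command can be previewed without a Sqoop installation. It assumes id is the increasing key and that the first import ended at id 2000.

```shell
# Dry-run helper (hypothetical, not part of Sqoop): prints the command
# unless DRY_RUN=0 is set, in which case it executes it.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

# Assumed values: highest id already imported, and the extra row filter.
LAST_VALUE=2000
FILTER="status = 'ACTIVE'"

# Incremental append restricted to new rows that also match FILTER.
# Every name below (host, database, user, paths, table, columns) is a placeholder.
run sqoop import \
  --connect "jdbc:sqlserver://dbhost:1433;databaseName=demo" \
  --username etl_user \
  --password-file /user/etl/.sqoop.pwd \
  --table A1 \
  --target-dir /user/etl/a1 \
  --incremental append \
  --check-column id \
  --last-value "$LAST_VALUE" \
  --where "$FILTER" \
  -m 1
```

With this setup Sqoop should import only rows whose id is greater than the last value and that also satisfy the filter. Rows skipped by the filter are not picked up by later incremental runs, since the next run starts from the new maximum of the check column.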

Option 2
Use the --query argument, which lets you write native SQL for MySQL or whichever database you are connecting to.
Examples:

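# Parallel import: --split-by names the column used to partition the query
sqoop import \
  --query 'SELECT a.*, b.* FROM a JOIN b on (a.id = b.id) WHERE $CONDITIONS' \
  --split-by a.id --target-dir /user/foo/joinresults

# Sequential import: a single mapper (-m 1), so no --split-by is needed
sqoop import \
  --query 'SELECT a.*, b.* FROM a JOIN b on (a.id = b.id) WHERE $CONDITIONS' \
  -m 1 --target-dir /user/foo/joinresults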
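Applied to the question, Option 2 could put both the delta condition and the filter directly into the query. The sketch below is one way to do that, not a definitive recipe: the table a1, the columns id and status, the connection details, and the run helper are all hypothetical, and the last imported id is assumed to be tracked by you rather than by Sqoop. The --append flag is used so the results are added to an existing target directory.

```shell
# Dry-run helper (hypothetical, not part of Sqoop): prints the command
# unless DRY_RUN=0 is set, in which case it executes it.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

# Assumed value: highest id already present in HDFS from the first import.
LAST_ID=2000

# Inside double quotes, $CONDITIONS must be escaped so the shell leaves it
# for Sqoop to replace with its own split predicates.
QUERY="SELECT * FROM a1 WHERE id > ${LAST_ID} AND status = 'ACTIVE' AND \$CONDITIONS"

# Free-form query import appended to an existing HDFS directory.
# Host, database, user, and paths are placeholders.
run sqoop import \
  --connect "jdbc:sqlserver://dbhost:1433;databaseName=demo" \
  --username etl_user \
  --password-file /user/etl/.sqoop.pwd \
  --query "$QUERY" \
  --split-by id \
  --target-dir /user/etl/a1 \
  --append \
  -m 4
```

The trade-off compared with Option 1 is that you have to record the last imported id yourself between runs, but in return the filter can be any SQL the source database accepts, including joins.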
