sqoop将postgres导入s3失败

2izufjch  于 2021-06-03  发布在  Sqoop
关注(0)|答案(5)|浏览(515)

我正在将postgres数据导入hdfs。我计划将存储从hdfs移到s3。当我试图提供s3位置时,sqoop作业失败了。我在emr(emr-5.27.0)集群上运行它,并且我已经从集群中的所有节点对s3存储桶进行了读/写访问。

sqoop import \
--connect "jdbc:postgresql://<machine_ip>:<port>/<database>?sslfactory=org.postgresql.ssl.NonValidatingFactory&ssl=true" \
--username <username> \
--password-file <password_file_path> \
--table addresses \
--target-dir s3://my-bucket/data/temp \
--num-mappers 100 \
--split-by id \
--as-parquetfile

例外是,

Warning: /usr/lib/sqoop/../accumulo does not exist! Accumulo imports will fail.
Please set $ACCUMULO_HOME to the root of your Accumulo installation.
19/10/21 09:27:32 INFO sqoop.Sqoop: Running Sqoop version: 1.4.7
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/usr/lib/hadoop/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/usr/lib/hive/lib/log4j-slf4j-impl-2.6.2.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
19/10/21 09:27:33 INFO manager.SqlManager: Using default fetchSize of 1000
19/10/21 09:27:33 INFO tool.CodeGenTool: Beginning code generation
19/10/21 09:27:33 INFO tool.CodeGenTool: Will generate java class as codegen_addresses
19/10/21 09:27:34 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM "addresses" AS t LIMIT 1
19/10/21 09:27:34 INFO orm.CompilationManager: HADOOP_MAPRED_HOME is /usr/lib/hadoop-mapreduce
Note: /tmp/sqoop-hadoop/compile/412c4a70c10c6569443f4c38dbdc2c99/codegen_addresses.java uses or overrides a deprecated API.
Note: Recompile with -Xlint:deprecation for details.
19/10/21 09:27:37 INFO orm.CompilationManager: Writing jar file: /tmp/sqoop-hadoop/compile/412c4a70c10c6569443f4c38dbdc2c99/codegen_addresses.jar
19/10/21 09:27:37 WARN manager.PostgresqlManager: It looks like you are importing from postgresql.
19/10/21 09:27:37 WARN manager.PostgresqlManager: This transfer can be faster! Use the --direct
19/10/21 09:27:37 WARN manager.PostgresqlManager: option to exercise a postgresql-specific fast path.
19/10/21 09:27:37 INFO mapreduce.ImportJobBase: Beginning import of addresses
19/10/21 09:27:37 INFO Configuration.deprecation: mapred.jar is deprecated. Instead, use mapreduce.job.jar
19/10/21 09:27:39 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM "addresses" AS t LIMIT 1
19/10/21 09:27:39 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM "addresses" AS t LIMIT 1
19/10/21 09:27:39 INFO conf.HiveConf: Found configuration file file:/etc/hive/conf.dist/hive-site.xml
19/10/21 09:27:39 ERROR sqoop.Sqoop: Got exception running Sqoop: org.kitesdk.data.DatasetNotFoundException: Unknown dataset URI pattern: dataset:s3://<bucket>/<data>/temp
Check that JARs for s3 datasets are on the classpath
org.kitesdk.data.DatasetNotFoundException: Unknown dataset URI pattern: dataset:s3://<bucket>/<data>/temp
Check that JARs for s3 datasets are on the classpath
    at org.kitesdk.data.spi.Registration.lookupDatasetUri(Registration.java:128)
    at org.kitesdk.data.Datasets.exists(Datasets.java:624)
    at org.kitesdk.data.Datasets.exists(Datasets.java:646)
    at org.apache.sqoop.mapreduce.ParquetJob.configureImportJob(ParquetJob.java:118)
    at org.apache.sqoop.mapreduce.DataDrivenImportJob.configureMapper(DataDrivenImportJob.java:132)
    at org.apache.sqoop.mapreduce.ImportJobBase.runImport(ImportJobBase.java:264)
    at org.apache.sqoop.manager.SqlManager.importTable(SqlManager.java:692)
    at org.apache.sqoop.manager.PostgresqlManager.importTable(PostgresqlManager.java:127)
    at org.apache.sqoop.tool.ImportTool.importTable(ImportTool.java:520)
    at org.apache.sqoop.tool.ImportTool.run(ImportTool.java:628)
    at org.apache.sqoop.Sqoop.run(Sqoop.java:147)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:76)
    at org.apache.sqoop.Sqoop.runSqoop(Sqoop.java:183)
    at org.apache.sqoop.Sqoop.runTool(Sqoop.java:234)
    at org.apache.sqoop.Sqoop.runTool(Sqoop.java:243)
    at org.apache.sqoop.Sqoop.main(Sqoop.java:252)

注意:与hdfs target dir相同的sqoop命令正在工作。我还可以从集群节点手动写入s3 bucket(使用awss3命令)。

7gcisfzg

7gcisfzg1#

您需要将目标目录协议从 s3s3a :

sqoop import \
--connect "jdbc:postgresql://<machine_ip>:<port>/<database>?sslfactory=org.postgresql.ssl.NonValidatingFactory&ssl=true" \
--username <username> \
--password-file <password_file_path> \
--table addresses \
--target-dir s3a://my-bucket/data/temp \
--num-mappers 100 \
--split-by id \
--as-parquetfile
dgjrabp2

dgjrabp22#

Parquet有两种方法-
使用 --as-parquetfile 使用 HCatalog 但这两种方式,它不可能sqoop直接在EMR5.x的Parquet地板
两种方法都有问题-
sqoop使用kite-sdk来读/写parquet,它有一些局限性。而且不可能使用 --as-parquetfile . 根据aws支持人员的指示,emr将在将来删除kite-sdk
支撑Parquet地板 HCatalog 已为hive(v2.4.0、v2.3.7)jira卡和hive(v3.0.0)jira卡添加。但是emr5.x使用的是hive版本2.3.5。
到目前为止,在emr(v5.x)中有哪些解决方法:
使用中间体 text 表来提取数据。使用单独的配置单元查询将数据从文本复制到所需的parquet表。

yx2lnoni

yx2lnoni3#

请尝试执行以下指定的sqoop命令:

sqoop import \
--connect "jdbc:postgresql://<machine_ip>:<port>/<database>?sslfactory=org.postgresql.ssl.NonValidatingFactory&ssl=true" \
--username <username> \
--password-file <password_file_path> \
--num-mappers 100 \
--split-by id \
--table addresses \
--as-parquetfile \
--target-dir s3://my-bucket/data/temp

确保目标目录在s3中不存在

7gcisfzg

7gcisfzg4#

kite sdk已升级。您所要做的就是将新的sdk下载到emr中,然后再次运行sqoop命令。
使用wget下载kite-data-s3-1.1.0.jar

wget https://repo1.maven.org/maven2/org/kitesdk/kite-data-s3/1.1.0/kite-data-s3-1.1.0.jar

将jar移动到sqoop库目录(/usr/lib/sqoop/lib/)
sudo cp kite-data-s3-1.1.0.jar/usr/lib/sqoop/lib/
授予jar权限
sudo chmod 755风筝-data-s3-1.1.0.jar
使用s3n连接器导入jar
sqoop导入--连接“jdbc:postgresql://:/?sslfactory=org.postgresql.ssl.nonvalidatingfactory&ssl=true“--用户名--密码文件--表地址--目标目录s3n://my bucket/data/temp --num mappers 100 --按id拆分--作为parquetfile
字体:https://aws.amazon.com/premiumsupport/knowledge-center/unknown-dataset-uri-pattern-sqoop-emr/

ef1yzkbh

ef1yzkbh5#

@makubex,我在添加s3a作为uri模式后可以导入,
但是进口工作花费的时间太长了。
我使用的是emr 5.26.0。我需要做任何配置更改来提高时间吗?

相关问题