We are trying to export data from HDFS to MySQL using Sqoop and are facing the following problem.
Sample data:
4564,38,153,2013-05-30 10:40:42.767,false,No credentials attempted,,,00 00 00 00 01 64 e6 a6
4565,38,160,2013-05-30 10:40:42.767,false,No credentials attempted,,,00 00 00 00 01 64 e6 a7
4566,38,80,2013-03-07 12:16:26.03,false,No SSH or Telnet credentials available. If an HTTP(S) exists for this asset, it was not able to authenticate.,,,00 00 00 00 01 0f c7 e6
Following the Sqoop workflow, we export the data from HDFS to MySQL with the program below; the schema is already defined on the target table:
public static void main(String[] args) {
    // Export the HDFS directory into the existing MySQL table "status".
    String[] str = { "export",
            "--connect", "jdbc:mysql://-------/test",
            "--table", "status",
            "--username", "root", "--password", "******",
            "--export-dir", "hdfs://-----/user/hdfs/InventoryCategoryStatus/",
            "--input-fields-terminated-by", ",",
            "--input-lines-terminated-by", "\n"
    };
    Sqoop.runTool(str);
}
The program fails during execution:
[exec:exec]
0 [main] WARN org.apache.sqoop.tool.SqoopTool - $SQOOP_CONF_DIR has not been set in the environment. Cannot check for additional configuration.
123 [main] WARN org.apache.sqoop.tool.BaseSqoopTool - Setting your password on the command-line is insecure. Consider using -P instead.
130 [main] WARN org.apache.sqoop.ConnFactory - $SQOOP_CONF_DIR has not been set in the environment. Cannot check for additional configuration.
Note: /tmp/sqoop-manish/compile/fd0060344195ec9b06030b84cdf6e243/status.java uses or overrides a deprecated API.
Note: Recompile with -Xlint:deprecation for details.
9516 [main] WARN org.apache.hadoop.util.NativeCodeLoader - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
11166 [main] WARN org.apache.hadoop.conf.Configuration - mapred.jar is deprecated. Instead, use mapreduce.job.jar
16598 [main] WARN org.apache.hadoop.conf.Configuration - mapred.reduce.tasks.speculative.execution is deprecated. Instead, use mapreduce.reduce.speculative
16612 [main] WARN org.apache.hadoop.conf.Configuration - mapred.map.tasks.speculative.execution is deprecated. Instead, use mapreduce.map.speculative
16614 [main] WARN org.apache.hadoop.conf.Configuration - mapred.map.tasks is deprecated. Instead, use mapreduce.job.maps
16618 [main] WARN org.apache.sqoop.mapreduce.JobBase - SQOOP_HOME is unset. May not be able to find all job dependencies.
17074 [main] WARN org.apache.hadoop.conf.Configuration - session.id is deprecated. Instead, use dfs.metrics.session-id
17953 [main] WARN org.apache.hadoop.conf.Configuration - mapred.job.classpath.files is deprecated. Instead, use mapreduce.job.classpath.files
17956 [main] WARN org.apache.hadoop.conf.Configuration - mapred.cache.files is deprecated. Instead, use mapreduce.job.cache.files
17957 [main] WARN org.apache.hadoop.conf.Configuration - mapred.reduce.tasks is deprecated. Instead, use mapreduce.job.reduces
17958 [main] WARN org.apache.hadoop.conf.Configuration - mapred.mapoutput.value.class is deprecated. Instead, use mapreduce.map.output.value.class
17959 [main] WARN org.apache.hadoop.conf.Configuration - mapreduce.map.class is deprecated. Instead, use mapreduce.job.map.class
17959 [main] WARN org.apache.hadoop.conf.Configuration - mapred.job.name is deprecated. Instead, use mapreduce.job.name
17959 [main] WARN org.apache.hadoop.conf.Configuration - mapreduce.inputformat.class is deprecated. Instead, use mapreduce.job.inputformat.class
17960 [main] WARN org.apache.hadoop.conf.Configuration - mapred.input.dir is deprecated. Instead, use mapreduce.input.fileinputformat.inputdir
17960 [main] WARN org.apache.hadoop.conf.Configuration - mapreduce.outputformat.class is deprecated. Instead, use mapreduce.job.outputformat.class
17960 [main] WARN org.apache.hadoop.conf.Configuration - mapred.cache.files.timestamps is deprecated. Instead, use mapreduce.job.cache.files.timestamps
17961 [main] WARN org.apache.hadoop.conf.Configuration - mapred.mapoutput.key.class is deprecated. Instead, use mapreduce.map.output.key.class
17961 [main] WARN org.apache.hadoop.conf.Configuration - mapred.working.dir is deprecated. Instead, use mapreduce.job.working.dir
19283 [main] WARN org.apache.hadoop.mapred.LocalDistributedCacheManager - LocalJobRunner does not support symlinking into current working dir.
19312 [main] WARN org.apache.hadoop.conf.Configuration - mapred.cache.localFiles is deprecated. Instead, use mapreduce.job.cache.local.files
20963 [Thread-29] WARN org.apache.hadoop.mapred.LocalJobRunner - job_local_0001
java.lang.Exception: java.lang.NumberFormatException: For input string: " it was not able to authenticate."
at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:400)
Caused by: java.lang.NumberFormatException: For input string: " it was not able to authenticate."
at java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
at java.lang.Integer.parseInt(Integer.java:481)
at java.lang.Integer.valueOf(Integer.java:582)
at status.__loadFromFields(status.java:412)
at status.parse(status.java:334)
at org.apache.sqoop.mapreduce.TextExportMapper.map(TextExportMapper.java:77)
at org.apache.sqoop.mapreduce.TextExportMapper.map(TextExportMapper.java:36)
at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
at org.apache.sqoop.mapreduce.AutoProgressMapper.run(AutoProgressMapper.java:183)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:756)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:338)
at org.apache.hadoop.mapred.LocalJobRunner$Job$MapTaskRunnable.run(LocalJobRunner.java:232)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334)
at java.util.concurrent.FutureTask.run(FutureTask.java:166)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:724)
21692 [main] WARN mapreduce.Counters - Group FileSystemCounters is deprecated. Use org.apache.hadoop.mapreduce.FileSystemCounter instead
21698 [main] WARN mapreduce.Counters - Group org.apache.hadoop.mapred.Task$Counter is deprecated. Use org.apache.hadoop.mapreduce.TaskCounter instead
21699 [main] ERROR org.apache.sqoop.tool.ExportTool - Error during export: Export job failed!
------------------------------------------------------------------------
BUILD SUCCESS
------------------------------------------------------------------------
Total time: 30.419s
Finished at: Fri Aug 23 15:28:03 IST 2013
Final Memory: 14M/113M
Afterwards we checked the MySQL table: it contained only 100 of the 1600 records. We ran the same program against other tables, and only 6800 of 8000 records, and 235202 of 376927 records, were exported to MySQL. Could anyone please offer some advice on the execution error above?
Looking forward to a reply, and many thanks for your help.
2 Answers
u3r8eeie1#
It looks like you have the string "it was not able to authenticate." where a number is expected (as can be seen in the trace you shared). Please check the source data being pushed to the database.
Edit
Use a different character as the delimiter. When writing the data to HDFS (I assume an MR program is producing it), use a rare character such as ^A, #, or @ as the delimiter.
The export command also offers options such as --input-enclosed-by and --input-escaped-by, but your data has to be prepared accordingly. The simplest option seems to be to pick a delimiter that is extremely unlikely to appear in the data, as in the sketch below.
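A minimal sketch of that idea, assuming the HDFS files have first been re-written with Ctrl-A (\001) as the field delimiter (the connection strings are the placeholders from the question):

public static void main(String[] args) {
    // Same export as in the question, but the input now uses Ctrl-A
    // (\001) as the field delimiter, so commas inside the message
    // column no longer split a record into extra fields.
    String[] str = { "export",
            "--connect", "jdbc:mysql://-------/test",
            "--table", "status",
            "--username", "root", "--password", "******",
            "--export-dir", "hdfs://-----/user/hdfs/InventoryCategoryStatus/",
            "--input-fields-terminated-by", "\001", // Ctrl-A instead of ","
            "--input-lines-terminated-by", "\n"
    };
    Sqoop.runTool(str);
}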
Edit 2: In a case like this no tool can help, because the delimiter sits between data fields with no escape character and no enclosing of the strings (e.g. "hello, how are you"). The data has to be controlled at the time it is stored. So when ingesting via Flume, you should either use a delimiter other than ",", escape the "," (e.g. hello\, how are you), or enclose every field ("hello, how are you").
You should therefore apply this while extracting and storing the data through Flume. It is worth investigating whether Flume offers a way to achieve this.
Alternatively, you could write an MR program that cleans or filters out the problem records (to be handled separately), or load the data into a staging table in MySQL and write a stored procedure that handles the problem-record cases and inserts into the target table. A rough sketch of the MR approach follows.
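A hypothetical cleanup mapper, purely for illustration; the class name and column layout are assumptions based on the sample data (nine logical columns, with embedded commas possible only in the sixth, free-text column). It re-joins the message and re-emits each record with Ctrl-A as the delimiter:

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Re-emits each comma-delimited record with Ctrl-A (\001) as the field
// delimiter, restoring any commas that belong to the 6th (message) column.
public class CleanRecordsMapper
        extends Mapper<LongWritable, Text, Text, NullWritable> {

    private static final int COLS = 9;  // expected logical columns

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String[] parts = value.toString().split(",", -1);
        int extra = parts.length - COLS;  // commas inside the message
        if (extra < 0) {
            return;  // too few fields: drop (or route to a side output)
        }
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < 5; i++) {                     // columns 1-5
            out.append(parts[i]).append('\001');
        }
        StringBuilder msg = new StringBuilder(parts[5]);  // column 6
        for (int i = 6; i <= 5 + extra; i++) {
            msg.append(',').append(parts[i]);             // restore commas
        }
        out.append(msg);
        for (int i = 6 + extra; i < parts.length; i++) {  // columns 7-9
            out.append('\001').append(parts[i]);
        }
        context.write(new Text(out.toString()), NullWritable.get());
    }
}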
bpzcxfmw2#
Looking at your sample, it seems you are using the comma as the column (field) separator, yet you allow commas inside the data itself. Note the third line of the sample data:
Column 6 (the "No SSH…" message) contains a comma. As a result, Sqoop splits this column into two different columns, hence the exception. I suggest you clean up the data. If you used Sqoop to import it into HDFS in the first place, you can work around the problem with the --enclosed-by or --escaped-by arguments.
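For illustration, a sketch of such an import, reusing the placeholder connection details from the question; --enclosed-by and --escaped-by are standard Sqoop import options, but the exact characters to use depend on your data:

public static void main(String[] args) {
    // Import into HDFS with every field enclosed in double quotes, so
    // commas inside the message column survive the round trip.
    String[] str = { "import",
            "--connect", "jdbc:mysql://-------/test",
            "--table", "status",
            "--username", "root", "--password", "******",
            "--target-dir", "hdfs://-----/user/hdfs/InventoryCategoryStatus/",
            "--fields-terminated-by", ",",
            "--enclosed-by", "\"",  // quote every field
            "--escaped-by", "\\"    // escape any embedded quotes
    };
    Sqoop.runTool(str);
}

A later export of data written this way would then need the matching --input-enclosed-by and --input-escaped-by arguments.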