Spark exception when writing a DataFrame to an external partitioned Hive table with Spark Java

fcy6dtqo · posted 2021-06-25 in Hive

I am running a Spark job on a multi-node cluster and trying to insert (append) a DataFrame into an external Hive table that is partitioned by two columns (dt and hr):

dataframe.write().insertInto(hiveTable);
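For context, here is a minimal sketch of how such a write is typically wired up in Spark Java. The SparkSession setup, the source table name, and the dynamic-partition settings below are assumptions for illustration, not details taken from the original job:

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SaveMode;
import org.apache.spark.sql.SparkSession;

public class HivePartitionedInsert {
    public static void main(String[] args) {
        // SparkSession with Hive support (assumed configuration)
        SparkSession spark = SparkSession.builder()
                .appName("HivePartitionedInsert")
                .enableHiveSupport()
                .getOrCreate();

        // Dynamic partitioning is typically required when appending into a
        // partitioned Hive table via insertInto
        spark.sql("SET hive.exec.dynamic.partition=true");
        spark.sql("SET hive.exec.dynamic.partition.mode=nonstrict");

        // insertInto resolves columns by position, not by name, so the
        // DataFrame should end with the partition columns (dt, hr) in table order.
        // "database.source_table" is a hypothetical source used only for illustration.
        Dataset<Row> dataframe = spark.table("database.source_table")
                .selectExpr("col1", "col2", "col3_json", "dt", "hr");

        dataframe.write()
                .mode(SaveMode.Append)
                .insertInto("database.hiveTable");

        spark.stop();
    }
}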

The Hive table is defined as follows:

CREATE EXTERNAL TABLE `database.hiveTable`(
  `col1` string,
  `col2` string,
  `col3_json` string
)
PARTITIONED BY (
  `dt` string,
  `hr` string)
ROW FORMAT SERDE
  'org.apache.hadoop.hive.ql.io.orc.OrcSerde'
STORED AS INPUTFORMAT
  'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'
OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat'
LOCATION '/data/hdfs/tmp/test';
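With this definition, Hive stores each partition as a dt=<value>/hr=<value> subdirectory under the table LOCATION. The partition values below (2021-06-25 and 00) are made up for illustration only:

/data/hdfs/tmp/test/dt=2021-06-25/hr=00/   (ORC data files)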

Note: the col3_json column holds JSON string data, for example:

{"group":[{"action":"Change","gid":"111","isId":"Y"},{"action":"Add","gid":"111","isId":"Y"},{"action":"Delete","gid":"111","isId":"N"}]}

When the table is not partitioned, the data is inserted successfully. But when the data is inserted into the partitioned table above, the job throws the following error:

org.apache.spark.SparkException: Task failed while writing rows.
    at org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:285)
.
.
.
Caused by: org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.protocol.FSLimitException$PathComponentTooLongException): The maximum path component name limit of hr=%7B%22group%22%7B%22action%22%3A%22Change%22,%22gid%22%22,%22isId%22%3A%22Y%22},%7B%22action%22Add%22,%22gid%111%22,%22isId%%22},%7B%22action%22%3A%22Delete%22,%22gid%2524%22,%22isId%22N%22}%5D} in directory /data/hdfs/tmp/test/.hive-staging_hive_2020-01-04_00-27-05_24_76879687968796-1/-ext-10000/_temporary/0/_temporary/attempt_20200104002705_0027_m_000000_0/dt=N is exceeded: limit=255 length=399
    at org.apache.hadoop.hdfs.server.namenode.FSDirectory.verifyMaxComponentLength(FSDirectory.java:1113)
I noticed that the error message contains several strings from the JSON data, such as group, Change, gid, etc. I am not sure whether this is related to the JSON data being inserted into col3_json.
Please advise.
