I know Hive/Hadoop is not well suited to UPDATE/DELETE, but my requirement is to update the person20 table based on data in the person21 table. As Hive on ORC has evolved it now supports ACID transactions, but that support still looks immature.
$ hive --version
Hive 1.1.0-cdh5.6.0
Below are the detailed steps I performed to test the update logic.
CREATE TABLE person20(
persid int,
lastname string,
firstname string)
CLUSTERED BY (
persid)
INTO 1 BUCKETS
ROW FORMAT SERDE
'org.apache.hadoop.hive.ql.io.orc.OrcSerde'
STORED AS INPUTFORMAT
'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat'
LOCATION
'hdfs://hostname.com:8020/user/hive/warehouse/person20'
TBLPROPERTIES (
'COLUMN_STATS_ACCURATE'='true',
'numFiles'='3',
'numRows'='2',
'rawDataSize'='348',
'totalSize'='1730',
'transactional'='true',
'transient_lastDdlTime'='1489668385')
Insert statement:
INSERT INTO TABLE person20 VALUES (0,'PP','B'),(2,'X','Y');
Select statement:
set hive.cli.print.header=true;
select * from person20;
persid lastname firstname
2 X Y
0 PP B
I also have another table, person21, which is a copy of person20:
CREATE TABLE person21(
persid int,
lastname string,
firstname string)
ROW FORMAT SERDE
'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
STORED AS INPUTFORMAT
'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
'hdfs://hostname.com:8020/user/hive/warehouse/person21'
TBLPROPERTIES (
'COLUMN_STATS_ACCURATE'='true',
'numFiles'='1',
'numRows'='2',
'rawDataSize'='11',
'totalSize'='13',
'transient_lastDdlTime'='1489668344')
Insert statement:
INSERT INTO TABLE person21 VALUES (0,'SS','B'),(2,'X1','Y');
Select statement:
select * from person21;
persid lastname firstname
2 X1 Y
0 SS B
I want to implement the following merge logic:
MERGE INTO person20 p20 USING person21 p21
ON (p20.persid = p21.persid)
WHEN MATCHED THEN
UPDATE SET lastname = p21.lastname
But MERGE does not work with my Hive 1.1.0-cdh5.6.0; it only becomes available starting with Hive 2.2.
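For reference, on Hive 2.2+ the full statement could also insert unmatched rows. A sketch (this requires an ACID transactional target table, and Hive's MERGE does not allow the target alias on the left-hand side of SET):

```sql
-- Hive 2.2+ only; person20 must be a transactional (ACID) table
MERGE INTO person20 p20
USING person21 p21
ON p20.persid = p21.persid
WHEN MATCHED THEN
  UPDATE SET lastname = p21.lastname
WHEN NOT MATCHED THEN
  INSERT VALUES (p21.persid, p21.lastname, p21.firstname);
```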
Another option is an UPDATE with a correlated subquery:
hive -e "set hive.auto.convert.join.noconditionaltask.size = 10000000; set hive.support.concurrency = true; set hive.enforce.bucketing = true; set hive.exec.dynamic.partition.mode = nonstrict; set hive.txn.manager = org.apache.hadoop.hive.ql.lockmgr.DbTxnManager; set hive.compactor.initiator.on = true;
set hive.compactor.worker.threads = 1 ; UPDATE person20 SET lastname = (select lastname from person21 where person21.persid=person20.persid);"
This statement gives the following error:
Logging initialized using configuration in jar:file:/usr/lib/hive/lib/hive-common-1.1.0-cdh5.6.0.jar!/hive-log4j.properties
NoViableAltException(224@[400:1: precedenceEqualExpression : ... ])
    at org.antlr.runtime.DFA.noViableAlt(DFA.java:158)
    at org.antlr.runtime.DFA.predict(DFA.java:116)
    at org.apache.hadoop.hive.ql.parse.HiveParser_IdentifiersParser.precedenceEqualExpression(HiveParser_IdentifiersParser.java:8651)
    at org.apache.hadoop.hive.ql.parse.HiveParser_IdentifiersParser.precedenceNotExpression(HiveParser_IdentifiersParser.java:9673)
    at org.apache.hadoop.hive.ql.parse.HiveParser_IdentifiersParser.precedenceAndExpression(HiveParser_IdentifiersParser.java:9792)
    at org.apache.hadoop.hive.ql.parse.HiveParser_IdentifiersParser.precedenceOrExpression(HiveParser_IdentifiersParser.java:9951)
    at org.apache.hadoop.hive.ql.parse.HiveParser_IdentifiersParser.expression(HiveParser_IdentifiersParser.java:6567)
    at org.apache.hadoop.hive.ql.parse.HiveParser_IdentifiersParser.atomExpression(HiveParser_IdentifiersParser.java:6791)
    at org.apache.hadoop.hive.ql.parse.HiveParser_IdentifiersParser.precedenceFieldExpression(HiveParser_IdentifiersParser.java:6862)
    at org.apache.hadoop.hive.ql.parse.HiveParser_IdentifiersParser.precedenceUnaryPrefixExpression(HiveParser_IdentifiersParser.java:7247)
    at org.apache.hadoop.hive.ql.parse.HiveParser_IdentifiersParser.precedenceUnarySuffixExpression(HiveParser_IdentifiersParser.java:7307)
    at org.apache.hadoop.hive.ql.parse.HiveParser_IdentifiersParser.precedenceBitwiseXorExpression(HiveParser_IdentifiersParser.java:7491)
    at org.apache.hadoop.hive.ql.parse.HiveParser_IdentifiersParser.precedenceStarExpression(HiveParser_IdentifiersParser.java:7651)
    at org.apache.hadoop.hive.ql.parse.HiveParser_IdentifiersParser.precedencePlusExpression(HiveParser_IdentifiersParser.java:7811)
    at org.apache.hadoop.hive.ql.parse.HiveParser.precedencePlusExpression(HiveParser.java:44550)
    at org.apache.hadoop.hive.ql.parse.HiveParser.columnAssignmentClause(HiveParser.java:44206)
    at org.apache.hadoop.hive.ql.parse.HiveParser.setColumnsClause(HiveParser.java:44271)
    at org.apache.hadoop.hive.ql.parse.HiveParser.updateStatement(HiveParser.java:44417)
    at org.apache.hadoop.hive.ql.parse.HiveParser.execStatement(HiveParser.java:1616)
    at org.apache.hadoop.hive.ql.parse.HiveParser.statement(HiveParser.java:1062)
    at org.apache.hadoop.hive.ql.parse.ParseDriver.parse(ParseDriver.java:201)
    at org.apache.hadoop.hive.ql.parse.ParseDriver.parse(ParseDriver.java:166)
    at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:404)
    at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:305)
    at org.apache.hadoop.hive.ql.Driver.compileInternal(Driver.java:1119)
    at org.apache.hadoop.hive.ql.Driver.runInternal(Driver.java:1167)
    at org.apache.hadoop.hive.ql.Driver.run(Driver.java:1055)
    at org.apache.hadoop.hive.ql.Driver.run(Driver.java:1045)
    at org.apache.hadoop.hive.cli.CliDriver.processLocalCmd(CliDriver.java:207)
    at org.apache.hadoop.hive.cli.CliDriver.processCmd(CliDriver.java:159)
    at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:370)
    at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:305)
    at org.apache.hadoop.hive.cli.CliDriver.executeDriver(CliDriver.java:702)
    at org.apache.hadoop.hive.cli.CliDriver.run(CliDriver.java:675)
    at org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:615)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at org.apache.hadoop.util.RunJar.run(RunJar.java:221)
    at org.apache.hadoop.util.RunJar.main(RunJar.java:136)
FAILED: ParseException line 1:33 cannot recognize input near 'select' 'lastname' 'from' in expression specification
I think it does not support subqueries. The same statement works fine with a constant:
hive -e "set hive.auto.convert.join.noconditionaltask.size = 10000000; set hive.support.concurrency = true; set hive.enforce.bucketing = true; set hive.exec.dynamic.partition.mode = nonstrict; set hive.txn.manager = org.apache.hadoop.hive.ql.lockmgr.DbTxnManager; set hive.compactor.initiator.on = true;
set hive.compactor.worker.threads = 1 ; UPDATE person20 SET lastname = 'PP' WHERE persid = 0;"
-- This statement successfully updates the record.
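For completeness, one join-based rewrite I could try, if person20 were re-created as a plain non-transactional table so it can be overwritten, is to express the update as a full rewrite (a sketch, not tested on my version):

```sql
-- Rewrite the whole table: take lastname from person21 when a matching
-- persid exists, otherwise keep the existing value.
-- Requires a non-ACID person20 (INSERT OVERWRITE on a transactional table is not supported).
INSERT OVERWRITE TABLE person20
SELECT p20.persid,
       COALESCE(p21.lastname, p20.lastname) AS lastname,
       p20.firstname
FROM person20 p20
LEFT OUTER JOIN person21 p21
  ON p20.persid = p21.persid;
```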
Could you help me find the best strategy for performing DML/merge operations in Hive?
1 Answer
You can do it by brute force:
Re-create the table person20, but non-ACID, partitioned on a dummy column, with everything in a single partition for the value 'dummy'; populate person20 and person21.
Create a work table tmpperson20 with exactly the same structure and the same 'dummy' partition as person20.
INSERT INTO tmpperson20 PARTITION (dummy='dummy') SELECT p20.persid, p21.lastname, ... FROM person20 p20 JOIN person21 p21 ON p20.persid=p21.persid;
INSERT INTO tmpperson20 PARTITION (dummy='dummy') SELECT * FROM person20 p20 WHERE NOT EXISTS (SELECT p21.persid FROM person21 p21 WHERE p20.persid=p21.persid);
ALTER TABLE person20 DROP PARTITION (dummy='dummy');
ALTER TABLE person20 EXCHANGE PARTITION (dummy='dummy') WITH TABLE tmpperson20;
Now you can drop tmpperson20.
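The re-created tables in the first two steps might look like the following sketch (the dummy partition column name and ORC storage are my own choices; EXCHANGE PARTITION requires the two tables to have identical schemas):

```sql
-- Hypothetical re-creation of person20 as a non-ACID, dummy-partitioned table
CREATE TABLE person20 (
  persid int,
  lastname string,
  firstname string)
PARTITIONED BY (dummy string)
STORED AS ORC;

-- Work table with the exact same structure, required for EXCHANGE PARTITION
CREATE TABLE tmpperson20 (
  persid int,
  lastname string,
  firstname string)
PARTITIONED BY (dummy string)
STORED AS ORC;
```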
However, this may be trickier with an ACID table because of the buckets. You could also try a procedural language that iterates over a cursor and applies single-row updates in a loop, but that is very inefficient for bulk updates...
The HPL/SQL utility ships with Hive 2.x and might be installable on top of Hive 1.x, but I never had a chance to try it. And its Oracle dialect feels quite alien in Hive...!
Or you could develop some custom Java code with JDBC, using a ResultSet and a PreparedStatement in a loop.