Removing duplicate join columns in a Hive join

Posted by but5z9lq on 2021-06-26 in Hive

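I am performing a join in Hive:

    select * from
      (select * from
        (select * from A join B on A.x = B.x) t1
      join C on t1.y = C.y) t2
    join D on t2.x = D.x

I get an error that column x cannot be resolved, because both A and B contain a column x. How should I use qualified names here, or is there a way to drop the duplicate columns in Hive?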

iq3niunx #1

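I ran into exactly the same problem. My solution was to rename the duplicated columns by re-creating the DataFrame with a modified schema. Here is some sample code: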

    import scala.collection.mutable
    import org.apache.spark.sql.DataFrame
    import org.apache.spark.sql.types.StructType

    def renameDuplicatedColumns(df: DataFrame): DataFrame = {
      // Names that occur more than once in the DataFrame
      val duplicatedColumns = df.columns
        .groupBy(identity)
        .filter(_._2.length > 1)
        .keys
        .toSet
      // Next suffix to use for each duplicated name, starting at 0
      val newIndexes = mutable.Map[String, Int]().withDefaultValue(0)
      val schema: StructType = StructType(
        df.schema.map {
          case field if duplicatedColumns.contains(field.name) =>
            // Append "__<n>", so two columns named x become x__0 and x__1
            val idx = newIndexes(field.name)
            newIndexes.update(field.name, idx + 1)
            field.copy(name = field.name + "__" + idx)
          case field =>
            field
        }
      )
      // Rebuild the DataFrame from the same rows with the renamed schema
      df.sqlContext.createDataFrame(df.rdd, schema)
    }
6kkfgxo0 #2

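You can do something like the following, but it means you can no longer use special characters in column names, because with this setting Hive treats backquoted names as regular expressions.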

    set hive.support.quoted.identifiers=none;

    select * from
      (select C.*, t1.`(y)?+.+` from
        (select A.*, B.`(x)?+.+` from A join B on A.x = B.x) t1
      join C on t1.y = C.y) t2
    join D on t2.x = D.x

https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Select#LanguageManualSelect-REGEXColumnSpecification
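To make the regex trick above easier to read, here is a minimal sketch of the column specification on a single table. It assumes table B has columns x and y plus some others; that layout is an assumption, since the question does not list the schemas.

```sql
-- Backquoted names are only treated as regular expressions with this setting
SET hive.support.quoted.identifiers=none;

-- Every column of B except x: the possessive (x)?+ consumes a leading "x",
-- and .+ then needs at least one more character, so the name "x" alone fails
SELECT `(x)?+.+` FROM B;

-- Alternation excludes several columns at once (here x and y; y in B is assumed)
SELECT `(x|y)?+.+` FROM B;

-- Restore the default behaviour afterwards
SET hive.support.quoted.identifiers=column;
```

The pattern excludes only a column whose whole name is x; a column such as xy would still be selected.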

km0tfn4u #3

Because tables A and B both have a column x, you have to give that column an alias in this select:

    select * from A join B on A.x = B.x

Something like this:

    select A.x as x1, B.x as x2, ...
    from A join B on A.x = B.x
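Applied to the whole query from the question, the aliasing approach might look like the sketch below. The schemas are hypothetical, since the question only tells us that A and B share x: A(x, a_val), B(x, y, b_val), C(y, c_val), D(x, d_val). Because the inner join already guarantees A.x = B.x, only one copy of x is kept.

```sql
-- Hypothetical schemas (not given in the question):
--   A(x, a_val), B(x, y, b_val), C(y, c_val), D(x, d_val)
SELECT t2.x, t2.y, t2.a_val, t2.b_val, t2.c_val, D.d_val
FROM (
  SELECT t1.x, t1.y, t1.a_val, t1.b_val, C.c_val
  FROM (
    -- List the columns explicitly so t1 exposes exactly one column named x
    SELECT A.x AS x, B.y AS y, A.a_val, B.b_val
    FROM A JOIN B ON A.x = B.x
  ) t1
  JOIN C ON t1.y = C.y
) t2
JOIN D ON t2.x = D.x;
```

Listing columns is more verbose than `select *`, but every subquery then exposes unambiguous names, so the outer `t1.y` and `t2.x` references resolve.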
