Removing duplicate join columns in a Hive join

Posted by but5z9lq on 2021-06-26 in Hive

I am performing a join in Hive:

select * from
  (select * from 
      (select * from A join B on A.x = B.x) t1
  join C on t1.y = C.y) t2
join D on t2.x = D.x

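I get an error that column x cannot be resolved, because both A and B contain a column x. How should I use qualified names here, or is there a way to drop the duplicate columns in Hive?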

iq3niunx  #1

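I ran into exactly the same problem. My solution was to rename the duplicated columns by re-creating the Spark DataFrame with a modified schema. Here is some sample code: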

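import scala.collection.mutable
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.types.StructType

// Appends "__<n>" to every column whose name occurs more than once.
def renameDuplicatedColumns(df: DataFrame): DataFrame = {
  // Names that appear more than once in the DataFrame
  val duplicatedColumns = df.columns
    .groupBy(identity)
    .filter(_._2.length > 1)
    .keys
    .toSet
  // Next suffix to use for each duplicated name
  val newIndexes = mutable.Map[String, Int]().withDefaultValue(0)

  val schema: StructType = StructType(
    df.schema.map {
      case field if duplicatedColumns.contains(field.name) =>
        val idx = newIndexes(field.name)
        newIndexes.update(field.name, idx + 1)
        field.copy(name = field.name + "__" + idx)
      case field =>
        field
    }
  )
  // Rebuild the DataFrame over the same rows with the de-duplicated schema
  df.sqlContext.createDataFrame(df.rdd, schema)
}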

6kkfgxo0  #2

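You can do something like the following. The catch is that you can no longer use special characters in column names, because with this setting a backquoted name is interpreted as a regular expression rather than as a quoted identifier.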

set hive.support.quoted.identifiers=none;
select * from
  (select C.*,t1.`(y)?+.+` from 
      (select A.*,B.`(x)?+.+` from A join B on A.x = B.x) t1
  join C on t1.y = C.y) t2
join D on t2.x = D.x

https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Select#LanguageManualSelect-REGEXColumnSpecification
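
To make the effect of the regex column specification concrete, here is a minimal sketch of the innermost join. The schemas A(x, y, a1) and B(x, b1) are assumptions for illustration only and do not come from the question. The pattern `(x)?+.+` is meant to match every column name except exactly x, so under those assumed schemas the two queries should select the same columns.

```sql
-- Hypothetical schemas (assumed, not from the question): A(x, y, a1), B(x, b1)
SET hive.support.quoted.identifiers=none;

-- All columns of A, plus every column of B whose name is not exactly "x"
SELECT A.*, B.`(x)?+.+`
FROM A JOIN B ON A.x = B.x;

-- With the assumed schemas, the explicit equivalent would be:
SELECT A.x, A.y, A.a1, B.b1
FROM A JOIN B ON A.x = B.x;
```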

km0tfn4u  #3

Because both table A and table B have a column x, you must give that column an alias in this select:

select * from A join B on A.x = B.x

Like this:

select A.x as x1, B.x as x2, ...
from A join B on A.x = B.x
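
Applied to the whole nested query from the question, the same idea might look like the sketch below. The schemas A(x, y, a1), B(x, b1), C(y, c1) and D(x, d1) are hypothetical, chosen only so that every level has something to list. Since the join keys are equal, each subquery here exposes the key once instead of aliasing both copies.

```sql
-- Hypothetical schemas (assumed): A(x, y, a1), B(x, b1), C(y, c1), D(x, d1)
SELECT t2.x, t2.y, t2.a1, t2.b1, t2.c1, D.d1
FROM (
  -- keep t1.y only; C.y is equal to it by the join condition
  SELECT t1.x, t1.y, t1.a1, t1.b1, C.c1
  FROM (
    -- keep A.x only; B.x is equal to it by the join condition
    SELECT A.x, A.y, A.a1, B.b1
    FROM A JOIN B ON A.x = B.x
  ) t1
  JOIN C ON t1.y = C.y
) t2
JOIN D ON t2.x = D.x;
```

Listing columns explicitly is more verbose than `select *`, but each subquery then has unambiguous column names, so the outer references to t1.y and t2.x can be resolved.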
