union

cmssoen2  于 2021-07-12  发布在  Spark
关注(0)|答案(1)|浏览(278)

我正在尝试在两个列数不同的表之间执行unionall。
数据样本
电影功能可能性20.379860.5023140.835210.292560.99250.706280.3270.764
MovieidTitleGenres1电视故事(1995)冒险,动画,儿童2朱曼吉(1995)儿童3等待呼气(1995)喜剧,戏剧,浪漫4新娘之父第二部分(1995)喜剧5萨布丽娜(1995)浪漫,喜剧
根目录:
文件1

|-- MovieIdId: integer (nullable = false)
   |-- features: Double (nullable = false)
   |-- probability: Double (nullable = false)

文件2

|--MovieId: integer (nullable = false)
   |--title: String (nullable = false)
   |-- genres: String (nullable = false)

代码:`

`val Data = sc.textFile("file:///usr/local/File1").map(_.split(",")).map(p => DataClass(p(0).trim.toInt, p(1).trim, p(2).trim)).toDF()
Data.registerTempTable("Genres")
val Data2 = sc.textFile("file:///usr/local/File2").map(_.split(",")).map(p =>Movie(p(0).trim.toInt, p(1).trim, p(1).trim)).toDF()
Data.registerTempTable("Tags")
val df=sqlContext.sql("SELECT MovieId, title, genres FROM Genres UNION ALL SELECT MovieId, features, probability FROM Tags")

更新:
我想做一个表格,里面有电影ID,特征,概率,类型。
如何合并两个表中的两列而不重复?

mbzjlibv

mbzjlibv1#

我想你想要一个完整的外部连接而不是联合:

val df = Data.join(Data2, Seq("Movieid"), "full")

df.show
//+-------+--------+-----------+----------------------------------+----------------------------+
//|Movieid|features|probability|title                             |genres                      |
//+-------+--------+-----------+----------------------------------+----------------------------+
//|1      |null    |null       |Toy Story (1995)                  |Adventure,Animation,children|
//|6      |0.9925  |0.7062     |null                              |null                        |
//|3      |null    |null       |Waiting to Exhale (1995)          |comedy,Drama,Romance        |
//|5      |null    |null       |Sabrina (1995)                    |Romance,comedy              |
//|4      |0.83521 |0.2925     |Father of the Bride Part II (1995)|comedy                      |
//|8      |0.327   |0.764      |null                              |null                        |
//|2      |0.37986 |0.50231    |Jumanji (1995)                    |Children                    |
//+-------+--------+-----------+----------------------------------+----------------------------+

或使用sql:

val df = spark.sql("""
  select *
  from Genres g
  full outer join Tags t
  on g.Movieid = t.Movieid
""")

相关问题