How to map column names to values in PySpark

6qqygrtg asked on 2021-06-25 in Hive

What I want is to turn the column names into keys. For example, this DataFrame:


# +-------+----------+
# |key1   |key2      |
# +-------+----------+
# |value1 |value2    |
# |value3 |value4    |
# +-------+----------+

would be transformed into:


# +-------+----------+
# |   keys|values    |
# +-------+----------+
# |key1   |value1    |
# |key1   |value2    |
# |key2   |value3    |
# |key2   |value4    |
# +-------+----------+

In HiveQL I can write something like:

select distinct key, value
    from xxx
    lateral view explode(map(
            'key1', key1,
            'key2', key2)) tab as key, value

But how do I write this in PySpark? I could register a temporary table, but I don't think that's the best solution.


sycxhyv7 1#

Something like this?

select 'key1' as keys,
       key1 as values
from xxx
union all 
select 'key2' as keys,
       key2 as values
from xxx

Run it via spark.sql().


tjvv9vkg 2#

Use the create_map function to build a map column, then explode it. create_map expects a flat list of column expressions arranged as key-value pairs; you can build such a list with a comprehension over the DataFrame's columns:

from itertools import chain
from pyspark.sql.functions import col, lit, create_map, explode

data = [("value1", "value2"), ("value3", "value4")]
df = spark.createDataFrame(data, ["key1", "key2"])

# Flat list of alternating (literal column name, column value) expressions:
# lit("key1"), col("key1"), lit("key2"), col("key2")
key_values = create_map(*chain(*[(lit(name), col(name)) for name in df.columns]))

df.select(explode(key_values)).show()

+----+------+
| key| value|
+----+------+
|key1|value1|
|key2|value2|
|key1|value3|
|key2|value4|
+----+------+
