pyspark.pandas API - How to separate a column containing lists into multiple columns?

Posted by 5ssjco0h on 2022-11-01 in Spark


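In my Databricks notebook, I am trying to split a column whose values are lists such as [599086.9706961295, 4503107.843920314] into two columns ("x" and "y").
In my Jupyter notebook, the column is split like this: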


# Code from my Jupyter notebook
# The column containing the lists is: xy

# Method 1
complete[['x', 'y']] = pd.Series(np.stack(complete['xy'].values).T.tolist())
# The column is also split using this method

# Method 2
def sepXY(xy):
    return xy[0], xy[1]

complete['x'], complete['y'] = zip(*complete['xy'].apply(sepXY))

In my Databricks notebook, I get errors.
I tried both methods:

import pyspark.pandas as ps

# Method 1

complete[['x', 'y']] = ps.Series(np.stack(complete['xy'].values).T.tolist())

This raises an error.
If I run only ps.Series(np.stack(complete['xy'].values).T.tolist()), I get output containing two lists, one of x values and one of y values:

0    [599086.9706961295, 599079.1456765212, 599059....
1    [4503107.843920314, 4503083.465809557, 4503024...

But when I assign it to complete[['x', 'y']], it throws an error.
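
The two-row output follows from the transpose: stacking N pairs gives an N x 2 array, and `.T` turns it into 2 x N, so the Series holds one list of all x values and one list of all y values rather than one (x, y) pair per row. A minimal pure-Python sketch of the same reshaping, using `zip` as a stand-in for `np.stack(...).T` (the variable names `rows` and `transposed` are illustrative, not from the original post):

```python
# Three of the sample rows from the question's xy column.
rows = [
    [599086.9706961295, 4503107.843920314],
    [599088.5389507986, 4503112.7796745915],
    [599072.8088083105, 4503064.139248001],
]

# zip(*rows) transposes the N x 2 layout into 2 x N, which mirrors
# what np.stack(rows).T.tolist() produces.
transposed = [list(col) for col in zip(*rows)]

print(len(rows))        # number of records
print(len(transposed))  # number of coordinates (x and y)
```

The value being assigned therefore has two entries no matter how many rows `complete` has, which may be why the multi-column assignment is rejected.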


# Method 2

def sepXY(xy):
    return xy[0],xy[1]

complete['x'],complete['y'] = zip(*complete['xy'].apply(sepXY))

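ArrowInvalid: Could not convert (599086.9706961295, 4503107.843920314) with type tuple: did not recognize Python value type when inferring an Arrow data type
I checked the data type of the column, and it is not a tuple.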
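
The tuple in that message probably does not come from the column itself. `sepXY` returns `xy[0], xy[1]`, which Python packs into a tuple, so each value produced by `apply` is a tuple even when the input is a list. A quick check in plain Python, no Spark required (the names `sample` and `result` are illustrative):

```python
def sepXY(xy):
    # A comma-separated return value is packed into a single tuple.
    return xy[0], xy[1]

sample = [599086.9706961295, 4503107.843920314]
result = sepXY(sample)

print(type(sample).__name__)  # type of the input value
print(type(result).__name__)  # type of what apply() would collect
```

Returning one scalar per call, rather than a pair, avoids handing tuples to Arrow.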
I also tried:

complete[['x','y']] = pd.DataFrame(complete.xy.tolist(), index= complete.index)

If I use this, my kernel restarts.


# Sample of the column

xy
[599086.9706961295, 4503107.843920314]
[599088.5389507986, 4503112.7796745915]
[599072.8088083105, 4503064.139248001]
[599090.0996424126, 4503117.721156018]
[599074.3909188313, 4503068.925677084]

pyspark

Source: https://stackoverflow.com/questions/74015868/pyspark-pandas-api-how-to-seperate-column-with-list-into-multiple-columns

1 Answer


np8igboo #1

Input:

complete = spark.createDataFrame(
    [([599086.9706961295, 4503107.843920314],),
     ([599088.5389507986, 4503112.7796745915],),
     ([599072.8088083105, 4503064.139248001],),
     ([599090.0996424126, 4503117.721156018],),
     ([599074.3909188313, 4503068.925677084],)],
    ['xy']
).pandas_api()

With the example above, you can do this:

complete['x'] = complete['xy'].apply(lambda x: x[0])
complete['y'] = complete['xy'].apply(lambda x: x[1])

print(complete)
#                                         xy              x             y
# 0   [599086.9706961295, 4503107.843920314]  599086.970696  4.503108e+06
# 1  [599088.5389507986, 4503112.7796745915]  599088.538951  4.503113e+06
# 2   [599072.8088083105, 4503064.139248001]  599072.808808  4.503064e+06
# 3   [599090.0996424126, 4503117.721156018]  599090.099642  4.503118e+06
# 4   [599074.3909188313, 4503068.925677084]  599074.390919  4.503069e+06

print(complete.dtypes)
# xy     object
# x     float64
# y     float64
# dtype: object
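
A possible refinement, sketched under assumptions: pandas-on-Spark's `Series.apply` can use a return type hint on the function instead of running it once to infer the result type, so named helpers annotated with `-> float` may be preferable to bare lambdas. The helper names `get_x`, `get_y`, and `split_xy` are hypothetical, and `split_xy` assumes `psdf` is a pandas-on-Spark DataFrame such as `complete`:

```python
def get_x(xy) -> float:
    # First element of the pair; the hint declares the result type.
    return float(xy[0])


def get_y(xy) -> float:
    # Second element of the pair.
    return float(xy[1])


def split_xy(psdf, source="xy"):
    # Assumption: psdf supports column assignment and Series.apply,
    # as a pandas-on-Spark (or plain pandas) DataFrame does.
    psdf["x"] = psdf[source].apply(get_x)
    psdf["y"] = psdf[source].apply(get_y)
    return psdf
```

With the input above, this would be called as `complete = split_xy(complete)`.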
