Pandas merge two DataFrames with different columns

Posted by jjhzyzn0 on 2022-12-21 in Other

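I'm surely missing something simple here. I'm trying to merge two DataFrames in pandas that have mostly the same column names, but the right DataFrame has some columns that the left one doesn't, and vice versa.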

>df_may

  id  quantity  attr_1  attr_2
0  1        20       0       1
1  2        23       1       1
2  3        19       1       1
3  4        19       0       0

>df_jun

  id  quantity  attr_1  attr_3
0  5         8       1       0
1  6        13       0       1
2  7        20       1       1
3  8        25       1       1

I tried joining them with an outer join:

mayjundf = pd.DataFrame.merge(df_may, df_jun, how="outer")

But that yields:

Left data columns not unique: Index([....

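I also tried specifying a single column to join on (e.g. on = "id"), but that duplicates every column except id, producing names like attr_1_x and attr_1_y, which is not ideal. I also tried passing the entire list of columns (there are many) to on: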

mayjundf = pd.DataFrame.merge(df_may, df_jun, how="outer", on=list(df_may.columns.values))

Which yields:

ValueError: Buffer has wrong number of dimensions (expected 1, got 2)

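What am I missing? I'd like to get a df with all rows appended, with attr_1, attr_2 and attr_3 populated where possible and NaN where they don't appear. This seems like a pretty typical data-munging workflow, but I'm stuck.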

pandas

Source: https://stackoverflow.com/questions/28097222/pandas-merge-two-dataframes-with-different-columns

3 Answers

uoifb46i1#

I think concat is what you want in this case:

In [12]:

pd.concat([df,df1], axis=0, ignore_index=True)
Out[12]:
   attr_1  attr_2  attr_3  id  quantity
0       0       1     NaN   1        20
1       1       1     NaN   2        23
2       1       1     NaN   3        19
3       0       0     NaN   4        19
4       1     NaN       0   5         8
5       0     NaN       1   6        13
6       1     NaN       1   7        20
7       1     NaN       1   8        25

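By passing axis=0 here you stack the DataFrames on top of each other, which I believe is what you want. NaN values are then produced wherever a column is absent from one of the DataFrames.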
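For reference, here is a minimal, self-contained sketch of the same call applied to the frames from the question. The df_may and df_jun values are copied from the question; the result name mayjundf is borrowed from the asker's code. Column order in the printed result can differ between pandas versions, so no output is shown.

```python
import pandas as pd

# Frames reconstructed from the question
df_may = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "quantity": [20, 23, 19, 19],
    "attr_1": [0, 1, 1, 0],
    "attr_2": [1, 1, 1, 0],
})
df_jun = pd.DataFrame({
    "id": [5, 6, 7, 8],
    "quantity": [8, 13, 20, 25],
    "attr_1": [1, 0, 1, 1],
    "attr_3": [0, 1, 1, 1],
})

# Stack the rows; columns are aligned by name, and a column that is
# missing from one frame is filled with NaN for that frame's rows.
mayjundf = pd.concat([df_may, df_jun], axis=0, ignore_index=True)
print(mayjundf)
```

Because attr_2 and attr_3 now contain NaN, pandas stores them as floating-point columns rather than integers.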

y53ybaqx2#

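The accepted answer will break if there are duplicate headers:
InvalidIndexError: Reindexing only valid with uniquely valued Index objects.
For example, here A has three trial columns, which prevents concat: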

A = pd.DataFrame([[3, 1, 4, 1]], columns=['id', 'trial', 'trial', 'trial'])
#    id  trial  trial  trial
# 0   3      1      4      1

B = pd.DataFrame([[5, 9], [2, 6]], columns=['id', 'trial'])
#    id  trial
# 0   5      9
# 1   2      6

pd.concat([A, B], ignore_index=True)
# InvalidIndexError: Reindexing only valid with uniquely valued Index objects

To fix this, deduplicate the column names before calling concat:

parser = pd.io.parsers.base_parser.ParserBase({'usecols': None})

for df in [A, B]:
    df.columns = parser._maybe_dedup_names(df.columns) 

pd.concat([A, B], ignore_index=True)
#    id  trial  trial.1  trial.2
# 0   3      1        4        1
# 1   5      9      NaN      NaN
# 2   2      6      NaN      NaN

Or as a one-liner, though it is less readable:

pd.concat([df.set_axis(parser._maybe_dedup_names(df.columns), axis=1) for df in [A, B]], ignore_index=True)

Note: for pandas < 1.3.0, use: parser = pd.io.parsers.ParserBase({})
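Since _maybe_dedup_names is a private method (note the leading underscore) and its location already differs between pandas versions, a version-independent alternative may be useful. The sketch below uses a hypothetical helper, dedup_columns, which is not part of pandas; it mimics the name, name.1, name.2 scheme shown above and assumes no existing column is already called something like trial.1.

```python
import pandas as pd

def dedup_columns(columns):
    """Rename repeated labels to name.1, name.2, ... (hypothetical helper)."""
    seen = {}
    result = []
    for name in columns:
        count = seen.get(name, 0)
        # First occurrence keeps its name; later ones get a numeric suffix
        result.append(name if count == 0 else f"{name}.{count}")
        seen[name] = count + 1
    return result

A = pd.DataFrame([[3, 1, 4, 1]], columns=["id", "trial", "trial", "trial"])
B = pd.DataFrame([[5, 9], [2, 6]], columns=["id", "trial"])

frames = []
for df in (A, B):
    renamed = df.copy()  # leave the original frames untouched
    renamed.columns = dedup_columns(df.columns)
    frames.append(renamed)

combined = pd.concat(frames, ignore_index=True, sort=False)
print(combined)
```

Working on copies keeps A and B unchanged, which differs from the loop above that renames the columns in place.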

jk9hmnmh3#

I ran into this problem today when using concat, append or merge. I worked around it by adding a sequentially numbered helper column and then performing an outer join:

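# Give every row across both frames a unique, sequential helper key
df1['helper'] = range(1, len(df1) + 1)
df2['helper'] = range(len(df1) + 1, len(df1) + len(df2) + 1)
# No helper value is shared, so the outer join keeps every row from both frames
df1.merge(df2, on='helper', how='outer')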
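Joining only on the helper column still produces _x and _y suffixes for every other column the two frames share. As an alternative sketch, assuming column names are unique within each frame, the outer join can be made on all shared columns so that no suffixes appear. The frames are the ones from the question; the names shared and merged are introduced here for illustration.

```python
import pandas as pd

# Frames reconstructed from the question
df_may = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "quantity": [20, 23, 19, 19],
    "attr_1": [0, 1, 1, 0],
    "attr_2": [1, 1, 1, 0],
})
df_jun = pd.DataFrame({
    "id": [5, 6, 7, 8],
    "quantity": [8, 13, 20, 25],
    "attr_1": [1, 0, 1, 1],
    "attr_3": [0, 1, 1, 1],
})

# Columns present in both frames, in df_may's order
shared = [col for col in df_may.columns if col in df_jun.columns]

# Outer join on every shared column: nothing is left over to receive
# an _x/_y suffix, and unmatched rows get NaN in the other frame's columns.
merged = df_may.merge(df_jun, on=shared, how="outer")
print(merged)
```

Unlike concat, a merge combines rows whose values match on every shared column into a single row, so the two approaches only agree when no such matches exist, as is the case for this data.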
