串联PandasDataFrame中的列值,同时忽略NaN

kmbjn2e3  于 2023-01-07  发布在  其他
关注(0)|答案(4)|浏览(161)

我有一个下面的Pandas表
DF:

EVNT_ID col1 col2 col3 col4
 123454   1    Nan   4    5
 628392   Nan   3   Nan   7
 293899   2    Nan  Nan   6
 127820   9    11    12   19

现在,我尝试将除第一列之外的所有列进行concat,并希望数据框的外观如下所示
新定义:

EVNT_ID col1 col2 col3 col4 new_col
 123454   1    Nan   4    5   1|4|5
 628392   Nan   3   Nan   7    3|7
 293899   2    Nan  Nan   6    2|6
 127820   9    11    12   19  9|11|12|19

我正在使用以下代码

df['new_column'] = df[~df.EVNT_ID].apply(lambda x: '|'.join(x.dropna().astype(str).values), axis=1)

但它给出了以下错误
第一个月
如果有人能告诉我我错在哪里,我会非常感激。

vnjpjtjt

vnjpjtjt1#

请尝试以下代码:

df['new_col'] = df.iloc[:, 1:].apply(lambda x:
    '|'.join(str(el) for el in x if str(el) != 'nan'), axis=1)

最初我考虑使用x.dropna()而不是x if str(el) != 'nan',但是%timeit显示dropna()的工作速度要慢得多。

tyg4sfes

tyg4sfes2#

您可以使用filteragg执行此操作:

df.filter(like='col').agg(
    lambda x: x.dropna().astype(int).astype(str).str.cat(sep='|'), axis=1)

0         1|4|5
1           3|7
2           2|6
3    9|11|12|19
dtype: object

或者,

df.drop('EVNT_ID', 1).agg(
        lambda x: x.dropna().astype(int).astype(str).str.cat(sep='|'), axis=1)

0         1|4|5
1           3|7
2           2|6
3    9|11|12|19
dtype: object

如果性能很重要,可以使用列表解析:

joined = [
    '|'.join([str(int(x)) for x in r if pd.notna(x)]) 
    for r in df.iloc[:,1:].values.tolist()
]
joined
# ['1|4|5', '3|7', '2|6', '9|11|12|19']

df.assign(new_col=joined)   

   EVNT_ID  col1  col2  col3  col4     new_col
0   123454   1.0   NaN   4.0     5       1|4|5
1   628392   NaN   3.0   NaN     7         3|7
2   293899   2.0   NaN   NaN     6         2|6
3   127820   9.0  11.0  12.0    19  9|11|12|19

如果您可以原谅为DataFrame赋值的开销,下面是两个最快解决方案的时间安排。

df = pd.concat([df] * 1000, ignore_index=True)

# In this post.
%%timeit
[
     '|'.join([str(int(x)) for x in r if pd.notna(x)]) 
     for r in df.iloc[:,1:].values.tolist()
]
# RafaelC's answer.
%%timeit
[
    '|'.join([k for k in a if k])
    for a in zip(*df.fillna('').astype(str).iloc[:, 1:].values.tolist())
]    

31.9 ms ± 800 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
23.7 ms ± 409 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

不过请注意,由于@RafaelC的代码生成浮点数,因此答案并不相同:['1.0|2.0|9.0', '3.0|11.0', ...]。如果这是好的,那么很好。否则你需要转换成int,这会增加更多的开销。

8aqjt8rx

8aqjt8rx3#

使用列表解析和zip

>>> [['|'.join([k for k in a if k])] for a in zip(*df.fillna('').astype(str).iloc[:, 1:].values)]

时机似乎不错

df = pd.concat([df]*1000)

%timeit [['|'.join([k for k in a if k])] for a in zip(*df.fillna('').astype(str).iloc[:, 1:].values)]
10.8 ms ± 568 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

%timeit df.filter(like='col').agg(lambda x: x.dropna().astype(int).astype(str).str.cat(sep='|'), axis=1)
1.68 s ± 91.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%timeit df.iloc[:, 1:].apply(lambda x: '|'.join(str(el) for el in x if str(el) != 'nan'), axis=1)
87.8 ms ± 5.01 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

%timeit df.assign(new_col=['|'.join([str(int(x)) for x in r if ~np.isnan(x)]) for r in df.iloc[:,1:].values])
45.1 ms ± 1.38 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
axzmvihb

axzmvihb4#

import time
import timeit
from pandas import DataFrame
import numpy as np
import pandas as pd
from datetime import datetime

df = pd.DataFrame({
    'date' : ['05/9/2023', '07/10/2023', '08/11/2023', '06/12/2023'],
    'A' : [1,  np.nan,4, 7],
    'B' : [2, np.nan, 5, 8],
    'C' : [3, 6, 9, np.nan]
}).set_index('date')

print(df)

print('.........')

start_time = datetime.now()
df['ColumnA'] = df[df.columns].agg(
    lambda x: ','.join(x.dropna().astype(str)),
    axis=1
)
print(df['ColumnA'])

end_time = datetime.now()
print('Duration: {}'.format(end_time - start_time))

"""
              A    B    C
date                     
05/9/2023   1.0  2.0  3.0
07/10/2023  NaN  NaN  6.0
08/11/2023  4.0  5.0  9.0
06/12/2023  7.0  8.0  NaN
...........................
OUTPUT:

date
05/9/2023     1.0,2.0,3.0
07/10/2023            6.0
08/11/2023    4.0,5.0,9.0
06/12/2023        7.0,8.0
Name: ColumnA, dtype: object
Duration: 0:00:00.002998

"""

相关问题