Change the first element of each group in a Pandas DataFrame

Posted by cgvd09ve on 2022-12-16 in Other

I want to make sure that the first val2 value for each vintage is NaN. Two of them already are NaN, but I also want 0.53 changed to NaN.

import numpy as np
import pandas as pd

df = pd.DataFrame({
    'vintage': ['2017-01-01', '2017-01-01', '2017-01-01', '2017-02-01', '2017-02-01', '2017-03-01'],
    'date': ['2017-01-01', '2017-02-01', '2017-03-01', '2017-02-01', '2017-03-01', '2017-03-01'],
    'val1': [0.59, 0.68, 0.8, 0.54, 0.61, 0.6],
    'val2': [np.nan, 0.66, 0.81, 0.53, 0.62, np.nan]
})

Here is what I have tried so far:

df.groupby('vintage').first().val2 #This gives the first non-NaN values, as shown below

vintage
2017-01-01    0.66
2017-02-01    0.53
2017-03-01     NaN

df.groupby('vintage').first().val2 = np.nan #This doesn't change anything
df.val2

0     NaN
1    0.66
2    0.81
3    0.53
4    0.62
5     NaN

jei2mxaa #1

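You cannot assign to the result of an aggregation, and first skips existing NaN values. What you can do instead is call head(1), which returns the first row of each group, and pass its index to loc to select those rows in the original df and overwrite the value in that column: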

In[91]:
df.loc[df.groupby('vintage')['val2'].head(1).index, 'val2'] = np.nan
df

Out[91]: 
         date  val1  val2     vintage
0  2017-01-01  0.59   NaN  2017-01-01
1  2017-02-01  0.68  0.66  2017-01-01
2  2017-03-01  0.80  0.81  2017-01-01
3  2017-02-01  0.54   NaN  2017-02-01
4  2017-03-01  0.61  0.62  2017-02-01
5  2017-03-01  0.60   NaN  2017-03-01

Here you can see that head(1) returns the first row of each group:

In[94]:
df.groupby('vintage')['val2'].head(1)
Out[94]: 
0     NaN
3    0.53
5     NaN
Name: val2, dtype: float64

By contrast, first returns the first non-NaN value of each group, unless the group contains only NaN values:

In[95]:
df.groupby('vintage')['val2'].first()

Out[95]: 
vintage
2017-01-01    0.66
2017-02-01    0.53
2017-03-01     NaN
Name: val2, dtype: float64
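
For reference, here is a minimal self-contained sketch of the head(1) approach described above. The frame is trimmed to the two relevant columns, and the variable name first_rows is only illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'vintage': ['2017-01-01', '2017-01-01', '2017-01-01',
                '2017-02-01', '2017-02-01', '2017-03-01'],
    'val2': [np.nan, 0.66, 0.81, 0.53, 0.62, np.nan],
})

# Index labels of the first row of every group, whether or not it is NaN.
first_rows = df.groupby('vintage')['val2'].head(1).index

# Assign through .loc on the original frame so the change is written back.
df.loc[first_rows, 'val2'] = np.nan

print(df)
```

Note that this selects rows by index label, so it assumes the index labels are unique. With duplicated labels, loc would overwrite every row sharing a selected label.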

70gysomp #2

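Alternatively, compute each row's position within its group with cumcount, select the rows at position 0, and set their val2 to np.nan: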

df.loc[df.groupby('vintage').vintage.cumcount()==0,'val2']=np.nan
df
Out[154]: 
         date  val1  val2     vintage
0  2017-01-01  0.59   NaN  2017-01-01
1  2017-02-01  0.68  0.66  2017-01-01
2  2017-03-01  0.80  0.81  2017-01-01
3  2017-02-01  0.54   NaN  2017-02-01
4  2017-03-01  0.61  0.62  2017-02-01
5  2017-03-01  0.60   NaN  2017-03-01
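
To make the intermediate steps visible, the cumcount approach can be broken apart as in the sketch below. The names pos and is_first are illustrative and not part of the original answer.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'vintage': ['2017-01-01', '2017-01-01', '2017-01-01',
                '2017-02-01', '2017-02-01', '2017-03-01'],
    'val2': [np.nan, 0.66, 0.81, 0.53, 0.62, np.nan],
})

# Position of each row inside its group, counted in order of appearance.
pos = df.groupby('vintage').cumcount()

# Boolean mask that is True only on the first row of each group.
is_first = pos == 0

# Overwrite val2 on the masked rows of the original frame.
df.loc[is_first, 'val2'] = np.nan

print(pos.tolist())
print(df)
```

Because cumcount numbers rows in their order of appearance within each group, the mask should still mark the first row of each group when the groups are interleaved rather than contiguous.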

to94eoyn #3

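I think you could also write it like this:

def h(x):
    x = x.copy()  # work on a copy rather than mutating the group in place
    # Set row 0 of 'val2' in a single indexing step; the chained form
    # x['val2'].iloc[0] = ... may write to a temporary and be lost.
    x.iloc[0, x.columns.get_loc('val2')] = np.nan
    return x

df = df.groupby("vintage", group_keys=False).apply(h)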
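
The per-group idea can also be spelled out as an explicit loop over the groups, which avoids relying on how apply reassembles its results. This is only a sketch and not code from the answer; the names parts and result are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'vintage': ['2017-01-01', '2017-01-01', '2017-01-01',
                '2017-02-01', '2017-02-01', '2017-03-01'],
    'val2': [np.nan, 0.66, 0.81, 0.53, 0.62, np.nan],
})

parts = []
for _, group in df.groupby('vintage'):
    group = group.copy()  # leave the original frame untouched
    # Row 0 of this group, column 'val2', set in one positional indexing step.
    group.iloc[0, group.columns.get_loc('val2')] = np.nan
    parts.append(group)

# Stitch the groups back together and restore the original row order.
result = pd.concat(parts).sort_index()

print(result)
```

A Python-level loop like this does one small operation per group, so it is likely to scale worse than the mask-based approaches when there are many groups.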

bmvo0sr5 #4

Timings:

df = pd.DataFrame({
        'vintage': ['2017-01-01', '2017-01-01', '2017-01-01', '2017-02-01', '2017-02-01', '2017-03-01'],
        'date': ['2017-01-01', '2017-02-01', '2017-03-01', '2017-02-01', '2017-03-01', '2017-03-01'],
        'val1': [0.59, 0.68, 0.8, 0.54, 0.61, 0.6],
        'val2': [np.nan, 0.66, 0.81, 0.53, 0.62, np.nan]
    })

def BENY(df):
    df.loc[df.groupby('vintage').vintage.cumcount() == 0, 'val2'] = np.nan
    
def EdChum(df):
    df.loc[df.groupby('vintage')['val2'].head(1).index, 'val2'] = np.nan
    
def knoble(df):
    def func(x):
        x['val2'].iloc[0] = np.nan
        return x
    df.groupby("vintage", group_keys=False).apply(func)

%timeit BENY(df)
406 µs ± 4.19 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
%timeit EdChum(df)
454 µs ± 1.28 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
%timeit knoble(df)
1.07 ms ± 5.55 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
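
The timings above were taken on a six-row frame, so they probably reflect fixed per-call overhead more than scaling behavior. Below is a sketch for repeating the comparison on a larger synthetic frame using the standard timeit module. The frame sizes, the make_frame helper, and the function names are assumptions for illustration, and no results are claimed here.

```python
import timeit

import numpy as np
import pandas as pd


def make_frame(n_groups=1_000, rows_per_group=50, seed=0):
    """Build synthetic data; the sizes are arbitrary assumptions."""
    rng = np.random.default_rng(seed)
    return pd.DataFrame({
        'vintage': np.repeat(np.arange(n_groups), rows_per_group),
        'val2': rng.random(n_groups * rows_per_group),
    })


def with_cumcount(df):
    df.loc[df.groupby('vintage').cumcount() == 0, 'val2'] = np.nan


def with_head(df):
    df.loc[df.groupby('vintage')['val2'].head(1).index, 'val2'] = np.nan


big = make_frame()

# Sanity check: both approaches should blank the same row in every group.
a, b = big.copy(), big.copy()
with_cumcount(a)
with_head(b)
assert a.equals(b)

# Time each approach on a fresh copy so every call starts from the same data.
for fn in (with_cumcount, with_head):
    seconds = timeit.timeit(lambda: fn(big.copy()), number=20)
    print(f"{fn.__name__}: {seconds / 20 * 1e3:.2f} ms per call (includes the copy)")
```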
