Python: delete rows where most columns are NaN

bvuwiixz  posted on 2021-08-25 in Java

I am importing data from Excel. Some rows contain only a note in one column and are not really part of the DataFrame. For example:

H1            H2           H3

* highlighted cols are PII

sam          red           5
pam          blue          3
rod          green         11
 * this is the end of the data

When the file above is imported into dfPA, it looks like this:

dfPA:
Index     H1            H2           H3
1         *highlighted cols are PII
2         sam          red           5
3         pam          blue          3
4         rod          green         11
5         * this is the end of the data

I want to delete the first and last rows. This is what I did.


# get count of cols in df

input: cntcols = dfPA.shape[1]
output: 3

# get count of cols with nan in df

input: a = dfPA.shape[1] - dfPA.count(axis=1)
output: 
0     2
1     3
2     3
4     3
5     2
(where a is a series)

# convert a from series to df

dfa = a.to_frame()

# delete rows where no. of nan's are greater than 'n'

n = 1
for r, row in dfa.iterrows():
    if (cntcols - dfa.iloc[r][0]) > n:
        i = row.name
        dfPA = dfPA.drop(index=i)

This doesn't work. Is there a way to do this?

hgtggwj0  #1

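You should use the pandas.DataFrame.dropna method. It has a thresh parameter that sets the minimum number of non-NaN values a row/column must contain in order to be kept; anything with fewer is dropped.
Imagine the following DataFrame: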

>>> import numpy as np
>>> import pandas as pd
>>> df = pd.DataFrame([[1,np.nan,1,np.nan], [1,1,1,1], [1,np.nan,1,1], [np.nan,1,1,1]], columns=list('ABCD'))
>>> df

     A    B  C    D
0  1.0  NaN  1  NaN
1  1.0  1.0  1  1.0
2  1.0  NaN  1  1.0
3  NaN  1.0  1  1.0

You can drop every column that contains at least one NaN with:

>>> df.dropna(axis=1)

   C
0  1
1  1
2  1
3  1

The thresh parameter defines the minimum number of non-NaN values required to keep the column:

>>> df.dropna(thresh=3, axis=1)

     A  C    D
0  1.0  1  NaN
1  1.0  1  1.0
2  1.0  1  1.0
3  NaN  1  1.0

If you would rather reason in terms of the number of NaN values:


# example for a minimum of 2 NaN to drop the column
# (with axis=1 each column holds len(df) values, so thresh is counted against the number of rows)

>>> df.dropna(thresh=len(df)-(2-1), axis=1)
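The same filter can also be written by counting NaN values directly, which may read more naturally than the thresh arithmetic. This is only a sketch on the example DataFrame above; the names min_nan_to_drop, nan_per_column and kept are made up for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    [[1, np.nan, 1, np.nan], [1, 1, 1, 1], [1, np.nan, 1, 1], [np.nan, 1, 1, 1]],
    columns=list("ABCD"),
)

# Hypothetical setting: drop a column once it has this many NaN values.
min_nan_to_drop = 2

# isna() marks missing cells; sum() counts them per column.
nan_per_column = df.isna().sum()

# Keep only the columns whose NaN count stays below the limit.
kept = df.loc[:, nan_per_column < min_nan_to_drop]
print(kept)
```

On this data the result should match the dropna call with thresh=3 and axis=1, since a column with fewer than 2 NaN out of 4 values has at least 3 non-NaN values.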

If you need to filter rows instead of columns, omit the axis parameter or use axis=0:

>>> df.dropna(thresh=3)
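Applied to the question, the row filter might look like the sketch below. The dfPA built here is a hypothetical reconstruction of the imported sheet (it assumes the note rows fill only the first column and leave the others as NaN), and max_nan is an assumed tolerance, not something stated in the question.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the DataFrame read from Excel.
dfPA = pd.DataFrame(
    {
        "H1": [
            "* highlighted cols are PII",
            "sam",
            "pam",
            "rod",
            "* this is the end of the data",
        ],
        "H2": [np.nan, "red", "blue", "green", np.nan],
        "H3": [np.nan, 5, 3, 11, np.nan],
    }
)

# Assumed tolerance: a row may contain at most this many NaN values.
max_nan = 1

# thresh counts non-NaN values, so convert the NaN limit accordingly:
# a row is kept when it has at least (number of columns - max_nan) values.
cleaned = dfPA.dropna(thresh=dfPA.shape[1] - max_nan)
print(cleaned)
```

dropna returns a new DataFrame, so the result has to be assigned; no explicit loop over the rows is needed.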
