regex: remove a string from a column when the string meets a condition

ekqde3dh  posted 12 months ago  in  Other

I want to remove a string from a string column when that string contains lowercase letters (a cell may be NaN, and a row may contain several strings).
| Column2 | Column3 | Column3 |
| -- | -- | ------------ |
| NaN | NaN | NaN |
| NaN | NaN | NaN |
| NaN | NaN | NaN |
| BCSTACK | BCTENSORFLOW | BCTENSORFLOW |
| OVERFLOW | NaN | NaN |
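The original df looks like the table above.
I tried using `str.contains` to locate the strings that contain lowercase letters and replace them.
Because I assumed the str functions cannot be used on NaN values, I first replaced the NaN values with the string 'nan',
and then used a regular expression to replace everything containing lowercase letters. Since 'nan' is itself lowercase, it should be replaced as well.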

df['Column1'].fillna('nan',inplace=True)
df['Column2'].fillna('nan',inplace=True)
df['Column3'].fillna('nan',inplace=True)

lowerletterpattern = r'[a-z]*'

mask1 = df['Column1'].str.contains(lowerletterpattern)
df.loc[mask1,'Column1'] = np.nan

mask2 = df['Column2'].str.contains(lowerletterpattern)
df.loc[mask2,'Column2'] = np.nan

mask3 = df['Column3'].str.contains(lowerletterpattern)
df.loc[mask3,'Column3'] = np.nan

But the resulting df contains nothing but NaN values.
Here is the expected result:
| Column2 | Column3 | Column3 |
| -- | -- | ------------ |
| NaN | NaN | NaN |
| NaN | NaN | NaN |
| NaN | NaN | NaN |
| BCSTACK | BCTENSORFLOW | BCTENSORFLOW |
| OVERFLOW | NaN | NaN |

qoefvg9y  #1

One option is to check for [a-z] with str.contains and then mask the matches:

out = df.mask(df.apply(lambda s: s.str.contains('[a-z]').fillna(True)))

Alternatively, use replace:

out = df.replace('.*[a-z].*', float('nan'), regex=True)


Output:

            Column1   Column2       Column3
0               NaN       NaN           NaN
1  BCDE8ENGUGUETNJN       NaN           NaN
2               NaN       NaN           NaN
3               NaN   BCSTACK  BCTENSORFLOW
4               NaN  OVERFLOW           NaN
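
As for why the attempt in the question turned every cell into NaN: the pattern `[a-z]*` also matches zero characters, so `str.contains` reports a match for every string, including all-uppercase ones. Here is a minimal sketch of the difference, assuming a small made-up Series (`s` and its values are illustrative, not the asker's data):

```python
import re

import numpy as np
import pandas as pd

# Hypothetical sample data, not the asker's original frame.
s = pd.Series(["BCSTACK", "Overflow", np.nan])

# `[a-z]*` can match the empty string, so the search succeeds even here.
print(re.search(r"[a-z]*", "BCSTACK") is not None)

# The question's approach: fill NaN with 'nan', then test with `[a-z]*`.
mask_star = s.fillna("nan").str.contains(r"[a-z]*")

# Requiring at least one lowercase letter gives the intended mask;
# na=True marks missing cells as matches, so they remain NaN after masking.
mask_one = s.str.contains(r"[a-z]", na=True)

cleaned = s.mask(mask_one)
print(mask_star.tolist())
print(mask_one.tolist())
print(cleaned.tolist())
```

Dropping the `*` (or using `[a-z]+`) should be enough to fix the original code, and passing `na=True` makes the `fillna('nan')` step unnecessary.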

bkkx9g8r  #2

Another solution, using Series.where:

df = df.apply(
    lambda row: row.where(~row.str.contains(r"[a-z]").astype(bool)),
    axis=1,
)
print(df)

Prints:

            Column1   Column2       Column3
0               NaN       NaN           NaN
1  BCDE8ENGUGUETNJN       NaN           NaN
2               NaN       NaN           NaN
3               NaN   BCSTACK  BCTENSORFLOW
4               NaN  OVERFLOW           NaN

mm5n2pyu  #3

You can use str.isupper to avoid a regular expression:

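# np.nan replaces np.NaN (removed in NumPy 2.0); on pandas >= 2.1, DataFrame.map supersedes applymap
df = df.applymap(lambda x: x if isinstance(x, str) and x.isupper() else np.nan)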
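
One caveat worth checking before choosing this approach: `str.isupper()` is not quite the same test as "contains no lowercase letter", because it returns False for strings that have no cased characters at all, such as digits only. The sketch below compares the two on made-up data (`df`, `keep_if_upper` and `keep_if_no_lowercase` are hypothetical names introduced for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical sample data, chosen to show where the two checks disagree.
df = pd.DataFrame({
    "Column1": ["BCSTACK", "Overflow", np.nan],
    "Column2": ["12345", "BC8", "bc"],
})

def keep_if_upper(x):
    # str.isupper() needs at least one cased character, so "12345" fails it.
    return x if isinstance(x, str) and x.isupper() else np.nan

def keep_if_no_lowercase(x):
    # Keeps any string without an ASCII lowercase letter, including "12345".
    if isinstance(x, str) and not any("a" <= c <= "z" for c in x):
        return x
    return np.nan

# Series.map inside DataFrame.apply runs element-wise and sidesteps the
# applymap / DataFrame.map naming difference between pandas versions.
by_isupper = df.apply(lambda col: col.map(keep_if_upper))
by_no_lower = df.apply(lambda col: col.map(keep_if_no_lowercase))

print(by_isupper)
print(by_no_lower)
```

If the columns only ever hold alphabetic codes like the ones in the question, both checks should agree; the difference only matters for cells made up purely of digits or punctuation.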

