Pandas DataFrame - split sentences on NaN values

7uzetpgm  posted on 2023-06-20  in  Other

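I have a Pandas DataFrame with one word per row in a column named "Word". Sentences are separated by a blank line in the file, so I pass skip_blank_lines=False to keep each separator as a row of NaN values.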

import pandas as pd

df = pd.read_csv("Data-June-2023.txt", sep=" ", skip_blank_lines=False)
df.tail(20)

Index   Word    _   _   Tag

0   I   _   _   O
1   am  _   _   O
2   from    _   _   O
3   Madrid  _   _   B-City
4   NaN   NaN  NaN  NaN
5   Alice   _   _   B-Person
6   likes   _   _   O
7   Bob _   _   B-Person
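
For readers without the original file, the following is a minimal sketch of how the blank separator line becomes a NaN row. The inline text stands in for "Data-June-2023.txt" and is an assumption based on the table above, not the asker's actual data.

```python
import io

import pandas as pd

# Hypothetical stand-in for the file contents: one token per line,
# with an empty line between the two sentences.
text = (
    "Word _ _ Tag\n"
    "I _ _ O\n"
    "am _ _ O\n"
    "from _ _ O\n"
    "Madrid _ _ B-City\n"
    "\n"
    "Alice _ _ B-Person\n"
    "likes _ _ O\n"
    "Bob _ _ B-Person\n"
)

# skip_blank_lines=False keeps the empty line as a row of NaN values
# instead of dropping it, so the sentence boundary stays visible.
df = pd.read_csv(io.StringIO(text), sep=" ", skip_blank_lines=False)

# Boolean flags: True only on the separator row.
is_separator = df["Word"].isna()
```

With the default skip_blank_lines=True the empty line would be dropped and the two sentences would run together.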

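I want to create a new column named "Sentence #" by iterating over the blank lines (NaN values). Each NaN in "Word" should start a new sentence and increment the counter: Sentence: 1, Sentence: 2, Sentence: 3, and so on.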

Index   Sentence #  Word    _   _   Tag

0   Sentence: 1 I   _   _   O
1               am  _   _   O
2               from    _   _   O
3               Madrid  _   _   B-City
4               NaN NaN NaN NaN
5   Sentence: 2 Alice   _   _   B-Person
6               likes   _   _   O
7               Bob _   _   B-Person
8               NaN NaN NaN NaN
9   Sentence: 3 Alice   _   _   B-Person

Thanks for your help.

szqfcxe2  #1

Use boolean indexing:

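# True on the first row and on each row that directly follows a NaN in "Word"
m = df['Word'].isna().shift(fill_value=True)
# Write "Sentence: N" only on those rows; all other rows stay NaN
df.loc[m, 'Sentence'] = m.cumsum().astype(str).radd('Sentence: ')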

Output:

   Index    Word    _    _       Tag     Sentence
0      0       I    _    _         O  Sentence: 1
1      1      am    _    _         O          NaN
2      2    from    _    _         O          NaN
3      3  Madrid    _    _    B-City          NaN
4      4     NaN  NaN  NaN       NaN          NaN
5      5   Alice    _    _  B-Person  Sentence: 2
6      6   likes    _    _         O          NaN
7      7     Bob    _    _  B-Person          NaN
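
The same idea can be tried without the input file. This is a minimal sketch, not part of the original answer: the sample frame is rebuilt by hand from the table in the question, the column is named "Sentence #" as the question requested, and the names starts, labels and sentence_id are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hand-built sample mirroring the question's data (the "_" columns are omitted).
df = pd.DataFrame({
    "Word": ["I", "am", "from", "Madrid", np.nan, "Alice", "likes", "Bob"],
    "Tag": ["O", "O", "O", "B-City", np.nan, "B-Person", "O", "B-Person"],
})

# True on the first row and on every row that directly follows a NaN separator.
starts = df["Word"].isna().shift(fill_value=True)

# Running count of sentence starts, rendered as "Sentence: N".
labels = "Sentence: " + starts.cumsum().astype(str)

# Keep the label only on the first word of each sentence; blank elsewhere,
# matching the layout shown in the question.
df["Sentence #"] = labels.where(starts, "")

# Optional variant: a numeric id on every word row (NaN on separator rows),
# which may be more convenient for groupby than the sparse label column.
df["sentence_id"] = starts.cumsum().where(df["Word"].notna())
```

Note that if the data contains two consecutive NaN rows, the second one directly follows a NaN and is therefore also counted as a sentence start, so the numbering would skip a value.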
