Split several sentences in a pandas DataFrame

Posted by ffscu2ro on 2023-01-01 in Other


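I have a pandas DataFrame with a column that looks like this:
| sentences |
| --------- |
| ["This is text.", "This is another text.", "This is also text.", "Even more text."] |
| ['This is the same in another row.', 'Another row another text.', 'Text in second row.', 'Last text in second row.'] |
Each row holds 10 sentences, each wrapped in ' ' or " " and separated by commas. The column type is "str", and I have not been able to convert it into a list of strings.
I want to transform the values of this DataFrame so that they look like this: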

[['This', 'is', 'text'], ['This', 'is', 'another', 'text'], ['This', 'is', 'also', 'text'], ['Even', 'more', 'text']]

I tried something like this:

new_splits = []
for num in range(len(refs)):
    # replace every space with "', '"
    komma = refs[num].replace(" ", "', '")  # regex=True)
    new_splits.append(komma)

And this:

new_splits = []
for num in range(len(refs)):
    # split each row on the "', '" separator between sentences
    splitted = refs[num].split("', '")
    new_splits.append(splitted)

Disclaimer: I need this to evaluate a BLEU score and have not found a way to do that for this kind of dataset. Thanks in advance!

pandas

Source: https://stackoverflow.com/questions/74943838/split-several-sentences-in-pandas-dataframe

3 Answers


z9ju0rcb1#

You can use np.char.split in one line:

import numpy as np
df['separated'] = np.char.split(df['sentences'].tolist()).tolist()

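@Kata If, as you say, the column type of sentences is str, meaning that each row holds a string rather than a list, e.g. "['This is text.', 'This is another text.', 'This is also text.', 'Even more text.']", then you need to convert the strings to lists first. One way is to use ast.literal_eval.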

from ast import literal_eval
df['sentences'] = df['sentences'].apply(literal_eval)
df['separated'] = np.char.split(df['sentences'].tolist()).tolist()

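A note on the data: storing data this way is not recommended. If possible, fix the source of the data. Each cell should preferably hold a plain string rather than a list, or at least an actual list rather than a string that represents a list.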
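The conversion described above can be tried on a single cell without building a DataFrame. The following is only a minimal sketch using the standard library: `tokenize_cell` is a hypothetical helper name, and the `\w+` pattern is an assumption chosen to drop the trailing periods, which a whitespace-only split such as np.char.split leaves attached to the last word of each sentence.

```python
import re
from ast import literal_eval


def tokenize_cell(cell):
    """Turn a string like "['This is text.', 'Even more text.']" into word lists."""
    # literal_eval safely parses the string into a real Python list of str
    sentences = literal_eval(cell)
    # \w+ keeps runs of letters, digits and underscores, so punctuation is dropped
    # (assumption: contractions such as "don't" would be split in two)
    return [re.findall(r"\w+", sentence) for sentence in sentences]


cell = """['This is text.', "This is another text.", 'Even more text.']"""
tokens = tokenize_cell(cell)
print(tokens)
```

Assuming the column is named sentences as in the question, the helper could be applied to every row with df['sentences'].apply(tokenize_cell).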

2023-01-01

njthzxwz2#

You can use the apply method on your DataFrame. If, as you say, there are 10 sentences per row, you can groupby every 10 sentences like this:

import pandas as pd

group_labels = [i // 10 for i in range(len(df))]

grouped = df.groupby(group_labels)

result = grouped['sentences'].apply(lambda x: list(x))

print(result)
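To make the labelling trick above easier to inspect, here is a small standard-library sketch of the same idea. It assumes the sentences sit one per row in a flat sequence, which is what grouping by `i // 10` operates on; the chunk size of 3 and the placeholder sentences are made up for brevity.

```python
from itertools import groupby

sentences = [f"sentence {i}" for i in range(7)]  # made-up stand-in for one column
chunk_size = 3  # the answer above uses 10

# Integer division gives the same label to every run of chunk_size rows
group_labels = [i // chunk_size for i in range(len(sentences))]

# Collect the sentences that share a label into one list per group
grouped = [
    [sentence for _, sentence in pairs]
    for _, pairs in groupby(zip(group_labels, sentences), key=lambda pair: pair[0])
]

print(group_labels)
print(grouped)
```

The last group is simply shorter when the number of rows is not a multiple of the chunk size.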

2023-01-01

hrysbysz3#

With the DataFrame df, you could try the following:

df["splitted"] = (
    df["sentences"]
    .str.strip("[]\'\"").str.split("\'. \'|\'. \"|\". \'|\". \"")
    .explode()
    .str.findall(r"\b([^ ]+?)\b")
    .groupby(level=0).agg(list)
)

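First, .strip removes [, ], ' and " from the beginning and the end of each row.
Then .split turns the row into a list of sentences.
.explode the resulting column, and use .findall to extract the words of each sentence into a list.
Finally, group the word lists that belong to the same row back into one list.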
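The steps above can be followed on a single string with the standard library, which makes each intermediate value easy to print. This is a sketch rather than the answer's exact code: the separator pattern here matches a literal comma between the quotes, whereas the answer's pattern uses `.` (any character) in that position, and the sample row is shortened.

```python
import re

row = """["This is the same in another row.", 'Another row another text.', 'Last text in second row.']"""

# Step 1: drop the outer brackets and quotes, like .str.strip
stripped = row.strip("[]'\"")

# Step 2: split on quote, comma, space, quote, like .str.split
sentences = re.split(r"""['"], ['"]""", stripped)

# Step 3: extract the words of each sentence, like .str.findall
words = [re.findall(r"\b([^ ]+?)\b", sentence) for sentence in sentences]

print(sentences)
print(words)
```

In the pandas version, .explode and .groupby(level=0).agg(list) play the role of the list comprehension: they spread the sentences over rows and then collect them again per original row.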

The result df["splitted"] for

df = pd.DataFrame({
    "sentences": [
        """['This is text.', 'This is another text.', 'This is also text.', 'Even more text.']""",
        """["This is the same in another row.", 'Another row another text.', 'Text in second row.', 'Last text in second row.']"""
    ]
})

is

0  [['This', 'is', 'text'], ['This', 'is', 'another', 'text'], ['This', 'is', 'also', 'text'], ['Even', 'more', 'text']]
1  [['This', 'is', 'the', 'same', 'in', 'another', 'row'], ['Another', 'row', 'another', 'text'], ['Text', 'in', 'second', 'row'], ['Last', 'text', 'in', 'second', 'row']]

2023-01-01
