How to calculate the number of words in a string in a DataFrame? [duplicate]

Posted by zfciruhq on 2023-04-19 in Python


This question already has answers here:

Count number of words per row (6 answers)
Closed 4 years ago.
Suppose we have a simple DataFrame:

import pandas as pd

df = pd.DataFrame([
    'one apple',
    'banana',
    'box of oranges',
    'pile of fruits outside',
    'one banana',
    'fruits'])
df.columns = ['fruits']

How can I calculate the number of words in each keyword and count how many rows have each word count?

Source: https://stackoverflow.com/questions/37483470/how-to-calculate-number-of-words-in-a-string-in-dataframe

2 Answers


8zzbczxx #1

IIUC, you can do the following:

In [89]:
count = df['fruits'].str.split().apply(len).value_counts()
count.index = count.index.astype(str) + ' words:'
count.sort_index(inplace=True)
count

Out[89]:
1 words:    2
2 words:    2
3 words:    1
4 words:    1
Name: fruits, dtype: int64

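Here we use the vectorised str.split to split on whitespace, then apply len to get the number of elements in each row, and then call value_counts to aggregate the frequency counts.
We then rename the index and sort it to get the desired output.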
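One caveat with sorting after relabelling: the index is then sorted as strings, so a label such as '10 words:' would come before '2 words:'. A minimal sketch, assuming numeric order is wanted, sorts first and relabels afterwards. The names word_counts and freq are illustrative and do not come from the answer.

```python
import pandas as pd

df = pd.DataFrame({'fruits': [
    'one apple',
    'banana',
    'box of oranges',
    'pile of fruits outside',
    'one banana',
    'fruits']})

# Number of whitespace-separated tokens in each row.
word_counts = df['fruits'].str.split().str.len()

# Frequency of each word count, sorted while the index is still numeric.
freq = word_counts.value_counts().sort_index()

# Relabel only after sorting, so '10 words:' cannot sort before '2 words:'.
freq.index = freq.index.astype(str) + ' words:'
print(freq)
```

For the six-row frame in the question the two orderings agree; the difference only appears once a row has ten or more words.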

UPDATE

This can also be done using str.len rather than apply, which should scale better:

In [41]:
count = df['fruits'].str.split().str.len().value_counts()
count.index = count.index.astype(str) + ' words:'
count.sort_index(inplace=True)
count

Out[41]:
1 words:    2
2 words:    2
3 words:    1
4 words:    1
Name: fruits, dtype: int64
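Apart from speed, the two variants differ when a row is missing. The sketch below uses a small hypothetical Series with a None entry (not data from the question) to show that str.len passes the gap through as NaN, while apply(len) raises a TypeError.

```python
import pandas as pd

# Hypothetical data with one missing value.
s = pd.Series(['one apple', None, 'box of oranges'])

tokens = s.str.split()

# .str.len() propagates the missing value as NaN instead of raising.
lengths = tokens.str.len()
print(lengths)

# apply(len) calls len() on the missing value itself, which fails.
try:
    tokens.apply(len)
    apply_failed = False
except TypeError:
    apply_failed = True

print('apply(len) raised TypeError:', apply_failed)
```

If missing rows should count as zero words, calling fillna(0) on the result of str.len is one option.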

Timings

In [42]:
%timeit df['fruits'].str.split().apply(len).value_counts()
%timeit df['fruits'].str.split().str.len()

1000 loops, best of 3: 799 µs per loop
1000 loops, best of 3: 347 µs per loop

For a 6K-row df:

In [51]:
%timeit df['fruits'].str.split().apply(len).value_counts()
%timeit df['fruits'].str.split().str.len()

100 loops, best of 3: 6.3 ms per loop
100 loops, best of 3: 6 ms per loop
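The figures above come from IPython's %timeit and will differ by machine and pandas version. A rough sketch for repeating the comparison in a plain script with the standard timeit module follows. The 6000-row frame is built by repeating the six sample strings, which is an assumption about how the larger frame was made, and both variants call value_counts so they do the same amount of work.

```python
import timeit

import pandas as pd

base = ['one apple', 'banana', 'box of oranges',
        'pile of fruits outside', 'one banana', 'fruits']

# Assumed construction of the larger frame: repeat the sample rows 1000 times.
big = pd.DataFrame({'fruits': base * 1000})


def with_apply():
    return big['fruits'].str.split().apply(len).value_counts()


def with_str_len():
    return big['fruits'].str.split().str.len().value_counts()


# Best of 3 repeats, 10 calls each, reported as seconds per call.
t_apply = min(timeit.repeat(with_apply, number=10, repeat=3)) / 10
t_str_len = min(timeit.repeat(with_str_len, number=10, repeat=3)) / 10

print(f'apply(len): {t_apply * 1e3:.2f} ms per call')
print(f'str.len():  {t_str_len * 1e3:.2f} ms per call')
```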


vatpfxk5 #2

You can use str.count with a space ' ' as the delimiter.

In [1716]: count = df['fruits'].str.count(' ').add(1).value_counts(sort=False)

In [1717]: count.index = count.index.astype('str') + ' words:'

In [1718]: count
Out[1718]:
1 words:    2
2 words:    2
3 words:    1
4 words:    1
Name: fruits, dtype: int64
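Counting single spaces assumes every string is separated by exactly one space with no leading or trailing whitespace. The sketch below uses hypothetical messy input (not from the question) to show where the count drifts, and one possible adjustment that strips the string and counts runs of whitespace instead.

```python
import pandas as pd

# Hypothetical input with a double space and surrounding spaces.
s = pd.Series(['one  apple', ' banana ', 'box of oranges'])

# Spaces + 1 overcounts when whitespace is irregular.
by_space = s.str.count(' ').add(1)

# str.split() with no argument splits on any run of whitespace.
by_split = s.str.split().str.len()

# Possible adjustment: strip, then count runs of whitespace (regex) + 1.
by_runs = s.str.strip().str.count(r'\s+').add(1)

print(pd.DataFrame({'by_space': by_space,
                    'by_split': by_split,
                    'by_runs': by_runs}))
```

On clean input such as the question's frame all three agree. Note that the count-based variants still report one word for an empty string, whereas the split-based one reports zero.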

Timings

str.count is marginally faster

Small

In [1724]: df.shape
Out[1724]: (6, 1)

In [1725]: %timeit df['fruits'].str.count(' ').add(1).value_counts(sort=False)
1000 loops, best of 3: 649 µs per loop

In [1726]: %timeit df['fruits'].str.split().apply(len).value_counts()
1000 loops, best of 3: 840 µs per loop

Medium

In [1728]: df.shape
Out[1728]: (6000, 1)

In [1729]: %timeit df['fruits'].str.count(' ').add(1).value_counts(sort=False)
100 loops, best of 3: 6.58 ms per loop

In [1730]: %timeit df['fruits'].str.split().apply(len).value_counts()
100 loops, best of 3: 6.99 ms per loop

Large

In [1732]: df.shape
Out[1732]: (60000, 1)

In [1733]: %timeit df['fruits'].str.count(' ').add(1).value_counts(sort=False)
1 loop, best of 3: 57.6 ms per loop

In [1734]: %timeit df['fruits'].str.split().apply(len).value_counts()
1 loop, best of 3: 73.8 ms per loop
