pandas: trying to count the first occurrence of each word in each row of a Series

Posted by m1m5dgzv on 2023-08-01 in Other


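I have more than 100 rows of survey data in which respondents typed their answers to open-ended questions. To analyze the answers, I want to create a word cloud for each question. My idea is to find the unique words in each response and then add them together to see which words respondents used most. For example, if respondent 1 says "angry" and respondent 2 says "angry", concatenating the two lists results in the word "angry" appearing twice. What I want to avoid is one person skewing the data by using a word many times. If someone says "angry" 100 times but no one else uses it, that word does not represent the sentiment of all respondents. I will then feed this data into a word cloud program.
First, I imported the Excel data into a pandas DataFrame and then isolated the responses to the first question: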

import pandas as pd
import matplotlib.pyplot as plt
from wordcloud import WordCloud, STOPWORDS

df = pd.read_excel("./Data - Open Ended Questions.xlsx", sheet_name="Working Data")

df1 = df['question1']

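Next, I tried to iterate over each response in the Series. The program below returns the unique words in the first response, but I'm not sure how to get the results for the other responses. I think I need to create a list for each response and append them together, but I'm not quite sure how to do that.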

def words(df):
    l = list()
    
    for index, response in df.items():
            response = response.lower()
            words = response.split()
            
            for word in words:
                if word in l:
                    l
                else:
                    l.append(word)
            return l

words(df1)

User Zombro pointed me in the right direction with their answer. My final solution is below.

import string

def unique_words_per_response(series):
    # convert to lowercase to combine words like "The" and "the"
    series = series.str.lower()

    # remove punctuation using the string module
    series = series.str.translate(str.maketrans("", "", string.punctuation))

    # Zombro's solution, with unique() so each word is kept once per response
    series = series.apply(lambda text: pd.Series(text.split()).unique()).explode()

    # capitalize the first letter of each word so it looks better in the word cloud
    series = series.str.capitalize()

    # join words with spaces to create one single string
    series = ' '.join(series.tolist())

    return series
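
A variation on the function above, offered only as a sketch: instead of joining the words into one string, it returns how many responses contain each word. The function name `word_response_counts` and the sample data are made up for illustration, and the sketch assumes missing answers (NaN) should simply be skipped, which the original post does not discuss.

```python
import string

import pandas as pd


def word_response_counts(series: pd.Series) -> pd.Series:
    """Count, for each word, how many responses contain it at least once."""
    table = str.maketrans("", "", string.punctuation)
    # Assumption: empty answers are dropped rather than counted.
    cleaned = series.dropna().astype(str).str.lower().str.translate(table)
    # set() keeps one copy of each word per response.
    per_response = cleaned.apply(lambda text: sorted(set(text.split())))
    # explode() gives one row per word; responses with no words become NaN.
    return per_response.explode().dropna().value_counts()


# Hypothetical sample data, not taken from the survey.
sample = pd.Series(["Angry, angry, ANGRY!", "I am angry.", "Happy and calm", None])
counts = word_response_counts(sample)
print(counts)
```

Here "angry" is counted twice, once for each response that mentions it, no matter how often it is repeated inside a single response. As far as I know, the wordcloud package can also build a cloud from such a mapping via `WordCloud().generate_from_frequencies(counts.to_dict())`, which avoids the join step.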


pandas

Source: https://stackoverflow.com/questions/76782852/trying-to-count-the-first-occurrence-of-each-word-in-each-row-of-a-series

3 Answers


wvt8vs2t #1

Are you looking for a solution like this? It would help to clarify your question with a sample dataset.

import pandas as pd

s = pd.Series([
    "example response 1",
    "another response",
    "another example response",
    "vanilla bean"
])

s.apply(lambda s: s.split()).explode().value_counts()

>>> 
response    3
example     2
another     2
1           1
vanilla     1
bean        1
dtype: int64

I think the key to this solution is denormalizing the data and then leveraging Series.explode: https://pandas.pydata.org/docs/reference/api/pandas.Series.explode.html
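
To make the denormalize-then-explode idea concrete, here is a small step-by-step sketch. The sample strings and the variable names (`tokens`, `long`, `pairs`) are my own, and the `drop_duplicates` step is one possible way to ignore repeats inside a single response, not necessarily what the answer intended.

```python
import pandas as pd

s = pd.Series(["red red blue", "blue green"])

tokens = s.str.split()                  # one list of words per response
long = tokens.explode().rename("word")  # one word per row; the response's index label repeats

# Keep one row per (response, word) pair so repeats inside a response are ignored.
pairs = long.reset_index().drop_duplicates()
counts = pairs["word"].value_counts()
print(counts)
```

After `explode`, the index still records which response each word came from, which is what lets the duplicates be dropped per response before counting.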

Answered 2023-08-01

bsxbgnwa #2

You can get the first word from each row:

df['question1'].apply(lambda x:x.split()[0])
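
One caveat with the line above: `x.split()[0]` raises an error on an empty string or a missing value. A possible alternative, sketched here on made-up data, uses the `.str` accessor, which returns a missing value in those cases instead of failing.

```python
import pandas as pd

# Hypothetical responses, including an empty string and a missing value.
s = pd.Series(["hello world", "", None, "  pandas is great"])

# .str.split() splits on whitespace; .str[0] takes the first element
# and yields a missing value where there is none.
first = s.str.split().str[0]
print(first)
```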

Edit 1:
Based on the OP's comment, to get the unique words without punctuation:

import string

import pandas as pd

data = {'q1': ["Hello, world!", 
               "This is a sample sample sentence.", 
               "Pandas Pandas is great!"]}
df = pd.DataFrame(data)
# translation table that deletes every ASCII punctuation character
translator = str.maketrans('', '', string.punctuation)
unique_words = set(" ".join(df['q1'].str.lower().str.translate(translator).values).split())


Output:

{'hello', 'pandas', 'this', 'a', 'great', 'is', 'sample', 'sentence', 'world'}
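
For reference, a minimal sketch of what the translation table does on plain strings. The example strings are mine. Note that `string.punctuation` only contains ASCII punctuation, so typographic quotes pasted from a word processor are left in place.

```python
import string

# Third argument of str.maketrans lists the characters to delete.
table = str.maketrans("", "", string.punctuation)

cleaned_ascii = "Hello, world!".translate(table)
cleaned_apostrophe = "don't".translate(table)
cleaned_curly = "“quoted”".translate(table)  # curly quotes are not ASCII punctuation

print(cleaned_ascii)       # Hello world
print(cleaned_apostrophe)  # dont
print(cleaned_curly)       # “quoted”
```

If the survey answers contain such characters, they would need to be added to the table explicitly.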


Answered 2023-08-01

chhkpiq4 #3

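Yes, to create a word cloud for each question you need to process each response separately and then combine the results. First, move the return statement outside the loop so that it actually returns all of the responses, not just the first one. Now, let's modify your code piece by piece: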

for index, response in df.items():

Looking at the loop itself, this iterates over the items in df.

response = response.lower()

Inside the loop, we convert the response to lowercase so that the word frequency count is case-insensitive.

words = response.split()
unique_words.append(set(words))

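After splitting the response into a list of words, we append the result to the list variable (l in your code, called unique_words here). Wrapping the words in set() removes the duplicates within a response.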

combined_words_set = set().union(*unique_words)

Once we have the words, a little manipulation is needed: specifically, we combine all of the per-response sets into a single set.
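
A tiny illustration of that union step, using invented sets rather than the survey data:

```python
# Hypothetical per-response word sets.
per_response = [{"angry", "tired"}, {"angry", "calm"}, set()]

# The * unpacks the list so union() receives each set as a separate argument.
combined = set().union(*per_response)

# With an empty list, the result is just the empty set.
combined_empty = set().union(*[])

print(sorted(combined))
```

Starting from `set()` means the call also works when there are no responses at all.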

unique_words_list = list(combined_words_set)

After that, we reach the final result by converting the set into a list. That's all I have to say; I hope it helps.
The code should look like this:

def words(df):
    unique_words = []  # renamed from l to unique_words for clarity

    for index, response in df.items():
        # lowercase and split, then keep each word once per response
        response = response.lower()
        words = response.split()
        unique_words.append(set(words))

    # merge the per-response sets into one set, then convert it to a list
    combined_words_set = set().union(*unique_words)

    unique_words_list = list(combined_words_set)

    return unique_words_list

unique_words_question1 = words(df1)
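
Note that the function above returns each word once overall, so the information about how many respondents used a word is lost. If that count is what the word cloud should reflect, one possible adjustment (a sketch using the standard library's `collections.Counter`; the name `response_counts` and the sample data are hypothetical) is to update a counter with each response's set:

```python
from collections import Counter


def response_counts(responses):
    """Count how many responses contain each word at least once."""
    counts = Counter()
    for response in responses:
        # set() means a word repeated within one response adds only 1.
        counts.update(set(response.lower().split()))
    return counts


counts = response_counts(["Angry angry angry", "angry and tired", "calm"])
print(counts.most_common(1))  # [('angry', 2)]
```

Passing a pandas Series such as df1 should work the same way, since iterating a Series yields its values, provided there are no missing answers.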


Answered 2023-08-01
