Counting and analyzing emojis with Python pandas

nimxete2, posted 2023-06-28 in Python

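I am working on a sentiment analysis topic, and many of the comments contain emojis.
I would like to know whether my code is correct, and whether there is a way to optimize it.

Smiley counting code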

import pandas as pd
import regex as re
import emoji

# Assuming your DataFrame is called 'df' and the column with comments is 'Document'
comments = df['Document']

# Initialize an empty dictionary to store smiley counts and types
smiley_data = {'Smiley': [], 'Count': [], 'Type': []}

# Define a regular expression pattern to match smileys
pattern = r'([\U0001F600-\U0001F64F\U0001F300-\U0001F5FF\U0001F680-\U0001F6FF\U0001F1E0-\U0001F1FF])'

# Iterate over the comments
for comment in comments:
    # Extract smileys and their types from the comment
    smileys = re.findall(pattern, comment)
    
    # Increment the count and store the smileys and their types
    for smiley in smileys:
        if smiley in smiley_data['Smiley']:
            index = smiley_data['Smiley'].index(smiley)
            smiley_data['Count'][index] += 1
        else:
            smiley_data['Smiley'].append(smiley)
            smiley_data['Count'].append(1)
            smiley_data['Type'].append(emoji.demojize(smiley))
            
# Create a DataFrame from the smiley data
smiley_df = pd.DataFrame(smiley_data)

# Sort the DataFrame by count in descending order
smiley_df = smiley_df.sort_values(by='Count', ascending=False)

# Print the smiley data
smiley_df

I am mainly unsure whether the code block below captures all the smileys:

# Define a regular expression pattern to match smileys
pattern = r'([\U0001F600-\U0001F64F\U0001F300-\U0001F5FF\U0001F680-\U0001F6FF\U0001F1E0-\U0001F1FF])'

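I would also like to know what else I can do with this analysis, for example adding a chart or something similar on top of it.
I have also shared a test dataset that generates smiley counts similar to those in my real data. Note that the test dataset only contains the known smileys; anything else that might appear in the real dataset will not be present.

Test dataset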

import random
import pandas as pd

smileys = ['👍', '👌', '😍', '🏻', '😊', '🙂', '👎', '😃', '🏼', '💩']

# Additional smileys to complete the required count
additional_smileys = ['😄', '😎', '🤩', '😘', '🤗', '😆', '😉', '😋', '😇', '🥳', '🙌', '🎉', '🔥', '🥰', '🤪', '😜', '🤓',
                      '😚', '🤭', '🤫', '😌', '🥱', '🥶', '🤮', '🤡', '😑', '😴', '🙄', '😮', '🤥', '😢', '🤐', '🙈', '🙊',
                      '👽', '🤖', '🦄', '🐼', '🐵', '🦁', '🐸', '🦉']

# Combine the required smileys and additional smileys
all_smileys = smileys + additional_smileys

# Set a random seed for reproducibility
random.seed(42)

# Generate a single review
def generate_review(with_smiley=False):
    review = "This movie"
    if with_smiley:
        review += " " + random.choice(all_smileys)
    review += " is "
    review += random.choice(["amazing", "excellent", "fantastic", "brilliant", "great", "good", "okay", "average",
                             "mediocre", "disappointing", "terrible", "awful", "horrible"])
    review += random.choice(["!", "!!", "!!!", ".", "..", "..."]) + " "
    review += random.choice(["Highly recommended", "Definitely worth watching", "A must-see", "I loved it",
                             "Not worth your time", "Skip it"]) + random.choice(["!", "!!", "!!!"])
    return review

# Generate the random dataset
def generate_dataset():
    dataset = []
    review_count = 5000

    # Generate reviews with top smileys
    for smiley, count, _ in top_smileys:
        while count > 0:
            review = generate_review(with_smiley=True)
            if smiley in review:
                dataset.append(review)
                count -= 1

    # Generate reviews with additional smileys
    additional_smileys_count = len(additional_smileys)
    additional_smileys_per_review = review_count - len(dataset)
    additional_smileys_per_review = min(additional_smileys_per_review, additional_smileys_count)

    for _ in range(additional_smileys_per_review):
        review = generate_review(with_smiley=True)
        dataset.append(review)

    # Generate reviews without smileys
    while len(dataset) < review_count:
        review = generate_review()
        dataset.append(review)

    # Shuffle the dataset
    random.shuffle(dataset)
    return dataset

# List of top smileys and their counts
top_smileys = [
    ('👍', 331, ':thumbs_up:'),
    ('👌', 50, ':OK_hand:'),
    ('😍', 41, ':smiling_face_with_heart-eyes:'),
    ('🏻', 38, ':light_skin_tone:'),
    ('😊', 35, ':smiling_face_with_smiling_eyes:'),
    ('🙂', 14, ':slightly_smiling_face:'),
    ('👎', 12, ':thumbs_down:'),
    ('😃', 12, ':grinning_face_with_big_eyes:'),
    ('🏼', 10, ':medium-light_skin_tone:'),
    ('💩', 10, ':pile_of_poo:')
]

# Generate the dataset
dataset = generate_dataset()

# Create a data frame with 'Document' column
df = pd.DataFrame({'Document': dataset})

# Display the DataFrame
df

Thank you for your feedback.

bqjvbblv #1

You can use str.extractall to avoid the loop, then value_counts to count the occurrences. Finally, "demojize" each smiley (this is the slowest part):

out = (df['Document'].str.extractall(pattern).value_counts()
                     .rename_axis('Smiley').rename('Count').reset_index()
                     .assign(Type=lambda x: x['Smiley'].apply(emoji.demojize)))

Output:

>>> out
   Smiley  Count                              Type
0       👍    331                       :thumbs_up:
1       👌     50                         :OK_hand:
2       🏻     41                 :light_skin_tone:
3       😍     41    :smiling_face_with_heart-eyes:
4       😊     35  :smiling_face_with_smiling_eyes:
5       🙂     15           :slightly_smiling_face:
6       👎     14                     :thumbs_down:
7       😃     13     :grinning_face_with_big_eyes:
8       💩     10                     :pile_of_poo:
9       🏼     10          :medium-light_skin_tone:
10      😜      3        :winking_face_with_tongue:
11      😑      2             :expressionless_face:
12      🙈      2              :see-no-evil_monkey:
13      😢      2                     :crying_face:
14      🙊      2            :speak-no-evil_monkey:
15      👽      2                           :alien:
16      😎      1    :smiling_face_with_sunglasses:
17      😘      1             :face_blowing_a_kiss:
18      😚      1   :kissing_face_with_closed_eyes:
19      🐸      1                            :frog:
20      😇      1          :smiling_face_with_halo:
21      😮      1            :face_with_open_mouth:
22      😆      1         :grinning_squinting_face:
23      🙄      1          :face_with_rolling_eyes:
24      🐼      1                           :panda:
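
For readers who want the same counting idea without pandas, here is a minimal standard-library sketch. It is not the code from this answer: the `comments` list is a hypothetical stand-in for `df['Document']`, and `unicodedata.name` is used only as a rough substitute for `emoji.demojize`, so the labels will differ from the `Type` column above. `collections.Counter` replaces the `list.index` lookup in the original loop, which rescans the list for every smiley found.

```python
import re
import unicodedata
from collections import Counter

# Hypothetical sample comments standing in for df['Document']
comments = [
    "This movie 👍 is great! I loved it!",
    "This movie 👍 is okay.. Skip it!",
    "This movie 😍 is amazing!!! A must-see!!",
    "This movie is average. Skip it!",
]

# Same four ranges as the pattern in the question
pattern = re.compile(
    "[\U0001F600-\U0001F64F\U0001F300-\U0001F5FF"
    "\U0001F680-\U0001F6FF\U0001F1E0-\U0001F1FF]"
)

# Counter does a dictionary update per smiley instead of a list scan
counts = Counter()
for comment in comments:
    counts.update(pattern.findall(comment))

# (smiley, count, Unicode name) rows, most frequent first
rows = [
    (smiley, count, unicodedata.name(smiley, "UNKNOWN"))
    for smiley, count in counts.most_common()
]

for row in rows:
    print(row)
```

The `rows` list can be passed to `pd.DataFrame(rows, columns=['Smiley', 'Count', 'Type'])` if a DataFrame is needed afterwards.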

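"Is the pattern part correct? Am I missing any emojis?"
Your pattern is not complete. I don't know the full list you want to extract, but here is some code to debug it: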

#     add latin1 codes --v
pattern2 = '([\\U00000000-\\U000000FF\\U0001F600-\\U0001F64F\\U0001F300-\\U0001F5FF\\U0001F680-\\U0001F6FF\\U0001F1E0-\\U0001F1FF])'

other = df['Document'].str.replace(pattern2, '', regex=True)
print(other[other != ''])

# Output / Missed emojis
1149    🤗
1238    🦉
1305    🤫
1424    🤫
1978    🤭
2611    🤮
2623    🦉
2959    🤡
3717    🤪
4045    🦉
4067    🤖
4699    🤖
4975    🤪
Name: Document, dtype: object
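
The missed emojis in the output above have something in common, which a short check makes visible. The sketch below is an illustration rather than part of the original answer: it prints the code point of each missed emoji and shows that all of them fall in the range U+1F900 to U+1F9FF (the Supplemental Symbols and Pictographs block), which the question's pattern does not include. The name `extended_pattern` is hypothetical, and adding this one range is only assumed to be enough for the test dataset, not for arbitrary real-world text.

```python
import re

# Emojis reported as missed in the debug output above
missed = ['🤗', '🦉', '🤫', '🤭', '🤮', '🤡', '🤪', '🤖']

# Show the code point of each missed emoji
for ch in missed:
    print(ch, f"U+{ord(ch):04X}")

# All of them sit in the U+1F900-U+1F9FF block
all_in_block = all(0x1F900 <= ord(ch) <= 0x1F9FF for ch in missed)
print(all_in_block)

# Original four ranges plus the extra block (hypothetical extension)
extended_pattern = re.compile(
    "[\U0001F600-\U0001F64F\U0001F300-\U0001F5FF"
    "\U0001F680-\U0001F6FF\U0001F1E0-\U0001F1FF"
    "\U0001F900-\U0001F9FF]"
)

# Nothing from the missed list should remain unmatched
still_missed = [ch for ch in missed if not extended_pattern.fullmatch(ch)]
print(still_missed)
```

Hand-maintained ranges like this tend to fall behind as new emojis are added to Unicode, so a library that ships its own emoji table is usually the safer choice.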

km0tfn4u #2

Thanks to @corralien and @cuzi, I was able to get the final result with the code below. It does not use a pattern; instead it uses the emoji.analyze(text, join_emoji=True) function:

import emoji

out = (df['Document'].apply(lambda text: [token.chars for token in emoji.analyze(text, join_emoji=True) 
                       if isinstance(token.value, emoji.EmojiMatch)]).explode().value_counts()
                      .rename_axis('Smiley').rename('Count').reset_index())

out

Output

index   Smiley  Count
0   👍  331
1   👌  50
2   😍  41
3   🏻  41
4   😊  35
5   🙂  15
6   👎  14
7   😃  13
8   🏼  10
9   💩  10
10  😜  3
11  🦉  3
12  🤖  2
13  🤪  2
14  😢  2
15  🙈  2
16  😑  2
17  🤫  2
18  🙊  2
19  👽  2
20  🤭  1
21  🤗  1
22  😇  1
23  🐸  1
24  🤮  1
25  🤡  1
26  😚  1
27  😎  1
28  😘  1
29  🐼  1
30  😆  1
31  😮  1
32  🙄  1
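
One thing to keep in mind when comparing pattern-based counts with real comments: a regex character class matches one code point at a time, so a thumbs-up followed by a skin tone modifier is counted as two separate entries (the thumbs-up and the modifier). The sketch below is a hedged, standard-library illustration of keeping the two together; the variable names and the choice of ranges are assumptions, and it does not handle ZWJ sequences or flags.

```python
import re
from collections import Counter

# Built from escapes so the exact code points are unambiguous
thumbs = "\U0001F44D"      # thumbs up
light = "\U0001F3FB"       # light skin tone modifier

text = f"Great {thumbs}{light} loved it {thumbs} wow {thumbs}{light}"

# Skin tone modifiers occupy U+1F3FB-U+1F3FF
tones = "\U0001F3FB-\U0001F3FF"
# Base ranges with the modifier range carved out (illustrative, not exhaustive)
base = (
    "\U0001F300-\U0001F3FA\U0001F400-\U0001F5FF"
    "\U0001F600-\U0001F64F\U0001F680-\U0001F6FF\U0001F900-\U0001F9FF"
)

# A base emoji optionally followed by one modifier, or a lone modifier
pattern = re.compile(f"[{base}][{tones}]?|[{tones}]")

counts = Counter(pattern.findall(text))
for smiley, count in counts.most_common():
    print(smiley, count)
```

With this pattern the toned thumbs-up is counted as its own entry instead of inflating both the plain thumbs-up and the skin tone rows.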
