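I am working on a sentiment analysis task and have a lot of comments that contain emojis.
I would like to know whether my code is correct, or whether there is a way to optimize it.
Smiley counting code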
import pandas as pd
import regex as re
import emoji
# Assuming your DataFrame is called 'df' and the column with comments is 'Document'
comments = df['Document']
# Initialize an empty dictionary to store smiley counts and types
smiley_data = {'Smiley': [], 'Count': [], 'Type': []}
# Define a regular expression pattern to match smileys
pattern = r'([\U0001F600-\U0001F64F\U0001F300-\U0001F5FF\U0001F680-\U0001F6FF\U0001F1E0-\U0001F1FF])'
# Iterate over the comments
for comment in comments:
    # Extract smileys and their types from the comment
    smileys = re.findall(pattern, comment)
    # Increment the count and store the smileys and their types
    for smiley in smileys:
        if smiley in smiley_data['Smiley']:
            index = smiley_data['Smiley'].index(smiley)
            smiley_data['Count'][index] += 1
        else:
            smiley_data['Smiley'].append(smiley)
            smiley_data['Count'].append(1)
            smiley_data['Type'].append(emoji.demojize(smiley))
# Create a DataFrame from the smiley data
smiley_df = pd.DataFrame(smiley_data)
# Sort the DataFrame by count in descending order
smiley_df = smiley_df.sort_values(by='Count', ascending=False)
# Print the smiley data
smiley_df
I am mainly unsure whether the code block below captures all the smileys:
# Define a regular expression pattern to match smileys
pattern = r'([\U0001F600-\U0001F64F\U0001F300-\U0001F5FF\U0001F680-\U0001F6FF\U0001F1E0-\U0001F1FF])'
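I would also like to know what I can do with this analysis, for example which charts I could build on top of it.
I have also shared a test dataset that generates smiley counts similar to those in my real data. Note that the test dataset only contains known smileys; unlike the real dataset, it does not contain any others.
Test dataset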
import random
import pandas as pd
smileys = ['👍', '👌', '😍', '🏻', '😊', '🙂', '👎', '😃', '🏼', '💩']
# Additional smileys to complete the required count
additional_smileys = ['😄', '😎', '🤩', '😘', '🤗', '😆', '😉', '😋', '😇', '🥳', '🙌', '🎉', '🔥', '🥰', '🤪', '😜', '🤓',
'😚', '🤭', '🤫', '😌', '🥱', '🥶', '🤮', '🤡', '😑', '😴', '🙄', '😮', '🤥', '😢', '🤐', '🙈', '🙊',
'👽', '🤖', '🦄', '🐼', '🐵', '🦁', '🐸', '🦉']
# Combine the required smileys and additional smileys
all_smileys = smileys + additional_smileys
# Set a random seed for reproducibility
random.seed(42)
# Generate a single review
def generate_review(with_smiley=False):
    review = "This movie"
    if with_smiley:
        review += " " + random.choice(all_smileys)
    review += " is "
    review += random.choice(["amazing", "excellent", "fantastic", "brilliant", "great", "good", "okay", "average",
                             "mediocre", "disappointing", "terrible", "awful", "horrible"])
    review += random.choice(["!", "!!", "!!!", ".", "..", "..."]) + " "
    review += random.choice(["Highly recommended", "Definitely worth watching", "A must-see", "I loved it",
                             "Not worth your time", "Skip it"]) + random.choice(["!", "!!", "!!!"])
    return review
# Generate the random dataset
def generate_dataset():
    dataset = []
    review_count = 5000
    # Generate reviews with top smileys
    for smiley, count, _ in top_smileys:
        while count > 0:
            review = generate_review(with_smiley=True)
            if smiley in review:
                dataset.append(review)
                count -= 1
    # Generate reviews with additional smileys
    additional_smileys_count = len(additional_smileys)
    additional_smileys_per_review = review_count - len(dataset)
    additional_smileys_per_review = min(additional_smileys_per_review, additional_smileys_count)
    for _ in range(additional_smileys_per_review):
        review = generate_review(with_smiley=True)
        dataset.append(review)
    # Generate reviews without smileys
    while len(dataset) < review_count:
        review = generate_review()
        dataset.append(review)
    # Shuffle the dataset
    random.shuffle(dataset)
    return dataset
# List of top smileys and their counts
top_smileys = [
('👍', 331, ':thumbs_up:'),
('👌', 50, ':OK_hand:'),
('😍', 41, ':smiling_face_with_heart-eyes:'),
('🏻', 38, ':light_skin_tone:'),
('😊', 35, ':smiling_face_with_smiling_eyes:'),
('🙂', 14, ':slightly_smiling_face:'),
('👎', 12, ':thumbs_down:'),
('😃', 12, ':grinning_face_with_big_eyes:'),
('🏼', 10, ':medium-light_skin_tone:'),
('💩', 10, ':pile_of_poo:')
]
# Generate the dataset
dataset = generate_dataset()
# Create a data frame with 'Document' column
df = pd.DataFrame({'Document': dataset})
# Display the DataFrame
df
Thank you for your feedback.
2 answers
bqjvbblv1#
You can use str.extractall to avoid the loop, then value_counts to count the occurrences. Finally, "demojize" each smiley (the slowest part).
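Is the pattern part correct? Am I not missing any emoji?
Your pattern is not correct. I don't know the complete list you want to extract, but here is some code to debug it: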
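One possible way to run such a check, sketched under assumptions rather than taken from the answer: scan the text for symbol characters and report those the pattern fails to match. The helper name unmatched_symbols and the sample strings are hypothetical. The Unicode category "So" is only a rough proxy for emoji and will also flag symbols such as ©.

```python
import re
import unicodedata

# Character class from the question, compiled without the capture group
PATTERN = re.compile(
    r"[\U0001F600-\U0001F64F\U0001F300-\U0001F5FF"
    r"\U0001F680-\U0001F6FF\U0001F1E0-\U0001F1FF]"
)

def unmatched_symbols(texts, pattern=PATTERN):
    """Count symbol characters in `texts` that `pattern` does not match."""
    missed = {}
    for text in texts:
        for ch in text:
            # "So" (Symbol, other) covers most single-code-point emoji
            if unicodedata.category(ch) == "So" and not pattern.fullmatch(ch):
                missed[ch] = missed.get(ch, 0) + 1
    return missed

# Hypothetical sample comments
sample = [
    "This movie 👍 is great!",
    "This movie 🤩 is amazing!",
    "This movie 🥳 is good. 🤩",
]

missed = unmatched_symbols(sample)
for ch, n in sorted(missed.items(), key=lambda kv: -kv[1]):
    print(f"U+{ord(ch):04X} {ch} {unicodedata.name(ch, 'UNKNOWN')} x{n}")
```

On this sample the thumbs-up is matched, while 🤩 (U+1F929) and 🥳 (U+1F973) are reported: both sit in the U+1F900 to U+1F9FF block, which none of the four ranges in the pattern covers.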
km0tfn4u2#
Thanks to @corralien and @cuzi, I was able to get the final result using the code below. It does not use a pattern; instead it uses the emoji.analyze(text, join_emoji=True) function.