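I have a DataFrame with the following columns: INVOICE_DATE, COUNTRY, CUSTOMER_ID, INVOICE_ID, DESCRIPTION, USIM and DEMANDQTY. I want to filter the DataFrame based on a specific condition.
The condition is: if the DESCRIPTION column contains the word "kids" or "baby", I want every row from that INVOICE_ID to be included in the filtered DataFrame. In other words, a transaction should be included in full as soon as at least one of its items belongs to the kids or baby category.
I tried using the str.contains() method with a regular-expression pattern, but I am having trouble getting the desired result.
Here is my code: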
import pandas as pd
# Assuming the DataFrame is named 'df'
# Filter the DataFrame based on the condition
filtered_df = df[df['DESCRIPTION'].str.contains('kids|baby', case=False, regex=True)]
# Print the filtered DataFrame
filtered_df
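However, this code does not give the expected result. It filters the DataFrame row by row instead of considering the whole transaction.
Here is the test data: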
import pandas as pd
import random
import string
import numpy as np
random.seed(42)
np.random.seed(42)
num_transactions = 100
max_items_per_transaction = 6
# Generate a list of possible items
possible_items = [
    "Kids T-shirt", "Baby Onesie", "Kids Socks",
    "Men's Shirt", "Women's Dress", "Kids Pants",
    "Baby Hat", "Women's Shoes", "Men's Pants",
    "Kids Jacket", "Baby Bib", "Men's Hat",
    "Women's Skirt", "Kids Shoes", "Baby Romper",
    "Men's Sweater", "Kids Gloves", "Baby Blanket"
]
# Create the DataFrame
rows = []
for i in range(num_transactions):
    num_items = random.randint(1, max_items_per_transaction)
    items = random.sample(possible_items, num_items)
    invoice_dates = pd.date_range(start='2022-01-01', periods=num_items, freq='D')
    countries = random.choices(['USA', 'Canada', 'UK'], k=num_items)
    customer_id = i + 1
    invoice_id = 1001 + i
    for j in range(num_items):
        item = items[j]
        usim = ''.join(random.choices(string.ascii_uppercase + string.digits, k=6))  # Generate a random 6-character USIM value
        demand_qty = random.randint(1, 10)
        row = {
            'INVOICE_DATE': invoice_dates[j],
            'COUNTRY': countries[j],
            'CUSTOMER_ID': customer_id,
            'INVOICE_ID': invoice_id,
            'DESCRIPTION': item,
            'USIM': usim,
            'DEMANDQTY': demand_qty
        }
        rows.append(row)
df = pd.DataFrame(rows)
# Print the DataFrame
df
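Can someone guide me on how to filter the DataFrame correctly based on the condition described? I would appreciate any help or suggestions. Thank you!
1 Answer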
Assuming the following dataframe:
You probably want to keep INVOICE_ID=123, because 'kids' appears in the description of row 0:
Output:
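A minimal sketch of one way to keep whole invoices, offered as an assumption rather than as the answer's original code: build the row-level mask with str.contains(), then lift it to the invoice level. The small DataFrame here is hypothetical sample data (only INVOICE_ID 123 comes from the text above), and the names row_match, matching_ids, invoice_match and filtered_alt are illustrative.

```python
import pandas as pd

# Hypothetical sample data: invoices 123 and 789 each contain a kids/baby item, 456 does not.
df = pd.DataFrame({
    "INVOICE_ID": [123, 123, 456, 456, 789],
    "DESCRIPTION": ["Kids T-shirt", "Men's Shirt", "Women's Dress", "Men's Hat", "Baby Bib"],
    "DEMANDQTY": [2, 1, 3, 1, 5],
})

# Row-level mask: True where the description mentions "kids" or "baby" (case-insensitive).
row_match = df["DESCRIPTION"].str.contains("kids|baby", case=False, regex=True)

# Option 1: collect the invoice IDs that have at least one matching row,
# then keep every row that belongs to one of those invoices.
matching_ids = df.loc[row_match, "INVOICE_ID"].unique()
filtered_df = df[df["INVOICE_ID"].isin(matching_ids)]

# Option 2: broadcast "any match within this invoice" back onto every row.
invoice_match = row_match.groupby(df["INVOICE_ID"]).transform("any")
filtered_alt = df[invoice_match]

print(filtered_df)
```

Both options select the same rows. The isin() version is perhaps easier to read, while the groupby/transform version keeps everything as a single boolean mask that can be combined with other conditions.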