Filtering a DataFrame in Python based on a specific condition

13z8s7eq  posted on 2023-07-01  in Python

I have a DataFrame with the following columns: INVOICE_DATE, COUNTRY, CUSTOMER_ID, INVOICE_ID, DESCRIPTION, USIM, and DEMANDQTY. I want to filter the DataFrame based on a specific condition.

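The condition is: if the DESCRIPTION column of any row contains the word "kids" or "baby", I want every row with that row's INVOICE_ID to be included in the filtered DataFrame. In other words, a whole transaction should be kept as long as at least one of its items belongs to the kids or baby category.
I tried using the str.contains() method with a regular expression pattern, but I could not get the result I need.
Here is my code: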

import pandas as pd

# Assuming the DataFrame is named 'df'

# Filter the DataFrame based on the condition
filtered_df = df[df['DESCRIPTION'].str.contains('kids|baby', case=False, regex=True)]

# Print the filtered DataFrame
filtered_df

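However, this code does not give the expected result. It filters the DataFrame row by row instead of considering the whole transaction, so only the matching rows are kept and the other items on the same invoice are dropped.
The test data is as follows: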

import pandas as pd
import random
import string
import numpy as np

random.seed(42)
np.random.seed(42)

num_transactions = 100
max_items_per_transaction = 6

# Generate a list of possible items
possible_items = [
    "Kids T-shirt", "Baby Onesie", "Kids Socks",
    "Men's Shirt", "Women's Dress", "Kids Pants",
    "Baby Hat", "Women's Shoes", "Men's Pants",
    "Kids Jacket", "Baby Bib", "Men's Hat",
    "Women's Skirt", "Kids Shoes", "Baby Romper",
    "Men's Sweater", "Kids Gloves", "Baby Blanket"
]

# Create the DataFrame
rows = []

for i in range(num_transactions):
    num_items = random.randint(1, max_items_per_transaction)
    items = random.sample(possible_items, num_items)
    invoice_dates = pd.date_range(start='2022-01-01', periods=num_items, freq='D')
    countries = random.choices(['USA', 'Canada', 'UK'], k=num_items)
    customer_id = i + 1
    invoice_id = 1001 + i

    for j in range(num_items):
        item = items[j]
        usim = ''.join(random.choices(string.ascii_uppercase + string.digits, k=6))  # Generate a random 6-character USIM value
        demand_qty = random.randint(1, 10)

        row = {
            'INVOICE_DATE': invoice_dates[j],
            'COUNTRY': countries[j],
            'CUSTOMER_ID': customer_id,
            'INVOICE_ID': invoice_id,
            'DESCRIPTION': item,
            'USIM': usim,
            'DEMANDQTY': demand_qty
        }
        rows.append(row)

df = pd.DataFrame(rows)

# Print the DataFrame
df

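Can someone show me how to filter the DataFrame correctly according to the condition described above? I would appreciate any help or suggestions. Thank you!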

Answer #1, by eh57zj3b

Assume the following DataFrame:

>>> df
  DESCRIPTION  INVOICE_ID
0        kids         123
1       hello         123
2       world         123
3     another         456
4         one         456

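You probably want to keep every row with INVOICE_ID=123, because 'kids' appears in the description of row 0. Build a row-level boolean mask, then spread it across each invoice with a group-wise maximum so that one match marks the whole invoice: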

m = df['DESCRIPTION'].str.contains('kids|baby', case=False, regex=True)
filtered_df = df[m.groupby(df['INVOICE_ID']).transform('max')]

Output:

>>> filtered_df
  DESCRIPTION  INVOICE_ID
0        kids         123
1       hello         123
2       world         123
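
For readers who want to see the intermediate steps, here is a minimal self-contained sketch of the same idea on the toy data above. It is not part of the original answer: the names row_match, invoice_match, matching_ids and filtered_alt are made up for illustration, na=False is an added assumption so that missing descriptions count as non-matches, and transform('any') is used in place of transform('max') (for a boolean mask the two should select the same rows). The isin variant at the end is an alternative way to reach the same selection.

```python
import pandas as pd

# Toy data mirroring the small example above (illustrative values only).
df = pd.DataFrame({
    "DESCRIPTION": ["kids", "hello", "world", "another", "one"],
    "INVOICE_ID": [123, 123, 123, 456, 456],
})

# Step 1: row-level mask, True where the description mentions kids or baby.
# na=False (an added assumption) treats missing descriptions as non-matches.
row_match = df["DESCRIPTION"].str.contains(
    "kids|baby", case=False, regex=True, na=False
)

# Step 2: broadcast the mask to the invoice level. Every row of an invoice
# becomes True if at least one row of that invoice matched.
invoice_match = row_match.groupby(df["INVOICE_ID"]).transform("any")
filtered_df = df[invoice_match]

# Alternative: collect the matching invoice IDs first, then select with isin.
matching_ids = df.loc[row_match, "INVOICE_ID"].unique()
filtered_alt = df[df["INVOICE_ID"].isin(matching_ids)]

print(filtered_df)
```

The groupby/transform version keeps everything in a single aligned boolean Series, while the isin version may be easier to read when the list of matching invoice IDs is also needed elsewhere.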
