Tuning a nested apply() over a large dataset in pandas

bwitn5fc · posted 2022-12-25 in Other

I'm trying to compare two DataFrames and am running into a scale problem.
I take each row of new item descriptions and score it against a 4.5-million-row inventory list, computing a similarity for every pair. I only need the top x recommendations per item, and I've realized my current approach is quickly overwhelmed at that volume and crashes the kernel.
I haven't worked with data of this size before, so I'm not sure how to adapt my code.
Any advice is much appreciated. The current approach loads everything into a DataFrame (holding_df) and then uses groupby to collect the best suggestions, but it falls over once the process is scaled up to the full data size.

> df.head()

   ITEM_DESC  
0  paintbrush  
1  mop #2  
2  red bucket  
3  o-light flashlight
> df_inventory.head()
   ITEM_DESC  
0  broom  
1  mop  
2  bucket  
3  flashlight
import pandas as pd

from fuzzywuzzy import fuzz

def calculate_similarity(x, y):
    sample_list.append(
        {
            "New Item": x,
            "Inventory Item": y,
            "Similarity": fuzz.ratio(str(x).lower(), str(y).lower()),
        }
    )
    return

sample_list = []

df = pd.DataFrame(
    {"ITEM_DESC": ["paintbrush", "mop #2", "red bucket", "o-light flashlight"]}
)

df_inventory = pd.DataFrame({"ITEM_DESC": ["broom", "mop", "bucket", "flashlight"]})

temp = df["ITEM_DESC"].apply(
    lambda x: df_inventory["ITEM_DESC"].apply(lambda y: calculate_similarity(x, y))
)

holding_df = pd.DataFrame(sample_list)
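For reference, the nested apply above can be replaced by a loop that keeps only the top x matches per new item, so the full len(df) × len(df_inventory) cross product is never stored. This is a sketch only: `difflib.SequenceMatcher` stands in for `fuzz.ratio` to keep it dependency-free, and the inventory here is the tiny example list, not the real 4.5M rows.

```python
# Sketch: bound memory by keeping only the top-x matches per new item.
# difflib.SequenceMatcher approximates fuzz.ratio; swap fuzz.ratio back in.
import difflib
import heapq

def ratio(a: str, b: str) -> int:
    return round(100 * difflib.SequenceMatcher(None, a.lower(), b.lower()).ratio())

new_items = ["paintbrush", "mop #2", "red bucket", "o-light flashlight"]
inventory = ["broom", "mop", "bucket", "flashlight"]
top_x = 2

rows = []
for item in new_items:
    # heapq.nlargest scans the whole generator but stores only top_x pairs.
    best = heapq.nlargest(top_x, ((ratio(item, inv), inv) for inv in inventory))
    rows += [
        {"New Item": item, "Inventory Item": inv, "Similarity": score}
        for score, inv in best
    ]
```

`rows` can then be wrapped in `pd.DataFrame(rows)` just like `holding_df` above, but it only ever holds `top_x` entries per new item.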
szqfcxe2 · answer #1

I implemented something in plain Python that won't blow up the kernel, but it won't be fast.
Comparing one new product against the entire inventory takes 6-7 seconds, which is probably too slow for 3.5k entries (about 6 h 20 min if run on my machine), though with some work it could be parallelized.

6.5 s per new item
3500 items × 6.5 s ÷ 3600 s/h ≈ 6 h 20 min

The main memory saver is the FixedSizeLeaderboard class, which I implemented to track only the top n most similar items per new product. Since the task is now CPU-bound rather than memory-bound, you could rewrite it a bit to use the multiprocessing module.
I decided to just generate some test data, which may or may not be representative of real performance; I've added comments where you'd plug in your own data.
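The bounded-memory trick the FixedSizeLeaderboard class relies on, that `bisect.insort` keeps the list sorted, so the minimum is always at index 0 and can be popped cheaply, can be traced in a standalone snippet (scores here are made up):

```python
import bisect

items = []
size = 3
for score, name in [(56, "a"), (70, "b"), (40, "c"), (90, "d"), (65, "e")]:
    # Same condition as FixedSizeLeaderboard.add: accept while not full,
    # or when the score beats the current minimum.
    if len(items) < size or score > items[0][0]:
        if len(items) == size:
            items.pop(0)              # drop the current minimum; list stays sorted
        bisect.insort(items, (score, name))

print(sorted(items, reverse=True))    # top-3 by score, best first
```

At no point does the list hold more than `size` entries, which is what keeps the 4.5M-row scan from accumulating memory.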

import bisect
import collections
import contextlib
import itertools
import time
import typing
import uuid

from fuzzywuzzy import fuzz

@contextlib.contextmanager
def log_runtime(task: str):
    """Contextmanager that logs the runtime of a piece of code."""

    start = time.perf_counter()

    yield

    runtime = time.perf_counter() - start

    print("Task '%s' took %.4f seconds" % (task, runtime))

def inventory_generator() -> typing.Iterable[str]:
    """Returns an iterable that yields product names."""

    def string_generator() -> typing.Iterable[str]:
        while True:
            yield str(uuid.uuid4())
            yield from ("aaa", "aba", "def", "dse", "asd")

    yield from string_generator()

class FixedSizeLeaderboard:
    """Keeps only the ``size`` highest-scoring (score, item) pairs seen so far."""

    size: int
    _min_score: typing.Optional[int]
    _items: typing.List[typing.Tuple[int, object]]

    def __init__(self, size) -> None:
        self.size = size
        self._items = []
        self._min_score = None

    def add(self, score: int, item: object) -> None:

        if len(self._items) < self.size or score > self._min_score:
            self._eject_element_with_lowest_score()
            bisect.insort(self._items, (score, item))
            self._min_score = self._items[0][0]

    def _eject_element_with_lowest_score(self) -> None:
        if len(self._items) == self.size:
            # The list is sorted, so we can pop the first one
            self._items.pop(0)

    def get(self) -> typing.List[typing.Tuple[int, object]]:
        return sorted(self._items, reverse=True)

def main():

    num_new_products = 2
    num_products_in_inventory = 4_500_000
    top_n_similarities = 3

    with log_runtime("Generate dummy-products"):

        # Convert everything to lowercase once.
        # This is not really required for uuids, but it should happen ONCE
        # Instead of the inventory_generator, you'd pass the content of your dataframe here.
        new_products = list(
            map(str.lower, itertools.islice(inventory_generator(), num_new_products))
        )
        inventoried_products = list(
            map(
                str.lower,
                itertools.islice(inventory_generator(), num_products_in_inventory),
            )
        )

    task_desc = (
        f"{num_new_products} x {num_products_in_inventory}"
        f" = {num_new_products * num_products_in_inventory} similarity computations"
    )

    product_to_leaderboard: typing.Dict[
        str, FixedSizeLeaderboard
    ] = collections.defaultdict(lambda: FixedSizeLeaderboard(top_n_similarities))

    with log_runtime(task_desc):
        for new_product, existing_product in itertools.product(
            new_products, inventoried_products
        ):

            similarity = fuzz.ratio(new_product, existing_product)
            product_to_leaderboard[new_product].add(similarity, existing_product)

    # Sort of pretty output formatting
    for product, similarities in product_to_leaderboard.items():
        print("=" * 3, "New Product", product, "=" * 3)
        for position, (score, product) in enumerate(similarities.get()):
            print(f"{position + 1:02}. score: {score} product: {product}")

if __name__ == "__main__":
    main()

If we execute it, we get output like this:

$ python apply_thingy.py
Task 'Generate dummy-products' took 1.6449 seconds
Task '2 x 4500000 = 9000000 similarity computations' took 12.0887 seconds
=== New Product 2d10f990-355e-42f6-b518-0a21a7fb8d5c ===
01. score: 56 product: f2100878-3c3e-4f86-b410-3c362184d195
02. score: 56 product: 5fc9b30c-35ed-4167-b997-1bf0a2af5b68
03. score: 56 product: 523210b2-e5e0-496a-b0b1-a1b2af49b0d5
=== New Product aaa ===
01. score: 100 product: aaa
02. score: 100 product: aaa
03. score: 100 product: aaa
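As noted above, the loop is CPU-bound and independent per new product, so it parallelizes naturally with `multiprocessing.Pool`. This is a minimal sketch, not the answer's code: `difflib.SequenceMatcher` stands in for `fuzz.ratio`, and `INVENTORY` is a tiny stand-in for the 4.5M-row list.

```python
import difflib
import heapq
import multiprocessing

INVENTORY = ["broom", "mop", "bucket", "flashlight"]  # stand-in for the 4.5M list

def top_n_for(new_item: str, n: int = 3):
    """Score new_item against the whole inventory, keeping only the top n."""
    scores = (
        (round(100 * difflib.SequenceMatcher(None, new_item, inv).ratio()), inv)
        for inv in INVENTORY
    )
    return new_item, heapq.nlargest(n, scores)

if __name__ == "__main__":
    new_products = ["mop #2", "red bucket"]
    # Each worker process handles new products independently; results come
    # back in input order.
    with multiprocessing.Pool() as pool:
        for item, best in pool.map(top_n_for, new_products):
            print(item, best)
```

With the leaderboard logic folded into the worker function, each process returns only its top-n list, so the memory profile stays flat as you add cores.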
