django 如何提高大表外键上bulk_update的速度?

hfwmuf9z  于 2023-08-08  发布在  Go
关注(0)|答案(3)|浏览(178)

下面的代码可以工作,但运行速度非常慢,如果我尝试使用太多PostalCode对象,就会超时。Address模型有大约100万个对象,查询每个循环都要花很长时间。显然,我这样做是错误的方式,所以任何帮助将不胜感激!这是我第一次处理大量数据。我看了看并尝试了交易(它确实工作,但仍然很慢),但我不确定这将如何帮助在这种情况下。如果它需要时间来运行,这是确定的,没有人会使用它,当我这样做定期更新,但它不能下降8小时或类似的东西,而这个过程运行。我知道我可以用celery 来卸下这些东西,但我还没有在我的学习中走得那么远,真的需要先了解这些基本原理。
型号:

class Address(models.Model):
    name = models.CharField(max_length=200, blank=True, null=True)
    address = models.CharField(max_length=300, blank=True, null=True)
    city = models.CharField(max_length=100, blank=True, null=True)
    province = models.CharField(max_length=100, blank=True, null=True)
    postal_code = models.CharField(max_length=6, blank=True, null=True)

    class Meta:
       indexes = [
           models.Index(fields=['postal_code',]),
              ]

    def __str__(self):
        return self.address


class PostalCode(models.Model):

    postal_code = models.CharField(max_length=6, blank=True, null=True)
    address = models.ForeignKey('Address', on_delete=models.DO_NOTHING, blank=True, null=True)

    class Meta:
       indexes = [
           models.Index(fields=['postal_code',]),
              ]

    def __str__(self):
        return self.postal_code

字符串
查看:

def update_postal_code_addresses(request):
    
    postal_codes = PostalCode.objects.all()
    addresses = Address.objects.all()

    objs = []

    for pc in postal_codes.iterator():
        try:
            obj = addresses.get(postal_code=pc.postal_code) # this table has about 1M rows
            pc.address = obj
            objs.append(pc)

        except Address.DoesNotExist:
            pass

    PostalCode.objects.bulk_update(objs, ['address'], batch_size=100)

    return redirect('home')

ffscu2ro

ffscu2ro1#

在这里获取Address对象是瓶颈。您可以批量获取Adress对象并使用字典,如:

def update_postal_code_addresses(request):    
    postal_codes = PostalCode.objects.all()
    addresses = Address.objects.only('pk', 'postal_code')
    address_dict = {
        address.postal_code: address.pk
        for adress in addresses.iterator()
    }

    objs = []
    for pc in PostalCode.iterator():
        adr = address_dict.get(pc.postal_code)
        if adr is not None:
            pc.address_id = adr
            objs.add(pc)

    PostalCode.objects.bulk_update(objs, ['address'], batch_size=100)

    return redirect('home')

字符串
注意,如果有两个Address具有 * 相同的 * postal_code,那么我们将使用恰好是addresses查询集中最新的一个。

pxq42qpu

pxq42qpu2#

所有对Address表的读调用都是导致查询变慢的原因。有很多方法可以根据您的资源来帮助您的查询,但第一步是通过迭代到Address表而不是逐个收集它们,并通过将PostalCode存储在内存中(假设它是一个足够小的表,您可以这样做)来减少读取量。
我认为PostalCode比Address小得多,所以存储PostalCode更有意义,而且它允许我们在Address上进行子查询以检索较少的项目,而且由于您在postal_code字段上有Address索引,所以查询不应该太昂贵。
在你的模型中,PostalCode的postal_code字段不是唯一的,所以我们将按postal_code对PostalCode表进行分组,并假设几个PostalCode可能匹配同一个postal_code:

from collections import defaultdict

...

def update_postal_code_addresses(request):

    # Store the PostalCode table in a dictionary mapping its postal_code with the objects associated to it.
    postal_codes= defaultdict(list)
    for postal_code in PostalCode.objects.all():
        postal_codes[postal_code.postal_code].append(postal_code)

    # Retrieve a QuerySet of Addresses that have a postal code within our PostalCode table. We only need the primary key and the postal_code to update our foreign key.
    addresses = Address.objects.filter(postal_code__in=set(postal_codes)).values_list('pk', 'postal_code')

    objs = []

    # We iterate over our addresses, updating the `address` field of our PostalCodes with the address' primary key.
    for address_id, postal_code in addresses:
        for p in postal_codes[postal_code]:
            p.address_id = address_id  # We only need the primary key of Address to set the foreign key.
            objs.append(p)

    PostalCode.objects.bulk_update(objs, ['address'], batch_size=100)

    return redirect('home')

字符串
这将大大提高你的表现。

6ojccjat

6ojccjat3#

一般建议。queryset越大,batch_size参数应该越小。从0到max1000是最佳的。

相关问题