python-3.x Downloading a bz2 file and reading the compressed archive in memory (avoiding memory overflow)

x33g5p2x  posted on 2023-02-14 in Python

As the title says, I'm downloading a bz2 file that contains a folder full of large text files...
My first version decompressed everything in memory, but although the archive is only 90 MB, uncompressed it holds 60 files of 750 MB each... the computer went bust! (Obviously it can't handle something like 40 GB of RAM XD.)
So the problem is that the files are too big to hold in memory all at once... so I'm using this code, which works, but it sucks (it's too slow):

import os
import requests
import tarfile

response = requests.get('https://fooweb.com/barfile.bz2')

# Save the archive to disk:
compress_filepath = '{0}/files/sources/{1}'.format(zsets.BASE_DIR, check_time)
with open(compress_filepath, 'wb') as local_file:
    local_file.write(response.content)

# Extract the files into a folder:
extract_folder = compress_filepath + '_ext'
with tarfile.open(compress_filepath, "r:bz2") as tar:
    tar.extractall(extract_folder)

# Process one file at a time:
for filename in os.listdir(extract_folder):
    filepath = '{0}/{1}'.format(extract_folder, filename)
    with open(filepath, 'r') as f:
        for line in f:
            some_processing(line)

Is there a way I can do this without dumping everything to disk... just decompressing and reading one file at a time from the .bz2?
Thank you very much in advance for your time; I hope someone knows how to help me with this...


b0zn9rqh1#

#!/usr/bin/python3
import sys
import requests
import tarfile
got = requests.get(sys.argv[1], stream=True)
got.raise_for_status()
# 'r|*' reads the tar as a non-seekable stream, auto-detecting the
# compression (bz2 here), so only one member is in memory at a time.
with tarfile.open(fileobj=got.raw, mode='r|*') as tar:
    for info in tar:
        if info.isreg():
            ent = tar.extractfile(info)
            # now process ent as a file, however you like
            print(info.name, len(ent.read()))

nnvyjq4y2#

This is how I did it:

import io
import requests
import tarfile

response = requests.get(my_url_to_file)
memfile = io.BytesIO(response.content)

# We extract files in memory, one by one:
filecount = 0
with tarfile.open(fileobj=memfile, mode="r:bz2") as tar:
    for member_name in tar.getnames():
        filecount += 1
        member = tar.extractfile(member_name)
        if member is None:  # skip directories
            continue
        # extractfile() returns a binary stream, not a path,
        # so wrap it rather than calling open() on it:
        for line in io.TextIOWrapper(member, encoding='utf-8'):
            process_line(line)
