tensorflow: Read a tar.gz compressed file from a file stream, uncompress it, and put it in another file stream without writing to disk

Posted by x7yiwoj4 on 2022-11-16 in Other


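I have a large 1 GB directory containing multiple files. It is stored in S3 as a tar.gz archive and is meant to be used by a Lambda function.
The Lambda function's filesystem is read-only, so I want the operation to be done in memory.
I can't include the directory in the Lambda function's image itself, because GitHub does not accept files that large.
Having Lambda read it from S3 seems reasonable, but I don't know how to uncompress it. Sorry, I'm a beginner.
Here is what I wrote: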

# Define the resources to use
s3 = boto3.resource('s3', region_name='us-east-1')
bucket = s3.Bucket('tensorflow-models')
object = bucket.Object('saved-model.tar.gz')

# Prepare 2 file streams
file_stream1 = io.BytesIO()
file_stream2 = io.BytesIO()

# Download object to file stream
object.download_fileobj(file_stream1)

# Uncompress it
with tarfile.open(file_stream1, "r:gz") as tar:
    tar.extractall(file_stream2)

# Use it in Tensorflow
model = tf.keras.models.load_model(file_stream2)

# Get the result
result = model.call(embedded_sentences)

Here is the error message:

{
  "errorMessage": "expected str, bytes or os.PathLike object, not BytesIO",
  "errorType": "TypeError",
  "requestId": "xxxxxxxxxxxxxxxxxxx",
  "stackTrace": [
    "  File \"/var/task/app.py\", line 87, in lambda_handler\n    with tarfile.open(file_stream1, \"r:gz\") as tar:\n",
    "  File \"/var/lang/lib/python3.9/tarfile.py\", line 1629, in open\n    return func(name, filemode, fileobj, **kwargs)\n",
    "  File \"/var/lang/lib/python3.9/tarfile.py\", line 1675, in gzopen\n    fileobj = GzipFile(name, mode + \"b\", compresslevel, fileobj)\n",
    "  File \"/var/lang/lib/python3.9/gzip.py\", line 173, in __init__\n    fileobj = self.myfileobj = builtins.open(filename, mode or 'rb')\n"
  ]
}

tensorflow

Source: https://stackoverflow.com/questions/70309520/read-a-tar-gz-compressed-file-form-a-file-stream-uncompress-it-and-put-it-in-an

2 Answers


hrysbysz #1

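I don't think you can process a 1 GB file from Lambda, because its temporary directory is limited to 512 MB by default (see https://docs.aws.amazon.com/lambda/latest/dg/gettingstarted-limits.html).
For large files, look into mounting EFS on the Lambda function, or change the logic (explore another kind of worker as an alternative to Lambda).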
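Before choosing between these options, it can help to know how large the archive is once uncompressed. The following is a minimal sketch, not part of the original answer: it sums the member sizes recorded in the tar headers without extracting anything. The names `uncompressed_size`, `fits_in_limit`, `build_demo_archive`, and `LIMIT_BYTES` are hypothetical, and the limit simply reuses the 512 MB figure mentioned above.

```python
import io
import tarfile

# Hypothetical limit for illustration: the 512 MB figure cited above.
LIMIT_BYTES = 512 * 1024 * 1024


def uncompressed_size(archive_bytes):
    """Sum the sizes of regular files in a tar archive without extracting them."""
    # mode="r:*" lets tarfile detect the compression (gzip, bz2, xz, or none).
    with tarfile.open(fileobj=io.BytesIO(archive_bytes), mode="r:*") as tar:
        return sum(member.size for member in tar if member.isfile())


def fits_in_limit(archive_bytes, limit=LIMIT_BYTES):
    """Return True if the uncompressed contents fit within `limit` bytes."""
    return uncompressed_size(archive_bytes) <= limit


def build_demo_archive():
    """Build a tiny tar.gz in memory (demo data only, 12 bytes of content)."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w:gz") as tar:
        for name, data in (("a.txt", b"hello"), ("b.txt", b"tarball")):
            info = tarfile.TarInfo(name=name)
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))
    return buf.getvalue()
```

With real data, `archive_bytes` would be the bytes read from the S3 object. Note that the compressed archive itself still has to fit in memory for this check.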


sxpgvts3 #2

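One possibility is to use Lambda's RAM instead of its temporary storage. See the Python answer here.
Basically, you extract the archive file by file and load each file into memory one at a time, instead of doing everything at once. The code snippet from the linked source, which uses a single-threaded solution, is: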

import tarfile
from io import BytesIO

import boto3

s3_client = boto3.client('s3')


def untar_s3_file(event, context):
    # Bucket and key of the archive come from the S3 event record
    bucket = event['Records'][0]['s3']['bucket']['name']
    key = event['Records'][0]['s3']['object']['key']

    # Read the whole compressed archive into memory
    input_tar_file = s3_client.get_object(Bucket=bucket, Key=key)
    input_tar_content = input_tar_file['Body'].read()

    # Open the archive from memory and upload each regular file back to S3
    with tarfile.open(fileobj=BytesIO(input_tar_content)) as tar:
        for tar_resource in tar:
            if tar_resource.isfile():
                inner_file_bytes = tar.extractfile(tar_resource).read()
                s3_client.upload_fileobj(BytesIO(inner_file_bytes), Bucket=bucket, Key=tar_resource.name)
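Applied to the code in the question, the `TypeError` arises because the first positional argument of `tarfile.open` is `name` (a path), so the `BytesIO` has to be passed as `fileobj=` instead. `extractall` also expects a directory path, not a stream. The sketch below is one possible way to keep each extracted file in its own in-memory stream. `untar_to_memory` and `build_demo_archive` are hypothetical names, and the demo archive stands in for the object downloaded from S3. Loading a SavedModel directory from these in-memory objects is a separate problem that this sketch does not solve.

```python
import io
import tarfile


def untar_to_memory(archive_stream):
    """Return {member name: BytesIO} for every regular file in a tar.gz stream."""
    # After download_fileobj the cursor sits at the end of the stream,
    # and tarfile expects the file object to be at position 0.
    archive_stream.seek(0)
    files = {}
    with tarfile.open(fileobj=archive_stream, mode="r:gz") as tar:
        for member in tar:
            if member.isfile():
                files[member.name] = io.BytesIO(tar.extractfile(member).read())
    return files


def build_demo_archive():
    """Build a small tar.gz in memory (stand-in for the S3 download)."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w:gz") as tar:
        for name, data in (("model/a.txt", b"hello"), ("model/b.bin", b"\x00\x01")):
            info = tarfile.TarInfo(name=name)
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))
    return buf  # cursor is left at the end, as after download_fileobj


if __name__ == "__main__":
    extracted = untar_to_memory(build_demo_archive())
    for name, stream in extracted.items():
        print(name, len(stream.getvalue()))
```

In the question's code, `file_stream1` would take the place of the demo archive once `object.download_fileobj(file_stream1)` has run.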

