Python: using multiprocessing with boto3 S3 upload_fileobj causes an SSL error

0sgqnhkj  posted on 2022-12-21 in Python

In an AWS Lambda function on the python3.9 runtime with boto3 1.20.32, I run the following code:

import boto3
from multiprocessing import Process

s3_client = boto3.client(service_name="s3")
s3_bucket = "bucket"
s3_other_bucket = "other_bucket"

def multiprocess_s3upload(tar_index: dict):

    def _upload(filename, bytes_range):

        src_key = ...

        # get single raw file in tar with bytes range
        s3_obj = s3_client.get_object(
            Bucket=s3_bucket,
            Key=src_key,
            Range=f"bytes={bytes_range}"
        )

        # upload raw file
        # the error occurs here
        s3_client.upload_fileobj(
            s3_obj["Body"],
            s3_other_bucket,
            filename
        )

    def _wait(procs):
        for p in procs:
            p.join()
    
    processes = []
    proc_limit = 256  # limit concurrent processes to avoid the "too many open files" error
    for filename, bytes_range in tar_index.items():
        # filename = "hello.txt"
        # bytes_range = "1024-2048"
       
        proc = Process(
            target=_upload,
            args=(filename, bytes_range)
        )
        proc.start()
        processes.append(proc)
        
        if len(processes) == proc_limit:
            _wait(processes)
            processes = []

    _wait(processes)

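This program extracts individual raw files from a tar file in one S3 bucket and uploads each raw file to another S3 bucket. A single tar file may contain thousands of raw files, so I use multiprocessing to speed up the S3 uploads.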
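The question does not show how tar_index is built. As a rough sketch only, assuming the archive is an uncompressed tar, a filename-to-byte-range mapping of the shape used above could be derived with the standard tarfile module. The helper names build_tar_index and make_demo_tar are hypothetical and are not part of the original code. Note that HTTP Range values are inclusive on both ends, so the end offset is start + size - 1.

```python
import io
import tarfile


def build_tar_index(tar_bytes: bytes) -> dict:
    """Map each regular file in an uncompressed tar to an inclusive 'start-end' range."""
    index = {}
    # mode "r:" opens the archive strictly without compression, so the
    # offsets refer to positions in the raw tar bytes.
    with tarfile.open(fileobj=io.BytesIO(tar_bytes), mode="r:") as tar:
        for member in tar:
            # Skip directories, links and empty files (an empty file has no valid range).
            if not member.isfile() or member.size == 0:
                continue
            start = member.offset_data          # where the file's data begins
            end = start + member.size - 1       # inclusive end, as HTTP Range expects
            index[member.name] = f"{start}-{end}"
    return index


def make_demo_tar(files: dict) -> bytes:
    """Build a small in-memory tar for demonstration (hypothetical helper)."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tar:
        for name, data in files.items():
            info = tarfile.TarInfo(name=name)
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))
    return buf.getvalue()
```

Each value can then be passed as bytes_range, so that Range=f"bytes={bytes_range}" fetches exactly one member's data.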
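However, when processing the same tar file, I randomly get an SSLError exception in one of the child processes. I tried different tar files and got the same result. Only the last child process threw the exception; the rest worked fine.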

Process Process-2:
Traceback (most recent call last):
File "/var/runtime/urllib3/response.py", line 441, in _error_catcher
  yield
File "/var/runtime/urllib3/response.py", line 522, in read
  data = self._fp.read(amt) if not fp_closed else b""
File "/var/lang/lib/python3.9/http/client.py", line 463, in read
  n = self.readinto(b)
File "/var/lang/lib/python3.9/http/client.py", line 507, in readinto
  n = self.fp.readinto(b)
File "/var/lang/lib/python3.9/socket.py", line 704, in readinto
  return self._sock.recv_into(b)
File "/var/lang/lib/python3.9/ssl.py", line 1242, in recv_into
  return self.read(nbytes, buffer)
File "/var/lang/lib/python3.9/ssl.py", line 1100, in read
  return self._sslobj.read(len, buffer)
ssl.SSLError: [SSL: WRONG_VERSION_NUMBER] wrong version number (_ssl.c:2633)

During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/var/lang/lib/python3.9/multiprocessing/process.py", line 315, in _bootstrap
  self._target(*self._args, **self._kwargs)
File "/var/task/main.py", line 144, in _upload
  s3_client.upload_fileobj(
File "/var/runtime/boto3/s3/inject.py", line 540, in upload_fileobj
  return future.result()
File "/var/runtime/s3transfer/futures.py", line 103, in result
  return self._coordinator.result()
File "/var/runtime/s3transfer/futures.py", line 266, in result
  raise self._exception
File "/var/runtime/s3transfer/tasks.py", line 269, in _main
  self._submit(transfer_future=transfer_future, **kwargs)
File "/var/runtime/s3transfer/upload.py", line 588, in _submit
  if not upload_input_manager.requires_multipart_upload(
File "/var/runtime/s3transfer/upload.py", line 404, in requires_multipart_upload
  self._initial_data = self._read(fileobj, threshold, False)
File "/var/runtime/s3transfer/upload.py", line 463, in _read
  return fileobj.read(amount)
File "/var/runtime/botocore/response.py", line 82, in read
  chunk = self._raw_stream.read(amt)
File "/var/runtime/urllib3/response.py", line 544, in read
  raise IncompleteRead(self._fp_bytes_read, self.length_remaining)
File "/var/lang/lib/python3.9/contextlib.py", line 137, in __exit__
  self.gen.throw(typ, value, traceback)
File "/var/runtime/urllib3/response.py", line 452, in _error_catcher
  raise SSLError(e)

urllib3.exceptions.SSLError: [SSL: WRONG_VERSION_NUMBER] wrong version number (_ssl.c:2633)

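According to a similar question from 10 years ago, "Multi-threaded S3 download doesn't terminate", the root cause may be that the boto3 S3 upload uses a non-thread-safe library to send HTTP requests. However, that solution did not work for me.
I also found a boto3 issue about my problem. There, the problem went away without the author changing anything:
"Actually, the problem has recently disappeared on its own, without any (!) change on my part. I suspect the problem was created and fixed by Amazon. I'm only afraid it will return..."
Does anyone know how to fix this?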

h22fl7wq  #1

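According to the boto3 documentation on multiprocessing (doc):
"Resource instances are not thread safe and should not be shared across threads or processes. These special classes contain additional meta data that cannot be shared. It's recommended to create a new Resource for each thread or process:"
Here is my modified code: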

def multiprocess_s3upload(tar_index: dict):

    def _upload(filename, bytes_range):

        src_key = ...

        # get single raw file in tar with bytes range
        s3_client = boto3.client(service_name="s3")   # <<<< one client per process
        s3_obj = s3_client.get_object(
            Bucket=s3_bucket,
            Key=src_key,
            Range=f"bytes={bytes_range}"
        )

        # upload raw file
        s3_client.upload_fileobj(
            s3_obj["Body"],
            s3_other_bucket,
            filename
        )

    def _wait(procs):
        ...
    
    ...

With this change, the SSLError exception no longer seems to occur.
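A plausible explanation, not confirmed in this thread, is that child processes created by fork inherit the parent client's pooled HTTPS connections, and several processes using the same TLS socket can corrupt the stream. The sketch below restates the fix as module-level functions under that assumption. The names make_s3_client, copy_member, chunked and copy_all are hypothetical, as is passing src_key in as a parameter (the original elides how it is computed). It also checks each child's exit code, since an exception in a child does not propagate to the parent on its own.

```python
from multiprocessing import Process


def make_s3_client():
    # Imported lazily only so this sketch can be loaded without boto3 installed.
    import boto3
    return boto3.client("s3")


def copy_member(filename, bytes_range, src_bucket, src_key, dst_bucket,
                client_factory=make_s3_client):
    """Copy one byte range of src_key into dst_bucket under the key `filename`."""
    # Create the client here, inside the child, so that no connection pool
    # (and no TLS socket) is inherited from the parent process.
    client = client_factory()
    obj = client.get_object(Bucket=src_bucket, Key=src_key,
                            Range=f"bytes={bytes_range}")
    client.upload_fileobj(obj["Body"], dst_bucket, filename)


def chunked(items, size):
    """Yield successive lists of at most `size` items."""
    items = list(items)
    for start in range(0, len(items), size):
        yield items[start:start + size]


def copy_all(tar_index, src_bucket, src_key, dst_bucket, proc_limit=256):
    """Run copy_member in batches of child processes; return the filenames that failed."""
    failed = []
    for batch in chunked(tar_index.items(), proc_limit):
        procs = []
        for filename, bytes_range in batch:
            proc = Process(target=copy_member,
                           args=(filename, bytes_range, src_bucket, src_key, dst_bucket))
            proc.start()
            procs.append((filename, proc))
        for filename, proc in procs:
            proc.join()
            if proc.exitcode != 0:  # non-zero: the child raised or was killed
                failed.append(filename)
    return failed
```

The client_factory parameter exists only so the copy logic can be exercised with a stand-in client; in the Lambda function the default would be used.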
