Azure SDK for Python: reading blob data in chunks

m2xkgtsf · posted 2022-12-14 · in Python

I am reading data from a blob that is roughly 5 GB in size, but I normally work with data of about 500 MB at a time. So I would like to read the data in smaller chunks, e.g. 300 MB, over multiple iterations. Is there a way to do this, i.e. read the data in smaller increments instead of calling readall()?

from azure.storage.blob import BlobClient

blob_client = BlobClient(blob_service_client.url,
                         container_name,
                         blob_name,
                         credential=credential)

# download_blob() returns a StorageStreamDownloader;
# readall() pulls the whole 5 GB blob into memory at once.
data_stream = blob_client.download_blob()
data = data_stream.readall()
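
Something like the rough sketch below is what I am after: ranged downloads where each call fetches only part of the blob (the 300 MB chunk size is just an example, and blob_client is the client created above):

chunk_size = 300 * 1024 * 1024  # example: read roughly 300 MB per request

blob_size = blob_client.get_blob_properties().size
offset = 0
while offset < blob_size:
    # Download only the byte range [offset, offset + chunk_size).
    ranged_stream = blob_client.download_blob(offset=offset, length=chunk_size)
    data = ranged_stream.readall()  # at most chunk_size bytes
    # ... process `data` here ...
    offset += len(data)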

How can I use the chunks approach below together with the BlobServiceClient above?

# This returns a StorageStreamDownloader.
stream = source_blob_client.download_blob()
block_list = []

# Read data in chunks to avoid loading it all into memory at once.
for chunk in stream.chunks():
    # Process the data here (`chunk` is a bytes object).
    block_id = str(uuid.uuid4())
    destination_blob_client.stage_block(block_id=block_id, data=chunk)
    block_list.append(BlobBlock(block_id=block_id))
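
(For context: I assume source_blob_client and destination_blob_client in that snippet would be created from the BlobServiceClient roughly as below; the container and blob names are placeholders, and the staged blocks presumably still need to be committed afterwards.)

source_blob_client = blob_service_client.get_blob_client(
    container="source-container", blob="large-file.bin")
destination_blob_client = blob_service_client.get_blob_client(
    container="dest-container", blob="large-file-copy.bin")

# After the loop above, the staged blocks still have to be assembled:
# destination_blob_client.commit_block_list(block_list)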

jv4diomz · Answer 1

I tried this in my environment and got the following results:

How can I use the chunks approach below together with the BlobServiceClient above?

Code:

from azure.storage.blob import BlobServiceClient, BlobBlock
import uuid

connection_string = "storage connection string"
blob_service_client = BlobServiceClient.from_connection_string(connection_string)
container_client = blob_service_client.get_container_client('test1')
blob_client = container_client.get_blob_client("file.pdf")

# Upload the data block by block.
block_list = []
chunk_size = 4 * 1024 * 1024
with open("C:\\Users\\****\\****\\sample12 (2).pdf", 'rb') as f:
    while True:
        read_data = f.read(chunk_size)
        if not read_data:
            break  # done
        blk_id = str(uuid.uuid4())
        blob_client.stage_block(block_id=blk_id, data=read_data)
        block_list.append(BlobBlock(block_id=blk_id))

blob_client.commit_block_list(block_list)

To upload each block you can use the **BlobClient.stage_block** method. After uploading, the **BlobClient.commit_block_list** method combines all of the staged blocks into a single blob.

(Console output and Azure portal screenshots omitted.)
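
The same pattern should also work when the source is another blob rather than a local file: drive stage_block from stream.chunks() on the downloaded source blob. A sketch, reusing the connection string above and placeholder blob names:

from azure.storage.blob import BlobServiceClient, BlobBlock
import uuid

blob_service_client = BlobServiceClient.from_connection_string(connection_string)
source_blob_client = blob_service_client.get_blob_client("test1", "file.pdf")
destination_blob_client = blob_service_client.get_blob_client("test1", "file-copy.pdf")

block_list = []
stream = source_blob_client.download_blob()
for chunk in stream.chunks():  # chunks are fetched lazily, not all at once
    blk_id = str(uuid.uuid4())
    destination_blob_client.stage_block(block_id=blk_id, data=chunk)
    block_list.append(BlobBlock(block_id=blk_id))

destination_blob_client.commit_block_list(block_list)

As far as I know, the size of the pieces yielded by chunks() can be tuned with the max_chunk_get_size keyword argument when constructing the client (it defaults to 4 MB).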

You can also refer to the SO-thread written by Jim Xu for another method of copying a blob in blocks between two containers.
