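I am reading a PDF file with Form Recognizer and storing the response in the "result" variable/object. Using the syntax given in the Form Recognizer documentation for Azure Databricks/PySpark, I get the output shown below. Instead, I need the output in DataFrames, with each table in its own DataFrame. Please suggest the syntax. Thanks in advance.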
with open(formUrl, "rb") as f:
    poller = document_analysis_client.begin_analyze_document("prebuilt-layout", document=f)
result = poller.result()
for table_idx, table in enumerate(result.tables):
    print(
        "Table # {} has {} rows and {} columns".format(
            table_idx, table.row_count, table.column_count
        )
    )
    for cell in table.cells:
        print(
            "...Cell[{}][{}] has content '{}'".format(
                cell.row_index,
                cell.column_index,
                cell.content.encode("utf-8"),
            )
        )
Output
Table # 0 has 3 rows and 7 columns
...Cell[0][0] has content 'b'BIOMARKER''
...Cell[0][1] has content 'b'METHOD|''
...Cell[0][2] has content 'b'ANALYTE''
...Cell[0][3] has content 'b'RESULT''
...Cell[0][4] has content 'b'THERAPY ASSOCIATION''
...Cell[0][6] has content 'b'BIOMARKER LEVELE''
...Cell[1][0] has content 'b'''
...Cell[1][1] has content 'b'IHC''
...Cell[1][2] has content 'b'Protein''
...Cell[1][3] has content 'b'Negative | 0''
...Cell[1][4] has content 'b'LACK OF BENEFIT''
...Cell[1][5] has content 'b'alectinib, brigatinib''
...Cell[1][6] has content 'b'Level 1''
...Cell[2][0] has content 'b'ALK''
...Cell[2][1] has content 'b'Seq''
...Cell[2][2] has content 'b'RNA-Tumor''
...Cell[2][3] has content 'b'Fusion Not Detected''
...Cell[2][5] has content 'b'ceritinib''
...Cell[2][6] has content 'b'Level 1''
...Cell[3][1] has content 'b'''
...Cell[3][2] has content 'b'''
...Cell[3][3] has content 'b'''
...Cell[3][5] has content 'b'crizotinib''
...Cell[3][6] has content 'b'Level 1''
Table # 1 has 3 rows and 4 columns
...Cell[0][0] has content 'b'''
...Cell[0][1] has content 'b'''
...Cell[0][2] has content 'b'''
...Cell[0][3] has content 'b'''
...Cell[1][0] has content 'b'NTRK1/2/3''
...Cell[1][1] has content 'b'Seq''
...Cell[1][2] has content 'b'RNA-Tumor''
...Cell[1][3] has content 'b'Fusion Not Detected''
...Cell[2][0] has content 'b'Tumor Mutational Burden''
...Cell[2][1] has content 'b'Seq''
...Cell[2][2] has content 'b'DNA-Tumor''
...Cell[2][3] has content 'b'High | 19 Mutations/ Mb''
1 Answer
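I tried reading a PDF document with Azure Form Recognizer and converting it to a DataFrame in Azure Databricks. The detailed steps are:
1. Install the packages
2. Connect to the Azure Storage container
3. Enable Cognitive Services
In the code, we supply the key and endpoint of the Form Recognizer resource we created in place of cognitiveServicesEndpoint and cognitiveServicesKey.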
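4. Send the file to Cognitive Services and convert the result to a DataFrame
5. Upload the resulting DataFrame to Azure
For security reasons, I used placeholder names instead of the real ones: StorageAccountName = the name of the storage account created earlier, OutputContainer = the container created to store the output files, formreckey = the key of the Form Recognizer resource.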
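The conversion in step 4 can be sketched as follows. This is a minimal sketch, not the answer's original code: it assumes `result` is the object returned by `poller.result()` in the question, and the helper names `table_to_grid` and `tables_to_dataframes` are hypothetical. Each table's cells are placed into a grid by `row_index` and `column_index`, and one pandas DataFrame is built per table. Note that `cell.content` is used directly; calling `.encode("utf-8")` on it is what produces the `b'...'` wrappers in the printed output.

```python
from types import SimpleNamespace


def table_to_grid(table):
    """Arrange a table's cells into a grid (list of rows) of strings."""
    # Size the grid from the cells too, in case a cell index exceeds
    # the reported row_count/column_count.
    n_rows = max([table.row_count] + [c.row_index + 1 for c in table.cells])
    n_cols = max([table.column_count] + [c.column_index + 1 for c in table.cells])
    grid = [["" for _ in range(n_cols)] for _ in range(n_rows)]
    for cell in table.cells:
        grid[cell.row_index][cell.column_index] = cell.content
    return grid


def tables_to_dataframes(result, header=True):
    """Return a list with one pandas DataFrame per table in result.tables."""
    import pandas as pd  # imported lazily so table_to_grid works without pandas

    frames = []
    for table in result.tables:
        grid = table_to_grid(table)
        if header and len(grid) > 1:
            # Assumption: the first row holds the column names.
            frames.append(pd.DataFrame(grid[1:], columns=grid[0]))
        else:
            frames.append(pd.DataFrame(grid))
    return frames


# Stand-in object with the same attribute names as a table in the result,
# used only to illustrate the grid layout without calling the service.
demo_table = SimpleNamespace(
    row_count=2,
    column_count=2,
    cells=[
        SimpleNamespace(row_index=0, column_index=0, content="BIOMARKER"),
        SimpleNamespace(row_index=0, column_index=1, content="METHOD"),
        SimpleNamespace(row_index=1, column_index=1, content="IHC"),
    ],
)
print(table_to_grid(demo_table))  # [['BIOMARKER', 'METHOD'], ['', 'IHC']]
```

With a real response, `dfs = tables_to_dataframes(result)` would give `dfs[0]` for Table # 0, `dfs[1]` for Table # 1, and so on. In Databricks, each pandas DataFrame can then be passed to `spark.createDataFrame(...)` if a Spark DataFrame is needed. For tables whose first row is empty (as in Table # 1 above), passing `header=False` avoids blank, duplicated column names.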