pandas: Unable to scrape data from a local HTML file using BeautifulSoup

f1tvaqid  posted on 2023-01-24  in Other

I tried to use BeautifulSoup to scrape the table rows from a locally available HTML file (download link below), but without any success.
Here is my attempt:

from bs4 import BeautifulSoup
import json

with open("web_summary.html", "r") as file:
    html_file = file.read()

soup = BeautifulSoup(html_file, "html.parser")

script = soup.find("div", {"data-component": "CellRangerSummary", "data-key": "summary"}).find('script')
table_data = json.loads(script.text.split('=')[1], encoding='utf-8')
summary_data = table_data['summary']
summary_tab = summary_data['summary_tab']

rows = summary_tab['table']['rows']

for row in rows:
    print(row[0],row[1])

html file download link
Below is the expected output as a DataFrame (the rows of all tables):

Number of Spots Under Tissue    2,987
Mean Reads per Spot 128,583
Median Genes per Spot   4,553
Number of Reads 384,076,450
Valid Barcodes  97.70%
Valid UMIs  99.90%
Sequencing Saturation   80.20%
Q30 Bases in Barcode    98.90%
Q30 Bases in RNA Read   89.60%
Q30 Bases in UMI    98.80%
Reads Mapped to Genome  86.00%
Reads Mapped Confidently to Genome  79.10%
Reads Mapped Confidently to Intergenic Regions  5.20%
Reads Mapped Confidently to Intronic Regions    0.00%
Reads Mapped Confidently to Exonic Regions  73.90%
Reads Mapped Confidently to Transcriptome   65.60%
Reads Mapped Antisense to Gene  1.40%
Fraction Reads in Spots Under Tissue    97.30%
Mean Reads per Spot 128,583
Median Genes per Spot   4,553
Total Genes Detected    21,673
Median UMI Counts per Spot  14,169

Any ideas (BeautifulSoup or any other framework) on how to make my code work?


dced5bon  #1

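The rows you are looking for do not sit neatly in one table; they are spread across several tables inside a script tag. The script below collects the data from all of those tables. It is the closest possible solution that keeps the approach you started with: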

from bs4 import BeautifulSoup
import requests
import json

link = 'https://cell2location.cog.sanger.ac.uk/tutorial/mouse_brain_visium_data/rawdata/ST8059048/web_summary.html'

# Send a browser-like User-Agent with the request
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/108.0.0.0 Safari/537.36'
}
res = requests.get(link, headers=headers)
soup = BeautifulSoup(res.text, "html.parser")

# The table data is embedded as JSON in a <script> tag, after "const data = "
script = soup.find("div", {"data-component": "CellRangerSummary", "data-key": "summary"}).find('script')
table_data = json.loads(script.contents[0].strip().split('const data = ')[1])
summary_data = table_data['summary']

# summary_tab holds several sections; print the rows of each one that has a table
for item, val in summary_data['summary_tab'].items():
    if not val.get('table'):
        continue
    rows = val['table']['rows']

    for row in rows:
        print(row[0], row[1])

Output:

Number of Reads 384,076,450
Valid Barcodes 97.7%
Valid UMIs 99.9%
Sequencing Saturation 80.2%
Q30 Bases in Barcode 98.9%
Q30 Bases in RNA Read 89.6%
Q30 Bases in UMI 98.8%
Fraction Reads in Spots Under Tissue 97.3%
Mean Reads per Spot 128,583
Median Genes per Spot 4,553
Total Genes Detected 21,673
Median UMI Counts per Spot 14,169
Reads Mapped to Genome 86.0%
Reads Mapped Confidently to Genome 79.1%
Reads Mapped Confidently to Intergenic Regions 5.2%
Reads Mapped Confidently to Intronic Regions 0.0%
Reads Mapped Confidently to Exonic Regions 73.9%
Reads Mapped Confidently to Transcriptome 65.6%
Reads Mapped Antisense to Gene 1.4%
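
For readers who want to see why the loop over the sections matters, here is a minimal, standard-library-only sketch of the same idea. It assumes the script text has the shape `const data = {...};` and that each section of `summary_tab` may or may not carry a `table`. The sample payload, the section names (`sequencing`, `mapping`, `alarms`) and the helper names `extract_payload` and `collect_rows` are hypothetical, made up for illustration and not taken from the real file. Two details of the question's code probably get in the way: `json.loads` no longer accepts an `encoding` keyword since Python 3.9, and `split('=')[1]` keeps only the text up to the next `=`, which breaks if the JSON itself contains one.

```python
import json

# Hypothetical script text, shaped like the payload embedded in the page.
SCRIPT_TEXT = '''
const data = {"summary": {"summary_tab": {
  "sequencing": {"table": {"rows": [["Number of Reads", "384,076,450"],
                                    ["Valid Barcodes", "97.7%"]]}},
  "mapping": {"table": {"rows": [["Reads Mapped to Genome", "86.0%"]]}},
  "alarms": {"alarms": []}
}}};
'''


def extract_payload(script_text, marker="const data = "):
    """Decode the JSON object that starts right after `marker`."""
    start = script_text.index(marker) + len(marker)
    # raw_decode stops at the end of the JSON value, so a trailing ";"
    # or further JavaScript after the object does not cause an error.
    payload, _end = json.JSONDecoder().raw_decode(script_text, start)
    return payload


def collect_rows(summary_tab):
    """Gather (field, value) pairs from every section that has a table."""
    rows = []
    for section in summary_tab.values():
        # Skip sections that are not dicts or that carry no table.
        table = section.get("table") if isinstance(section, dict) else None
        if table:
            rows.extend((row[0], row[1]) for row in table["rows"])
    return rows


payload = extract_payload(SCRIPT_TEXT)
rows = collect_rows(payload["summary"]["summary_tab"])
for field, value in rows:
    print(field, value)
```

If pandas is available, a list of pairs like `rows` can be passed to `pandas.DataFrame(rows, columns=["field", "value"])` to get the DataFrame the question asks for.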

6rqinv9w  #2

pandas has a read_html function that works for your case:

import pandas as pd

# The sequencing, mapping, spots and sample tables are separate, so concatenate them
df = pd.concat(pd.read_html('web_summary.html'))
df.columns = ['field','value']
print(df)
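
The same pattern can be tried on a small inline document, which may help when checking how `read_html` treats the values. The sketch below is only an illustration: the HTML string is a hypothetical stand-in for the real file, and it assumes that a parser backend for `read_html` (lxml, or bs4 with html5lib) is installed. By default `read_html` treats `,` as a thousands separator, so `384,076,450` would come back as a number; passing `thousands=None` keeps the text as it appears on the page.

```python
from io import StringIO

import pandas as pd

# Hypothetical stand-in for a page with several two-column tables.
HTML = """
<table>
  <tr><td>Number of Reads</td><td>384,076,450</td></tr>
  <tr><td>Valid Barcodes</td><td>97.7%</td></tr>
</table>
<table>
  <tr><td>Reads Mapped to Genome</td><td>86.0%</td></tr>
</table>
"""

# read_html returns one DataFrame per <table> element.
# thousands=None keeps "384,076,450" as text instead of parsing it as a number.
tables = pd.read_html(StringIO(HTML), thousands=None)

# ignore_index=True renumbers the rows 0..n-1 across the combined tables.
df = pd.concat(tables, ignore_index=True)
df.columns = ["field", "value"]
print(df)
```

Without `ignore_index=True`, each table keeps its own row labels, so the combined frame would contain repeated index values.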
