Python: reading all values of an HTML table with Selenium

Posted by vohkndzv on 2023-02-28 in Python

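I am trying to read every element of the HTML table below and convert it to a DataFrame, but none of the numeric values are picked up by my get_attribute call. I also tried .get_attribute('td'), .get_attribute('tr') and .get_attribute('outerHTML'), and still get the result shown below. This is the code I am using: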

bond_totals_table = driver.find_element(By.XPATH,'/html/body/form[2]/table/tbody/tr/td/table/body').get_attribute('td')
bond_totals_table = pd.read_html(bond_totals_table, flavor = 'bs4')
0   Increment Number    Action  Current Acres   Add Delete  Acres for Calculation   Adjusted Amount Status  Bond?
1   NaN Existing Modify New Closed Reactivate Reconcile NaN NaN NaN NaN NaN ACT INA PH1 PH2 PH3 TRM Yes No
2   NaN Existing Modify New Closed Reactivate Reconcile NaN NaN NaN NaN NaN ACT INA PH1 PH2 PH3 TRM Yes No
3   NaN Existing Modify New Closed Reactivate Reconcile NaN NaN NaN NaN NaN ACT INA PH1 PH2 PH3 TRM Yes No
4   NaN Existing Modify New Closed Reactivate Reconcile NaN NaN NaN NaN NaN ACT INA PH1 PH2 PH3 TRM Yes No
5   NaN Existing Modify New Closed Reactivate Reconcile NaN NaN NaN NaN NaN ACT INA PH1 PH2 PH3 TRM Yes No
6   NaN Existing Modify New Closed Reactivate Reconcile NaN NaN NaN NaN NaN ACT INA PH1 PH2 PH3 TRM Yes No
7   NaN Existing Modify New Closed Reactivate Reconcile NaN NaN NaN NaN NaN ACT INA PH1 PH2 PH3 TRM Yes No

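It looks like the table used to be editable but no longer is, and for some reason get_attribute does not return the values displayed in the greyed-out cells.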

b1uwtaje (answer #1)

To read all the values of the HTML table, target the <table> element itself, use WebDriverWait with visibility_of_element_located() to wait until it is visible, and then extract its outerHTML, as follows:

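import pandas as pd
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

# `driver` is the WebDriver instance from the question; wait up to 20 s for the table to become visible
bond_totals_table_data = WebDriverWait(driver, 20).until(EC.visibility_of_element_located((By.XPATH, "//html/body/form[2]/table/tbody/tr/td/table"))).get_attribute('outerHTML')
bond_totals_table = pd.read_html(bond_totals_table_data)  # returns a list of DataFrames
print(bond_totals_table)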
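
If the numeric columns still come back as NaN after this, one possible cause is that the greyed-out cells contain <input> elements. In that case the number shown on the page lives in the input's value attribute, not in the cell text, so a text-based table parser finds nothing to read. The question does not show the page's HTML, so this is an assumption, although the option lists repeated in every row of the output ("Existing Modify New ...", "Yes No") do suggest form controls. The sketch below uses only the standard library and a made-up table; SAMPLE_HTML, TableValueParser and table_rows are hypothetical names, and nested tables are not handled.

```python
from html.parser import HTMLParser

# Hypothetical markup standing in for the outerHTML returned by Selenium.
SAMPLE_HTML = """
<table>
  <tr><th>Increment Number</th><th>Current Acres</th><th>Status</th></tr>
  <tr>
    <td><input type="text" value="1" disabled></td>
    <td><input type="text" value="12.5" disabled></td>
    <td>ACT</td>
  </tr>
</table>
"""


class TableValueParser(HTMLParser):
    """Collect table rows, treating <input value="..."> as cell content."""

    def __init__(self):
        super().__init__()
        self.rows = []     # finished rows, each a list of cell strings
        self._row = None   # cells of the row currently being parsed
        self._cell = None  # text fragments of the cell currently being parsed

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []
        elif tag == "input" and self._cell is not None:
            # Assumption: the displayed value is stored in the value attribute.
            value = dict(attrs).get("value")
            if value:
                self._cell.append(value)

    def handle_data(self, data):
        # Keep visible text that appears inside a cell, ignoring whitespace.
        if self._cell is not None and data.strip():
            self._cell.append(data.strip())

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append(" ".join(self._cell))
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def table_rows(html):
    """Return the table in `html` as a list of rows of strings."""
    parser = TableValueParser()
    parser.feed(html)
    parser.close()
    return parser.rows


rows = table_rows(SAMPLE_HTML)
print(rows)
```

The first row can serve as column names and the remaining rows as data when building a DataFrame. In a live Selenium session the equivalent is to locate the <input> elements inside the table and call get_attribute('value') on each of them.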

6pp0gazn (answer #2)

You can use Beautiful Soup together with pandas. Here is an example that reads a table from a CDC page:

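import pandas as pd
from bs4 import BeautifulSoup
from selenium import webdriver

with webdriver.Firefox() as browser:
    browser.get("https://www.cdc.gov/nchs/nhis/shs/tables.htm")
    html = browser.page_source  # HTML of the page as loaded by the browser
    soup = BeautifulSoup(html, "html.parser")
    tbl = soup.select_one("#example")  # the element with id="example"
    df = pd.read_html(str(tbl))  # returns a list of DataFrames
    print(df[0])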
