Selenium: how to parse text outside of tags

sqyvllje  posted on 2022-11-10 in Other

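I am parsing a text in which every word has been turned into a link. The problem is that the punctuation marks are not part of the content of the <a> tags; they sit outside the tags, so I don't know how to retrieve them.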

<table>
  <tbody>
    <tr>
      <td>
        <a href="#">Lorem</a>
        ", "
        <a href="#">Ipsum</a>
        ": "
        <a href="#">dolor</a>
        "."
      </td>
      <td>...</td>
    </tr>
    <tr>
      <td>
        <a href="#">sit</a>
        "? '"
        <a href="#">amet</a>
        "' "
        <a href="#">consectetur</a>
        "..."
      </td>
      <td>...</td>
    </tr>
    <tr>
      <td>
        <a href="#">adipisicing</a>
        "-"
        <a href="#">elit</a>
        "; "
        <a href="#">Molestias</a>
        "!"
      </td>
      <td>...</td>
    </tr>
  </tbody>
</table>

Here is the parser:

from bs4 import BeautifulSoup
from selenium import webdriver

driver = webdriver.Chrome(executable_path="...")
driver.get(url=...)
soup = BeautifulSoup(driver.page_source, 'html.parser')

words = []
for tableRows in soup.select("table > tbody > tr"):
  # Only the <a> elements are visited, so the punctuation between them is lost
  for word in tableRows.find("td").select("a"):
    words.append(word.text)

print(words)

Answer #1 (6l7fqoea)

The text between the a elements belongs to the parent td element itself.
You can get the text directly from the td element, like this:

from bs4 import BeautifulSoup
from selenium import webdriver

driver = webdriver.Chrome(executable_path="...")
driver.get(url=...)
soup = BeautifulSoup(driver.page_source, 'html.parser')

words = []
for tableRow in soup.select("table > tbody > tr"):
  # .text on the first td returns the link text plus the punctuation between the links
  words.append(tableRow.find("td").text)

print(words)
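
To make the point about the parent td concrete, here is a minimal sketch that runs without a browser. SAMPLE_HTML and cell_pieces are hypothetical names introduced only for this illustration, and the sketch assumes the quotes shown around the punctuation in the question are just how the browser's inspector displays text nodes, not part of the page text. It walks the direct children of the td and keeps both the links and the bare text nodes, in document order:

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString

# Hypothetical inline copy of one row of the question's table (assumption:
# the punctuation is plain text sitting between the <a> elements).
SAMPLE_HTML = """
<table><tbody><tr>
  <td><a href="#">Lorem</a>, <a href="#">Ipsum</a>: <a href="#">dolor</a>.</td>
  <td>...</td>
</tr></tbody></table>
"""


def cell_pieces(td):
    """Return link texts and bare text nodes of a cell, in document order."""
    pieces = []
    for node in td.children:
        if isinstance(node, NavigableString):
            # A text node that is a direct child of the td, i.e. punctuation
            text = str(node).strip()
            if text:
                pieces.append(text)
        else:
            # A tag, here one of the <a> elements
            pieces.append(node.get_text(strip=True))
    return pieces


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
first_td = soup.select_one("table > tbody > tr > td")
pieces = cell_pieces(first_td)
print(pieces)
```

With a real page, the same function could presumably be called on each td found in driver.page_source instead of the inline sample.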

Update

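If you want the punctuation marks as separate objects, you can split the table row text on spaces. The code below should also strip the leading and trailing whitespace from each piece.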

from bs4 import BeautifulSoup
from selenium import webdriver

driver = webdriver.Chrome(executable_path="...")
driver.get(url=...)
soup = BeautifulSoup(driver.page_source, 'html.parser')

words = []
for tableRow in soup.select("table > tbody > tr"):
  tableRowtext = tableRow.text
  # Split on spaces, trim each piece, and drop the empty strings left by indentation
  rowTexts = [x.strip() for x in tableRowtext.split(' ') if x.strip()]
  words.append(rowTexts)

print(words)
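
Splitting on spaces leaves punctuation glued to a word whenever there is no space between them (for example "Lorem," stays one piece). As a possible alternative, not part of the original answer, a regular expression can separate runs of word characters from runs of punctuation. ROW_TEXTS, TOKEN_RE and tokenize are hypothetical names, and the strings are what the first td of each row might return under the assumption that the quotes in the question are not part of the page text:

```python
import re

# Hypothetical cell texts, as td.text might return them for the sample table.
ROW_TEXTS = [
    "Lorem, Ipsum: dolor.",
    "sit? 'amet' consectetur...",
    "adipisicing-elit; Molestias!",
]

# A token is either a run of word characters or a run of
# characters that are neither word characters nor whitespace.
TOKEN_RE = re.compile(r"\w+|[^\w\s]+")


def tokenize(text):
    """Split text into word tokens and punctuation tokens, dropping whitespace."""
    return TOKEN_RE.findall(text)


tokens = [tokenize(text) for text in ROW_TEXTS]
for row in tokens:
    print(row)
```

Note that this discards the spaces themselves, so a text node such as ", " comes back as just ",". If the exact spacing matters, the text nodes would have to be kept unmodified instead.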
