Python: how to find an element based on a string that may span multiple child tags?

Posted by wi3ka0sx on 2024-01-05 in Python


I am trying to identify a specific element in a document based on a known text string.

soup.find(string=re.compile(".*some text string.*"))

However, the known string may be split across (multiple) child elements. For example, if this is our document:

test_doc = BeautifulSoup("""<html><h1>Title</h1><p>Some <b>text</b></p>""")

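I am looking for one specific element. The only thing I know about it is that it contains the text "Some text". I do *not* know that the word "text" sits inside a child bold tag.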

test_doc.find(string=re.compile(".*Some text.*"))

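returns None, because "text" is inside a child tag.
If I do not know whether or how the text is broken up into child tags, how can I return the parent tag (the p tag in my example) together with all of its children?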

python

Source: https://stackoverflow.com/questions/77748700/how-to-find-element-based-on-string-that-may-span-multiple-child-tags

3 Answers

kupeojn61#

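My first thought, given that it is unclear which tags (or how many) may be nested, was a CSS selector with the pseudo-class :-soup-contains("some text"). That may overshoot the mark, though, because it also returns every overlapping element whose combined text contains the string, ancestors included.
This is certainly not the best or even the most resilient approach, but perhaps a solution can be derived from it: in each case, pick out the smallest element that houses the text: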

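from bs4 import BeautifulSoup

test_doc = BeautifulSoup("""<html><h1>Title</h1><p>Some <b>text</b></p><div><p>Some <i>text</i> different than <div>before</div></p></div>""", 'html.parser')
# All elements whose combined text contains the string, ancestors included, in document order
selection = test_doc.select(':-soup-contains("Some text")')
# Drop the previous match whenever the current one has fewer descendant tags.
# Caveat: this deletes from the list while iterating over it, so the outcome
# depends on the order and nesting of the matches.
for i, el in enumerate(selection):
    if len(el.find_all()) < len(selection[i-1].find_all()):
        del selection[i-1]
print(selection)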

Result:

[<p>Some <b>text</b></p>, <p>Some <i>text</i> different than <div>before</div></p>]
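
The "smallest element that houses the text" idea can also be written without deleting from the list during iteration. The following is only a sketch, not the answer author's code: it keeps a match only if none of its descendants matches the same selector. It assumes a soupsieve version that supports :-soup-contains, and the sample markup (with the extra body and section wrappers) is made up for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical sample with deeper nesting than the original example
html = (
    "<html><body><h1>Title</h1><p>Some <b>text</b></p>"
    "<div><section><p>Some <i>text</i> different</p></section></div>"
    "</body></html>"
)
soup = BeautifulSoup(html, "html.parser")

selector = ':-soup-contains("Some text")'

# select() returns every element whose combined text contains the string,
# so html, body, div and section match as well as the two <p> tags.
# Calling select_one() on a tag searches only its descendants, so a result
# of None means no smaller element inside it also holds the text.
innermost = [el for el in soup.select(selector) if el.select_one(selector) is None]

print(innermost)
```

Because each candidate is checked against its own descendants, the result does not depend on the order in which the matches are listed.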

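Another option, if you can identify a set of tags that get in the way of your actual approach, is to unwrap() them first. I think this is also why @Andrej Kesely asked for some specific tags.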
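
A rough sketch of the unwrap() idea, under stated assumptions: the list of inline tags is hypothetical and has to be chosen per document, and smooth() (available in Beautiful Soup 4.8.0 and later) is used to merge the adjacent text nodes that unwrapping leaves behind. Note that this modifies the parsed tree, so the original inline markup is lost.

```python
import re
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    "<html><h1>Title</h1><p>Some <b>text</b></p></html>", "html.parser"
)

# Hypothetical set of inline tags assumed to split the text
INLINE_TAGS = ["b", "i", "em", "strong", "span"]

# Replace each inline tag with its own contents
for tag in soup.find_all(INLINE_TAGS):
    tag.unwrap()

# After unwrapping, <p> holds two adjacent text nodes ("Some " and "text");
# smooth() merges them into a single node.
soup.smooth()

# The original string= search now succeeds on the merged text node
match = soup.find(string=re.compile("Some text"))
print(match.parent)  # <p>Some text</p>
```

If the original markup must be preserved, the same steps could be run on a second parse of the document and the result mapped back, which this sketch does not attempt.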

2024-01-05

2o7dmzc52#

Another solution, inspired by @HedgeHog's answer:

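from bs4 import BeautifulSoup

test_doc = BeautifulSoup(
    """<html><h1>Title</h1><p>Some <b>text</b></p><div><p>Some <i>text</i> different than <div>before</div></p></div>""",
    "html.parser",
)
# All tags whose combined text contains the string, in document order
tags = test_doc.find_all(lambda tag: "Some text" in tag.text)
out = []
# Walk backwards: keep the innermost match, then discard its ancestors
while tags and (t := tags.pop()):
    # "t in tags[-1]" only checks direct children, so test all ancestors instead
    while tags and any(tags[-1] is parent for parent in t.parents):
        tags.pop()
    out.append(t)
print(out)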

Prints:

[<p>Some <i>text</i> different than <div>before</div></p>, <p>Some <b>text</b></p>]

2024-01-05

pengsaosao3#

Here is an approach using lxml and XPath. It also covers the case where the expected text is contained in a single node.

from lxml import etree
xml = """<html><h1>Title</h1>
    <div id="target">
        <div>Some <div><div><span><b>text</b></span></div></div></div>
        <div>Some <b>another text</b></div>
        <p>Some <i>text</i> different than <div>before</div></p>
        <em>Some text</em>
    </div>
</html>"""
root = etree.fromstring(xml)
ele = root.xpath('//div[@id="target"]//*[(./text()="Some " and .//*[1]/text()="text") or ./text()="Some text"]')
print(ele)

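The predicate .//*[1]/text()="text" looks for a first-child descendant of the context node that contains the expected string. The comparison is case-sensitive, so ./text()="some " will not find anything.
Result for the given sample: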

[<Element div at 0x7f2d65eef6c0>, <Element p at 0x7f2d65eef700>, <Element em at 0x7f2d65eef740>]

Extracting the content from the found elements:

print([[t for t in e.xpath('descendant-or-self::text()')] for e in ele])

Result:

[['Some ', 'text'], ['Some ', 'text', ' different than ', 'before'], ['Some text']]
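
The expression above hard-codes how the text is split ("Some " followed by "text"). When the split is unknown, one possible variation (a sketch, not part of the original answer) is to compare against the string value of each element, which in XPath 1.0 is the concatenation of all its descendant text nodes, and to keep only elements that have no descendant element matching as well. The variable name needle is an arbitrary choice; lxml passes it into the expression as an XPath variable.

```python
from lxml import etree

xml = """<html><h1>Title</h1>
    <div id="target">
        <div>Some <div><div><span><b>text</b></span></div></div></div>
        <div>Some <b>another text</b></div>
        <p>Some <i>text</i> different than <div>before</div></p>
        <em>Some text</em>
    </div>
</html>"""
root = etree.fromstring(xml)

# contains(., $needle) tests the element's full string value;
# the not(...) clause rejects ancestors that merely wrap a smaller match.
expr = "//*[contains(., $needle) and not(.//*[contains(., $needle)])]"
matches = root.xpath(expr, needle="Some text")

print([e.tag for e in matches])  # ['div', 'p', 'em']
```

Like the original expression, contains() is case-sensitive, and it does not normalize whitespace, so text split by line breaks or extra spaces would need normalize-space() or similar handling.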

2024-01-05
