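I want to write a program with Python and Beautiful Soup that uses a CSV file of anchor texts and hyperlinks to turn matching words in an HTML document into links.
The CSV file has 2 columns:
| Anchor text | Hyperlink |
| --- | --- |
| Google | https://www.google.com |
| Bing | https://bing.com |
| Yahoo | https://yahoo.com |
| Active Campaign | https://activecampaign.com |
Here is the sample HTML: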
<!-- wp:paragraph -->
<p>This is a existing link <a class="test" href="https://yahoo.com/">Yahoo</a> Text</p>
<!-- /wp:paragraph -->
<!-- wp:paragraph -->
<p>This is another Google Text</p>
<!-- /wp:paragraph -->
<!-- wp:paragraph -->
<p>This is another lowercase bing Text</p>
<!-- /wp:paragraph -->
<!-- wp:paragraph -->
<p>This is another multi word Active Campaign Text</p>
<!-- /wp:paragraph -->
I want the output to be:
<!-- wp:paragraph -->
<p>This is a existing link <a href="https://yahoo.com/">Yahoo</a> Text</p>
<!-- /wp:paragraph -->
<!-- wp:paragraph -->
<p>This is another <a href="https://www.google.com/">Google</a> Text</p>
<!-- /wp:paragraph -->
<!-- wp:paragraph -->
<p>This is another lowercase <a href="https://bing.com/">bing</a> Text</p>
<!-- /wp:paragraph -->
<!-- wp:paragraph -->
<p>This is another multi word <a href="https://activecampaign.com/">Active Campaign</a> Text</p>
<!-- /wp:paragraph -->
This is the code I have so far. It does not work: it removes the entire sentence and replaces it with just the hyperlink.
import csv
from bs4 import BeautifulSoup

html_doc = """
<!-- wp:paragraph -->
<p>This is a existing link <a class="test" href="https://yahoo.com/">Yahoo</a> Text</p>
<!-- /wp:paragraph -->
<!-- wp:paragraph -->
<p>This is another Google Text</p>
<!-- /wp:paragraph -->
<!-- wp:paragraph -->
<p>This is another lowercase bing Text</p>
<!-- /wp:paragraph -->
<!-- wp:paragraph -->
<p>This is another multi word Active Campaign Text</p>
<!-- /wp:paragraph -->
"""
soup = BeautifulSoup(html_doc, 'html.parser')

# read the CSV file with anchor text and hyperlinks
with open('file.csv', 'r') as csv_file:
    reader = csv.reader(csv_file)
    hyperlinks = dict(reader)

# find all the text nodes in the HTML document
text_nodes = soup.find_all(text=True)

# iterate over the text nodes and replace the anchor text with hyperlinked text
for node in text_nodes:
    for anchor_text, hyperlink in hyperlinks.items():
        if anchor_text in node:
            # create a new tag with the hyperlink
            new_tag = soup.new_tag('a', href=hyperlink)
            new_tag.string = anchor_text
            # replace the original text node with the new one
            node.replace_with(new_tag)

# save the modified HTML to a new file
with open('index_hyperlinked.html', 'w') as outfile:
    outfile.write(str(soup))
print(soup)
1 Answer
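I did not specify a parser and just used
soup = BeautifulSoup(html_doc)
directly; it should not make any difference, but I thought I should mention it just in case. You should try iterating over the anchor/link pairs in the outer loop, and then breaking up the matching strings in the inner loop:
Printed output:
This approach works even when a single string contains multiple matches, as long as they do not overlap (as "Google Chrome" and "Chrome Beta" would).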
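A minimal sketch of the approach described above, with the anchor/link pairs in the outer loop and each matching string split up in the inner loop. This is not necessarily the code from the answer: the function name `link_anchor_texts`, the whole-word case-insensitive regex, and the choice to skip text that already sits inside an `<a>` tag are assumptions made for illustration. The `hyperlinks` argument is assumed to be a dict like the one the question builds with `dict(csv.reader(...))`.

```python
import re
from bs4 import BeautifulSoup, NavigableString


def link_anchor_texts(html, hyperlinks):
    """Wrap each anchor text found in the HTML in an <a> tag (hypothetical helper)."""
    soup = BeautifulSoup(html, "html.parser")
    # Outer loop: one pass over the document per anchor text / URL pair.
    for anchor_text, url in hyperlinks.items():
        # Assumption: whole-word, case-insensitive matching, so the CSV entry
        # "Bing" also matches "bing" in the document.
        pattern = re.compile(r"\b(" + re.escape(anchor_text) + r")\b", re.IGNORECASE)
        # Snapshot the text nodes, because the tree is modified inside the loop.
        for node in list(soup.find_all(string=True)):
            # Skip comments and other special strings, and text already in a link.
            if type(node) is not NavigableString or node.find_parent("a"):
                continue
            # With one capture group, re.split returns the matched text at the
            # odd indices and the surrounding text at the even indices.
            parts = pattern.split(str(node))
            if len(parts) == 1:
                continue  # no match in this text node
            for i, part in enumerate(parts):
                if i % 2:
                    link = soup.new_tag("a", href=url)
                    link.string = part  # keep the casing used in the document
                    node.insert_before(link)
                elif part:
                    node.insert_before(NavigableString(part))
            # The pieces are now in place, so drop the original text node.
            node.extract()
    return soup


if __name__ == "__main__":
    sample = "<p>This is another lowercase bing Text</p>"
    print(link_anchor_texts(sample, {"Bing": "https://bing.com"}))
```

Only the matched words are replaced, so the rest of each sentence is kept, which is the part the code in the question loses by calling `replace_with` on the whole text node. The URLs are written exactly as given in the CSV; the sketch does not add the trailing slash shown in the desired output.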