Can't obtain all links from bookmark by Scrapy Selector

6rvt4ljy posted on 2022-11-09 in Other


I exported my bookmarks from Chrome and want to use Scrapy's Selector to get all the links, but I only get some of them (250 out of 650).
Here is my code:

from scrapy.selector import Selector

html = r'C:\Users\super\Downloads\Desktop\temp\html\bookmarks_9_13_22.html'
xpath = r'//@href'
with open(html, 'rb') as f:
    source = f.read()
target = Selector(text=source).xpath(xpath).getall()
print(len(target))

Am I doing something wrong? I'm new to Scrapy and XPath.
Here is the bookmark (HTML file).

scrapy

Source: https://stackoverflow.com/questions/73703551/cant-obtain-all-links-from-bookmark-by-scrapy-selector

1 Answer


z3yyvxxp1#

It seems that the unclosed <DT> tags in the file are causing this problem.
Solution: remove the <DT> tags.

from scrapy.selector import Selector
html = open(r'C:\Users\super\Downloads\Desktop\temp\html\bookmarks_9_13_22.html', 'rb').read()
root = Selector(text=html.replace(b'<DT>',b''))
final = root.xpath("//@href").getall()
print(len(final))  # <--- 650
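
As a cross-check that does not depend on how the underlying HTML parser repairs the unclosed <DT> tags, the href attributes can also be counted with Python's standard-library html.parser, which reports start tags as it meets them without building a tree. This is only a minimal sketch: the class name HrefCollector, the helper collect_hrefs and the sample markup are hypothetical and not part of the original question or answer.

```python
from html.parser import HTMLParser


class HrefCollector(HTMLParser):
    """Collect every href attribute value, regardless of tag nesting."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; the parser lowercases names.
        for name, value in attrs:
            if name == "href" and value is not None:
                self.hrefs.append(value)


def collect_hrefs(markup):
    """Return all href values found in the given HTML string."""
    parser = HrefCollector()
    parser.feed(markup)
    parser.close()
    return parser.hrefs


# Hypothetical sample in the style of a browser bookmark export,
# with <DT> tags that are never closed.
sample = """
<DL><p>
    <DT><H3>Folder</H3>
    <DL><p>
        <DT><A HREF="https://example.com/a">A</A>
        <DT><A HREF="https://example.com/b">B</A>
    </DL><p>
    <DT><A HREF="https://example.com/c">C</A>
</DL><p>
"""

links = collect_hrefs(sample)
print(len(links))
```

For the real file, the decoded contents of the export would be passed to collect_hrefs in place of sample, and the resulting count compared with what the Selector returns.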

Answered 2022-11-09
