I am trying to crawl the page and print the hrefs, but I am not getting any response. Here is the spider:
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class Shoes2Spider(CrawlSpider):
    name = "shoes2"
    allowed_domains = ["stockx.com"]
    start_urls = ["https://www.stockx.com/sneakers"]

    rules = (
        Rule(
            LinkExtractor(restrict_xpaths="//div[@class='css-pnc6ci']/a"),
            callback="parse_item",
            follow=True,
        ),
    )

    def parse_item(self, response):
        print(response.url)
Here is also an example of the hrefs I am trying to extract:

[screenshot of the target links omitted]
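As a quick offline sanity check of what that XPath is supposed to match, here is a stdlib-only sketch (not Scrapy itself). The sample HTML is an assumption modeled on the class name in the question; the real StockX markup may differ, and its generated class names change often:

```python
# Stdlib-only sketch: collect hrefs of <a> tags inside <div class="css-pnc6ci">.
# SAMPLE_HTML is invented for illustration; it is not real StockX markup.
from html.parser import HTMLParser

SAMPLE_HTML = """
<div class="css-pnc6ci"><a href="/air-jordan-1-retro-high-og">Jordan 1</a></div>
<div class="css-pnc6ci"><a href="/adidas-yeezy-boost-350-v2">Yeezy 350</a></div>
<div class="other"><a href="/not-a-product">skip me</a></div>
"""


class ProductLinkParser(HTMLParser):
    """Collect hrefs of <a> tags inside <div class="css-pnc6ci"> (flat divs only)."""

    def __init__(self):
        super().__init__()
        self.in_target_div = False
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "div":
            self.in_target_div = attrs.get("class") == "css-pnc6ci"
        elif tag == "a" and self.in_target_div:
            href = attrs.get("href")
            if href:
                self.hrefs.append(href)

    def handle_endtag(self, tag):
        if tag == "div":
            self.in_target_div = False


parser = ProductLinkParser()
parser.feed(SAMPLE_HTML)
print(parser.hrefs)  # only the two hrefs inside matching divs
```

If the selector logic is right, only the links inside the matching divs survive; the third link is skipped.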
When I run the spider, I expect to see 40 hrefs, but I get nothing. What am I doing wrong?
Here are also the commands I used to create the project in the terminal:
scrapy startproject stockx
cd stockx
scrapy genspider -t crawl shoes2 www.stockx.com/sneakers
2 Answers

Answer 1:
So I just realized that start_urls was set to ["https://www.stockx.com"]. I changed it to ["https://www.stockx.com/sneakers"], and that seems to have fixed the problem.

Answer 2:
This will work. I have updated the selector to use CSS:

restrict_css='[data-testid="RouterSwitcherLink"]'
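Plugged into the spider from the question, the updated rule would look something like the configuration fragment below (a sketch: `RouterSwitcherLink` is the `data-testid` value this answer reports, and StockX's markup may have changed since). Matching on a `data-testid` attribute is generally more robust than matching a generated class name like `css-pnc6ci`, which sites rotate between builds:

```python
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import Rule

# Select links by the data-testid attribute rather than a generated CSS class.
rules = (
    Rule(
        LinkExtractor(restrict_css='[data-testid="RouterSwitcherLink"]'),
        callback="parse_item",
        follow=True,
    ),
)
```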