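I am scraping this website (https://www.bernama.com/en/crime_courts/), but the scraped URLs are missing the https://www.bernama.com/en/ prefix, so I only get news.php?id=2067755. My goal is to get the full URL, https://www.bernama.com/en/news.php?id=2067755. The missing prefix causes a ValueError in the request URL. Is there a way to prevent this?
Sample of the current output: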
['https://www.bernama.com/bm/index.php', 'https://www.bernama.com/man/index.php',
'https://www.bernama.com/ar/index.php', 'https://www.bernama.com/es/index.php',
'https://www.bernama.com/tam/index.php', 'https://www.bernama.com/en/general/news_covid-19.php?id=2067618',
'https://www.bernama.com/en/general/news_covid-19.php?id=2067541',
'https://www.bernama.com/en/general/news_covid-19.php?id=2067539',
'https://www.bernama.com/en/general/news_covid-19.php?id=2066748',
'https://www.bernama.com/en/general/news_covid-19.php?id=2066575', 'news.php?id=2067925',
'news.php?id=2067925', 'news.php?id=2067916', 'news.php?id=2067912', 'news.php?id=2067854',
'news.php?id=2067842', 'news.php?id=2067804', 'news.php?id=2067767', 'news.php?id=2067758',
'news.php?id=2067755', 'https://www.youtube.com/watch?v=772iUlQuuBg']
My code:
# Excerpt from the spider class (requires `import scrapy`)
start_urls = ["https://www.bernama.com/en/crime_courts/"]

def parse(self, response):
    # Get only the news content instead of video content
    sections = response.xpath('//div[@class="row"]/div[div[@class="row"]//span[contains(text(), "More news")]]')
    for news in sections[0].css('h6 a'):
        temp_title = news.css('::text').get()
        temp_link = response.urljoin(news.css('::attr(href)').get())
        # Request the link, not the title: a URL without a scheme raises ValueError
        request = scrapy.Request(temp_link,
                                 callback=self.parse_details,
                                 cb_kwargs=dict(title=temp_title))
        request.cb_kwargs['link'] = temp_link
        yield request

def parse_details(self, response, title, link):
    # Date and time are parsed from the last 'div.text-right' text node
    text_right = response.css('div.text-right::text').getall()
    item = NewsItem()
    item['title'] = title
    item['link'] = link
    item['date'] = text_right[-1].split(" ")[0]
    item['time'] = text_right[-1].split(" ")[1] + " " + text_right[-1].split(" ")[2]
    item['location'] = response.css('p::text').get().split(",")[0]
    yield item
1 Answer
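One possible approach is to resolve every href against an explicit base URL before building the request. This is a minimal sketch, not a definitive fix. It assumes the article links are relative to https://www.bernama.com/en/ (inferred from the target URL in the question); `BASE_URL` and `to_absolute` are hypothetical names introduced here. It uses the standard-library `urljoin`, which leaves already-absolute URLs untouched. Joining against the section page URL instead would place the link under /en/crime_courts/, which is not the URL the question asks for.

```python
from urllib.parse import urljoin

# Assumption: relative article links resolve against this base,
# taken from the target URL given in the question.
BASE_URL = "https://www.bernama.com/en/"


def to_absolute(href, base=BASE_URL):
    """Resolve a possibly relative href against base.

    Absolute URLs (for example the YouTube link) pass through unchanged.
    """
    return urljoin(base, href)


# Two hrefs copied from the sample output in the question
hrefs = [
    "news.php?id=2067755",
    "https://www.youtube.com/watch?v=772iUlQuuBg",
]
links = [to_absolute(h) for h in hrefs]
print(links)
```

In the spider, this would mean computing `temp_link = urljoin("https://www.bernama.com/en/", news.css('::attr(href)').get())` and passing `temp_link`, rather than the title, as the first argument to `scrapy.Request`.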