Scrapy ValueError: Missing scheme in request URL

Posted by 4jb9z9bj on 2022-11-09 in Other

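I am scraping this website (https://www.bernama.com/en/crime_courts/), but the scraped URLs are missing the https://www.bernama.com/en/ prefix, so I only get news.php?id=2067755. My goal is to get the full URL, https://www.bernama.com/en/news.php?id=2067755. As it stands, this raises a ValueError: Missing scheme in request URL. Is there a way to prevent this?
Example of the current output: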

['https://www.bernama.com/bm/index.php', 'https://www.bernama.com/man/index.php',
 'https://www.bernama.com/ar/index.php', 'https://www.bernama.com/es/index.php',
 'https://www.bernama.com/tam/index.php', 'https://www.bernama.com/en/general/news_covid-19.php?id=2067618',
 'https://www.bernama.com/en/general/news_covid-19.php?id=2067541',
 'https://www.bernama.com/en/general/news_covid-19.php?id=2067539',
 'https://www.bernama.com/en/general/news_covid-19.php?id=2066748',
 'https://www.bernama.com/en/general/news_covid-19.php?id=2066575', 'news.php?id=2067925',
 'news.php?id=2067925', 'news.php?id=2067916', 'news.php?id=2067912', 'news.php?id=2067854',
 'news.php?id=2067842', 'news.php?id=2067804', 'news.php?id=2067767', 'news.php?id=2067758',
 'news.php?id=2067755', 'https://www.youtube.com/watch?v=772iUlQuuBg']
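
For reference, the list above mixes absolute URLs with relative ones such as news.php?id=2067755. A relative href only becomes a full URL once it is resolved against a base, and the result depends on which base is used. The sketch below uses the standard library's `urljoin` to show this; the two base values are assumptions chosen for illustration, and which base the live page actually declares (for example through a redirect or a `<base>` element, which Scrapy's `response.urljoin` takes into account) is not stated in the question.

```python
from urllib.parse import urljoin

# Relative href as it appears in the scraped output.
href = "news.php?id=2067755"

# Two candidate bases (assumptions for illustration only).
listing_base = "https://www.bernama.com/en/crime_courts/"
section_base = "https://www.bernama.com/en/"

# Resolving against the listing page keeps the crime_courts/ directory.
from_listing = urljoin(listing_base, href)
# -> https://www.bernama.com/en/crime_courts/news.php?id=2067755

# Resolving against the section root gives the URL the question asks for.
from_section = urljoin(section_base, href)
# -> https://www.bernama.com/en/news.php?id=2067755

# An href that is already absolute is returned unchanged.
absolute = urljoin(listing_base, "https://www.youtube.com/watch?v=772iUlQuuBg")

print(from_listing)
print(from_section)
print(absolute)
```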

My code:

start_urls = ["https://www.bernama.com/en/crime_courts/"]

def parse(self, response):
    # Get only the news content instead of video content
    sections = response.xpath('//div[@class="row"]/div[div[@class="row"]//span[contains(text(), "More news")]]')    

    for news in sections[0].css('h6 a'):  
        temp_title = news.css('::text').get()
        temp_link = response.urljoin(news.css('::attr(href)').get())

        request = scrapy.Request(temp_title, 
                                callback = self.parse_details, 
                                cb_kwargs = dict(title = temp_title))
        request.cb_kwargs['link'] = temp_link

        yield request

def parse_details(self, response, title, link):
    text_right = response.css('div.text-right::text').getall()

    item = NewsItem()
    item['title'] = title
    item['link'] = link
    item['date'] = text_right[-1].split(" ")[0]
    item['time'] = text_right[-1].split(" ")[1] + " " + text_right[-1].split(" ")[2]
    item['location'] = response.css('p::text').get().split(",")[0]

    yield item

Answer #1 (xu3bshqb)

allowed_domains = ["www.bernama.com"]
start_urls = ["https://www.bernama.com/en/crime_courts/"]

def parse(self, response, **kwargs):
    # Get only the news content instead of video content
    sections = response.xpath('//div[@class="row"]/div[div[@class="row"]//span[contains(text(), "More news")]]')

    for news in sections.css('h6 a'):
        temp_title = news.css('::text').get()
        temp_link = response.urljoin(news.css('::attr(href)').get())

        request = response.follow(url=temp_link,
                                  callback=self.parse_details, cb_kwargs=dict(title=temp_title, link=temp_link))

        yield request

def parse_details(self, response, title, link):
    text_right = response.css('div.text-right::text').getall()

    item = NewsItem()
    item['title'] = title
    item['link'] = link
    item['date'] = text_right[-1].split(" ")[0]
    item['time'] = text_right[-1].split(" ")[1] + " " + text_right[-1].split(" ")[2]
    item['location'] = response.css('p::text').get().split(",")[0]

    yield item
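
Compared with the question's code, this version passes temp_link rather than temp_title as the request URL. The question's `scrapy.Request(temp_title, ...)` hands Scrapy a headline, and a headline carries no URL scheme, which is consistent with the reported error. Below is a minimal, stdlib-only sketch of that distinction; `has_scheme` and the sample headline are hypothetical names and values made up for illustration, and Scrapy's own validation may differ in detail.

```python
from urllib.parse import urljoin, urlparse


def has_scheme(url: str) -> bool:
    """Return True if the string parses with a URL scheme such as https."""
    return bool(urlparse(url).scheme)


# Hypothetical values standing in for what parse() extracts from one <a> tag.
page_url = "https://www.bernama.com/en/crime_courts/"
temp_title = "Example headline text"  # hypothetical headline, not from the site
href = "news.php?id=2067755"          # relative href from the question

# Resolve the relative href against the page URL before requesting it.
temp_link = urljoin(page_url, href)

print(has_scheme(temp_title))  # False: a headline is not a URL
print(has_scheme(href))        # False: a relative href has no scheme
print(has_scheme(temp_link))   # True: the joined URL starts with https
```

`response.follow` also accepts relative URLs, so the explicit `response.urljoin` call mainly serves to store the absolute link on the item.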
