In Scrapy, how do I remove empty values from a list and join the list into a single string (like a paragraph)?

Posted by elcex8rz on 2022-11-09 in Other

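I am new to Scrapy and cannot find a proper solution. I am trying to get a clean paragraph, but instead I get a list that contains empty values such as "". How can I remove them in Scrapy using ItemLoader? I have tried my best.
Here is my code: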

import scrapy
from scrapy.loader import ItemLoader
from ..items import RcgroupsItem

class RcgroupSpider(scrapy.Spider):
    name = 'rcgroup'
    allowed_domains = ['rcgroups.com']
    start_urls = ['https://www.rcgroups.com/forums/showthread.php?2911378-DJI-Dashboard-Modding-tips-tricks-and-results-OFFICIAL-THREAD/page2']

    def parse(self, response):
        cards = response.xpath("//div[@id='posts']/div[@align='center']")
        for card in cards:

            loader = ItemLoader(item=RcgroupsItem(), selector=card)
            loader.add_xpath('number', ".//div[@class='thead_postbit_right']//a//text()")
            loader.add_xpath('date', (".//div[@class='thead_postbit_left']/span/text()[1]"))
            loader.add_xpath('name', ".//div[@class='postbit-name']/a/text()")
            loader.add_xpath('post', (".//div[@class='postbit-content']/text()"))
            loader.add_xpath('reply', (".//div[@class='postbit-content']/div//text()"))
            yield loader.load_item()

Here is my items.py:

import scrapy
from scrapy.loader.processors import TakeFirst, MapCompose, Join
from w3lib.html import remove_tags

def normalize_space(value):
    lst=  " ".join(value.split())
    return lst      

class RcgroupsItem(scrapy.Item):
    number = scrapy.Field(
        output_processor= TakeFirst()
    )
    date = scrapy.Field(
        input_processor = MapCompose(normalize_space),
        output_processor= TakeFirst()
    )
    name = scrapy.Field(
        output_processor= TakeFirst()
    )
    post = scrapy.Field(
        input_processor = MapCompose(normalize_space)
    )
    reply = scrapy.Field(
        input_processor = MapCompose(normalize_space)   
    )

Here is settings.py:

BOT_NAME = 'rcgroups'

SPIDER_MODULES = ['rcgroups.spiders']
NEWSPIDER_MODULE = 'rcgroups.spiders'

# Obey robots.txt rules

ROBOTSTXT_OBEY = True

FEED_EXPORT_ENCODING= 'utf-8' 

FEEDS = {
    'output': {
        'format': 'csv',
    }
}

The output I get for post is:

'post': ['',
          'Quad808,',
          '',
          'I think Mad genuinely be pilots decide on the '
          "wisdom of the CopterSafehe's on.",
          '',
          "He's a in all the DJI threads... expect him to be "
          'one here also.',
          '',
          'P.S. Drop me a PM....'],

How can I remove the empty values and turn the list into a proper string?

Answer #1 (41zrol4v)

Try:

post = scrapy.Field(
        # Pass str.strip itself (not str.strip()); MapCompose calls it on each value
        input_processor = MapCompose(str.strip)
    )
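
Note that stripping whitespace on its own still leaves the empty strings in the list. Below is a minimal sketch, in plain Python, of one way to also drop them and join what remains into a single paragraph. `join_nonempty` and the sample `fragments` list are hypothetical names used for illustration; they are not part of Scrapy.

```python
def normalize_space(value):
    """Collapse runs of whitespace into single spaces and trim both ends."""
    return " ".join(value.split())


def join_nonempty(values, separator=" "):
    """Normalize each fragment, drop the empty ones, and join the rest."""
    cleaned = (normalize_space(v) for v in values)
    return separator.join(v for v in cleaned if v)


# Hypothetical, shortened fragments shaped like the 'post' list in the question.
fragments = ["", "Quad808,", "", "  Drop me   a PM....\n", ""]
paragraph = join_nonempty(fragments)
print(paragraph)
```

Because an item loader processor can be any callable that receives the collected values, a function like this could probably be used directly as `output_processor=join_nonempty` on the `post` and `reply` fields. An alternative that stays with the built-in processors, assuming they behave as documented, is `MapCompose(normalize_space, lambda v: v or None)` as the input processor (MapCompose ignores `None` results) combined with `Join(' ')` as the output processor.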
