Scrapy: how to use a dataframe of URLs as the source for start_urls

slsn1g29 · posted 2022-11-09 in: Other

I have a scraper that works fine using a CSV file as the source of its start URLs. I have two other scripts that retrieve the target names and then look up the API UUIDs, so that all of the URLs I collect end up in a pandas dataframe. At the moment I write the cleaned URLs from that dataframe out to a CSV file, import that CSV into my Scrapy script to scrape the data, and then write another CSV with the results.
I would like to hook up the dataframe that already contains the URLs instead of creating a CSV and reading it back into my script.

My dataframe

Data_List  = [list of URLS]

df_api_data = pd.DataFrame(Data_List)
api_file_name = 'data_apis_' + tm + '.csv'
path = r'1_wiki_apis/'
df_api_data.to_csv(path + api_file_name, header=None, index=False)

My Scrapy script

class DataCrawlerSpider(scrapy.Spider):
    name = 'data_crawler'

    # Import clean list of api urls from csv
    filepath = 'api_urls/api_urls_' + tm + '.csv'

    custom_settings = {
        'CONCURRENT_REQUESTS': 32,
        'FEEDS': {'final_data/raw_data_' + tm + '.csv': {'format': 'csv'}},
    }

    # Open csv file and iterate through each row
    with open(filepath) as file:
        start_urls = [line.strip() for line in file]
        POI_URLS = start_urls

    # Callback to handle query list
    def start_request(self):
        request = Request(url=self.start_urls, callback=self.parse)

    # yield request and parse scraped data
    def parse(self, response):
        # Xpath variables
        POI_DOD = response.xpath(get_dod).get()
        POI_GENDER = response.xpath(get_gender).get()
        # Get the wikidata response
        yield {
            'DISPLAY_CODE': '(NULL)',
            '_DOD': _DOD,
            '_GENDER': _GENDER,

            # NEW DATA COLUMNS
            '_INDUSTRY': _INDUSTRY,
            '_EDUCATION': _EDUCATION,
            '_EXCERPT': _EXCERPT,
        }

    # pass

# Execute script to crawl Wiki database and get POI Data
process = CrawlerProcess()
process.crawl(DataCrawlerSpider)
process.start()

How can I use my df_api_data dataframe of full API URLs in start_urls[]?


wko9yo5t1#

Just feed Data_List directly into the spider's start_requests method and yield a request for each item in the list.
If you have to work from the dataframe, all you need to do is select the column that holds the URLs as a series and call series.to_list() on it, then do the same thing.

from scrapy import Request
import scrapy
from scrapy.crawler import CrawlerProcess

Data_List  = [list of URLS]

# or with a dataframe:

Data_List = dataframe["column with the urls"].to_list()

class DataCrawlerSpider(scrapy.Spider):
    name = 'data_crawler'

    custom_settings = {
        'CONCURRENT_REQUESTS': 32,
        # 'tm' is the timestamp string defined earlier in the asker's script
        'FEEDS': {'final_data/raw_data_' + tm + '.csv': {'format': 'csv'}},
    }

    def start_requests(self):
        for url in Data_List:
            yield Request(url)

    def parse(self, response):
        POI_DOD = response.xpath(get_dod).get()
        POI_GENDER = response.xpath(get_gender).get()
        yield {
            'DISPLAY_CODE': '(NULL)',
            '_DOD': _DOD,
            '_GENDER': _GENDER,
            '_INDUSTRY': _INDUSTRY,
            '_EDUCATION': _EDUCATION,
            '_EXCERPT': _EXCERPT,
        }

process = CrawlerProcess()
process.crawl(DataCrawlerSpider)
process.start()
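
If you prefer not to rely on a module-level Data_List, you can also pass the URL list into the spider when you schedule the crawl: CrawlerProcess.crawl() forwards extra keyword arguments to the spider's constructor, and scrapy.Spider stores them as attributes. Below is a minimal, self-contained sketch of that approach; the 'api_url' column name, the example URL and the trimmed-down parse are placeholders, not the asker's real fields.

import pandas as pd
import scrapy
from scrapy import Request
from scrapy.crawler import CrawlerProcess

class DataCrawlerSpider(scrapy.Spider):
    name = 'data_crawler'

    def start_requests(self):
        # self.start_urls is populated from the keyword argument passed to crawl()
        for url in self.start_urls:
            yield Request(url, callback=self.parse)

    def parse(self, response):
        # placeholder item; keep the real xpath extraction here
        yield {'url': response.url}

# Build the URL list straight from the dataframe, no intermediate csv
df_api_data = pd.DataFrame({'api_url': ['https://example.org/api/1']})  # placeholder data
url_list = df_api_data['api_url'].to_list()

process = CrawlerProcess()
process.crawl(DataCrawlerSpider, start_urls=url_list)
process.start()

Scrapy's default start_requests already iterates self.start_urls, so the override above is only there to make the flow explicit; the point is that the dataframe column feeds the spider directly via the crawl() call.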
