Scrapy: how do I add a random user agent to a spider that is called from a script?

ttp71kqs posted on 2023-01-13 in Other

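I want to add a random user agent to every request made by a spider that is called from another script. My implementation is as follows:
CoreSpider.py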

import glob
import os
import re

import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import Rule
# Assumption: Extractor comes from the python-boilerpipe package.
from boilerpipe.extract import Extractor

import ContentHandler_copy


class CoreSpider(scrapy.Spider):
    name = "final"

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.start_urls = self.read_url()
        # Note: rules are only applied by CrawlSpider; a plain scrapy.Spider ignores them.
        self.rules = (
            Rule(
                LinkExtractor(
                    unique=True,
                ),
                callback='parse',
                follow=True
            ),
        )

    def read_url(self):
        # Collect seed URLs from every *.list file in the seed directory.
        urlList = []
        for filename in glob.glob(os.path.join("/root/Public/company_profiler/seed_list", '*.list')):
            with open(filename, "r") as f:
                for line in f.readlines():
                    url = re.sub('\n', '', line)
                    # Prepend a scheme when the seed entry has none.
                    if "http" not in url:
                        url = "http://" + url
                    # print(url)
                    urlList.append(url)

        return urlList

    def parse(self, response):
        print("URL is: ", response.url)
        print("User agent is : ", response.request.headers['User-Agent'])
        filename = '/root/Public/company_profiler/crawled_page/%s.html' % response.url
        article = Extractor(extractor='LargestContentExtractor', html=response.body).getText()
        print("Article is :", article)
        # Pages with fewer than 5 lines of extracted text are skipped.
        if len(article.split("\n")) < 5:
            print("Skipping to next url : ", article.split("\n"))
        else:
            print("Continue parsing: ", article.split("\n"))
            ContentHandler_copy.ContentHandler_copy.start(article, response.url)

I run this spider from the following script, RunSpider.py, taken from www.example.com:

from scrapy.crawler import CrawlerProcess

from CoreSpider import CoreSpider

process = CrawlerProcess()
# Pass the spider class, not an instance; recent Scrapy versions reject spider objects.
process.crawl(CoreSpider)
process.start()

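This works fine. Now I want to use a different, randomly chosen user agent for each request. I have used random user agents successfully in a Scrapy project, but I cannot get them to work with this spider when it is called from another script.
My settings.py: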

BOT_NAME = 'tutorial'

SPIDER_MODULES = ['tutorial.spiders']
NEWSPIDER_MODULE = 'tutorial.spiders'

# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'tutorial (+http://www.yourdomain.com)'

DOWNLOADER_MIDDLEWARES = {
    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
    'random_useragent.RandomUserAgentMiddleware': 320,
}

USER_AGENT_LIST = "tutorial/user-agent.txt"

How can I make my CoreSpider.py use this settings.py configuration programmatically?

Answer #1 (vpfxa7rd)

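See the documentation, in particular Common Practices. You can pass the settings as an argument to the CrawlerProcess constructor. Alternatively, if you use a Scrapy project and want to take the settings from settings.py, you can do this: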

...
from scrapy.utils.project import get_project_settings    
process = CrawlerProcess(get_project_settings())
...
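
For the first option (passing settings straight to the constructor), here is a minimal sketch that needs no Scrapy project at all. Everything named here is an assumption for illustration: the `USER_AGENTS` pool and setting name, the hand-written `RandomUserAgentMiddleware` (a stand-in, not the `random_useragent` package from the question), the `build_settings` and `run` helpers, and the hypothetical module name `random_ua` that the file is assumed to be saved under.

```python
import random

# Hypothetical user-agent pool for illustration; replace with real strings
# or load them from a file such as tutorial/user-agent.txt.
USER_AGENTS = [
    "Mozilla/5.0 (X11; Linux x86_64) ExampleBrowser/1.0",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) ExampleBrowser/2.0",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) ExampleBrowser/3.0",
]


class RandomUserAgentMiddleware:
    """Downloader middleware sketch: set a random User-Agent on each request."""

    def __init__(self, user_agents):
        self.user_agents = list(user_agents)

    @classmethod
    def from_crawler(cls, crawler):
        # Scrapy calls this hook to build the middleware from the crawler settings.
        return cls(crawler.settings.getlist("USER_AGENTS", USER_AGENTS))

    def process_request(self, request, spider):
        # Overwrite the header; returning None lets Scrapy continue processing.
        request.headers["User-Agent"] = random.choice(self.user_agents)
        return None


def build_settings(middleware_path):
    """Return a settings dict suitable for CrawlerProcess(settings=...)."""
    return {
        "DOWNLOADER_MIDDLEWARES": {
            # Mirrors the question's settings.py: built-in middleware disabled.
            "scrapy.downloadermiddlewares.useragent.UserAgentMiddleware": None,
            middleware_path: 320,
        },
        "USER_AGENTS": USER_AGENTS,
    }


def run(spider_cls, middleware_path="random_ua.RandomUserAgentMiddleware"):
    """Start a crawl; Scrapy is imported lazily so this module loads without it."""
    from scrapy.crawler import CrawlerProcess

    process = CrawlerProcess(settings=build_settings(middleware_path))
    process.crawl(spider_cls)
    process.start()
```

With this saved as `random_ua.py` next to RunSpider.py, the script would call `run(CoreSpider)` instead of building the `CrawlerProcess` itself. If you would rather keep the `random_useragent` package, put its path and `USER_AGENT_LIST` into the dict in the same way.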
