Is it possible to scrape HTML tables using scrapy shell based on text criteria?

Posted by koaltpgm on 2022-11-09 in Shell


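I am doing some web scraping and am trying to use Scrapy to scrape the Queensland Lobbyist Registers and the links in the main register. Each lobbyist has a link that leads to their client list (for example, Antinomies and Australian Public Affairs). However, these nested tables are not consistent from page to page. For Antinomies, the XPath of the clients table is //*[@id="main"]/table[7] and the clients start at row 20, whereas for APF it is //*[@id="main"]/table[6] and the clients start at row 24. What the two pages have in common is that both client sub-tables sit below this row:

"Client/s on whose behalf lobbying activity is, or may be, conducted"

Is there a way to make Scrapy read only the rows that come after a specific row on each page?
This is what I have been using:
tableclients = response.xpath('//*[@id="main"]/table[7]//tbody')
rowclients = tableclients.xpath('//tr')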

scrapy

Source: https://stackoverflow.com/questions/73679831/is-it-possible-to-scrape-html-tables-using-scrapy-shell-based-on-text-criteria

2 Answers


hof1towb1#

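Yes, it is possible to scrape HTML tables with Scrapy based on text criteria. The criterion here is most likely: Client/s on whose behalf lobbying activity is, or may be, conducted. Use the XPath contains() function to select the h2 tag by its text node value, then take its preceding-sibling tables and pick table number 7, which holds the data you need.

Working code example: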

from scrapy.crawler import CrawlerProcess

import scrapy

class TestSpider(scrapy.Spider):
    name = 'test'
    custom_settings = {
        #'CONCURRENT_REQUESTS_PER_DOMAIN': 1,
        #'DOWNLOAD_DELAY': 1,
        'USER_AGENT': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36'}

    def start_requests(self):

        yield scrapy.Request(
            url='https://www.parliament.act.gov.au/function/tru/act-register-of-lobbyists/act-register-of-lobbyists-full-list',
            callback=self.parse,
            #dont_filter=True
            )

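    def parse(self, response):
        # Collect the link to each lobbyist's detail page from the third column.
        for href in response.xpath('//*[@id="table01546"]/tbody//tr/td[3]/a/@href').getall():
            # response.follow also resolves relative hrefs against the page URL.
            yield response.follow(href, callback=self.parse_client_data)

    def parse_client_data(self, response):
        # Anchor on the element whose text contains "Returns", then take the
        # 7th table (in document order) among its preceding siblings.
        for tr in response.xpath('(//*[contains(text(),"Returns")]/preceding-sibling::table)[7]/tbody//tr'):
            # The first cell is the field label, the second cell is its value.
            td1 = ''.join(tr.xpath('.//td[1]//text()').getall()).replace(':', '').strip().replace('\xa0', '')
            td2 = tr.xpath('.//td[2]//text()')
            td2 = ''.join(td2.getall()).strip().replace('\xa0', ' ') if td2 else None
            yield {td1: td2}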

if __name__ == "__main__":
    process = CrawlerProcess()
    process.crawl(TestSpider)
    process.start()

Output:

{'Email Address': 'tara@cmaxadvisory.com.au'}
2022-09-11 22:02:25 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.parliament.act.gov.au/function/tru/act-register-of-lobbyists/act-register-of-lobbyists4/CMAX-Communications>
{'ACN/ ABN': '73 130 740 546'}
2022-09-11 22:02:25 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.parliament.act.gov.au/function/tru/act-register-of-lobbyists/act-register-of-lobbyists4/CMAX-Communications>
{'Trading Name': 'CMAX Advisory'}
2022-09-11 22:02:25 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.parliament.act.gov.au/function/tru/act-register-of-lobbyists/act-register-of-lobbyists4/counsel-house-pty-ltd> (referer: https://www.parliament.act.gov.au/function/tru/act-register-of-lobbyists/act-register-of-lobbyists-full-list)
2022-09-11 22:02:26 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.parliament.act.gov.au/function/tru/act-register-of-lobbyists/act-register-of-lobbyists4/counsel-house-pty-ltd>
{'Company Name': 'Counsel House Pty Ltd'}
2022-09-11 22:02:26 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.parliament.act.gov.au/function/tru/act-register-of-lobbyists/act-register-of-lobbyists4/counsel-house-pty-ltd>
{'CompanyAddress': 'Level 14, 333 Collins Street, Melbourne VIC 3000'}
2022-09-11 22:02:26 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.parliament.act.gov.au/function/tru/act-register-of-lobbyists/act-register-of-lobbyists4/counsel-house-pty-ltd>
{'Phone Number': '03 8639 5890'}
2022-09-11 22:02:26 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.parliament.act.gov.au/function/tru/act-register-of-lobbyists/act-register-of-lobbyists4/counsel-house-pty-ltd>
{'Email Address': 'info@counselhouse.com.au'}
2022-09-11 22:02:26 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.parliament.act.gov.au/function/tru/act-register-of-lobbyists/act-register-of-lobbyists4/counsel-house-pty-ltd>
{'ACN/ ABN': '35 631 919 009'}
2022-09-11 22:02:26 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.parliament.act.gov.au/function/tru/act-register-of-lobbyists/act-register-of-lobbyists4/counsel-house-pty-ltd>
{'Trading Name': 'Counsel House Pty Ltd'}
2022-09-11 22:02:28 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.parliament.act.gov.au/function/tru/act-register-of-lobbyists/act-register-of-lobbyists4/Australian-Society-of-Ophthalmologists> (referer: https://www.parliament.act.gov.au/function/tru/act-register-of-lobbyists/act-register-of-lobbyists-full-list)
2022-09-11 22:02:28 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.parliament.act.gov.au/function/tru/act-register-of-lobbyists/act-register-of-lobbyists4/Australian-Society-of-Ophthalmologists>
{'Company Name': 'Australian Society of Ophthalmologists'}
2022-09-11 22:02:28 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.parliament.act.gov.au/function/tru/act-register-of-lobbyists/act-register-of-lobbyists4/Australian-Society-of-Ophthalmologists>
{'CompanyAddress': '6/183 Wickham Terrace, Brisbane QLD 4000'}
2022-09-11 22:02:28 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.parliament.act.gov.au/function/tru/act-register-of-lobbyists/act-register-of-lobbyists4/Australian-Society-of-Ophthalmologists>
{'Phone Number': '07 383103006'}
2022-09-11 22:02:28 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.parliament.act.gov.au/function/tru/act-register-of-lobbyists/act-register-of-lobbyists4/Australian-Society-of-Ophthalmologists>
{'Email Address': 'info@asoeye.org'}
2022-09-11 22:02:28 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.parliament.act.gov.au/function/tru/act-register-of-lobbyists/act-register-of-lobbyists4/Australian-Society-of-Ophthalmologists>
{'ACN/ ABN': '29 454 001 424'}
2022-09-11 22:02:28 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.parliament.act.gov.au/function/tru/act-register-of-lobbyists/act-register-of-lobbyists4/Australian-Society-of-Ophthalmologists>
{'Trading Name': 'Australian Society of Ophthalmologists'}

 'downloader/response_status_count/200': 52,

 'item_scraped_count': 255,

... and so on


hrysbysz2#

Try the following:
//h3[contains(text(), 'Your text')]/following-sibling::div[1]/text()
