Python: trouble scraping an HTML page (SEC annual report) with Scrapy

Posted by w8f9ii69 on 2023-03-21 in Python

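Apologies in advance if I've made any obvious mistakes; I'm new to both Python and Scrapy.
I'm trying to scrape Apple's Form 10-K (link to form). Specifically, I only want one table, located in Part II, Item 5, titled "Purchases of Equity Securities by the Issuer and Affiliated Purchasers".
My code is: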

import scrapy

class AaplSpider(scrapy.Spider):
    name = "aapl"
    allowed_domains = ["www.sec.gov"]
    start_urls = ["https://www.sec.gov/ix?doc=/Archives/edgar/data/320193/000032019322000108/aapl-20220924.htm"]

    def parse(self, response):
        for row in response.xpath('(//table)[13]/tbody/tr'):
            yield {
            'Periods' : row.xpath('td[1]//text()').extract_first(),
            'Total Number of Shares Purchased': row.xpath('td[3]//text()').extract_first(),
            'Average Price Paid Per Share' : row.xpath('td[7]//text()').extract_first(),
            }

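However, my spider can't find this table, even though the same XPath expression, (//table)[13]/tbody, selects it correctly in Chrome. Is the table being blocked somehow, or am I doing something wrong?
Any help would be greatly appreciated, thanks.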

Answer #1, by tvz2xvvm

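Two things are keeping you from parsing the information you want.
1. The URL you are scraping does not contain the content you want. The content is rendered by JavaScript after the initial page load. The spinning loading icon that appears when you refresh the page is one giveaway; another is to choose "View page source" from the context menu, which shows the HTML actually served at that URL and makes it obvious.
2. Browsers add some HTML tags to the document when they build the DOM, so the markup you see in the developer tools inspect pane is not always what the server sent. A common case is a tbody element inserted around table rows that sit directly under table. In your example, the tbody element does not actually exist in the page's source HTML.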
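The second point can be checked without a browser. The following is a minimal sketch, assuming a tiny hypothetical table fragment (`RAW_TABLE` is made up for illustration and is not taken from the filing) and using the standard library's `xml.etree.ElementTree` as a stand-in for Scrapy's selector. Like Scrapy, it evaluates paths against the markup as written, so a path that goes through tbody matches nothing.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment: rows sit directly under <table>, as in the raw filing HTML.
RAW_TABLE = (
    "<table>"
    "<tr><td>Periods</td><td>Total Number</td></tr>"
    "<tr><td>June 26, 2022 to July 30, 2022:</td><td>41,690</td></tr>"
    "</table>"
)

table = ET.fromstring(RAW_TABLE)

# Path copied from the browser's inspect pane: goes through a tbody that is not in the source.
rows_via_tbody = table.findall("./tbody/tr")

# Path written against the markup as served: no tbody step.
rows_direct = table.findall("./tr")

print(len(rows_via_tbody), len(rows_direct))  # 0 2
```

The browser would show a tbody around these rows in its inspect pane, which is why an expression copied from there can work in Chrome and still return nothing in a spider.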
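The fix for the first problem is simply to use the URL of the content you actually need, which appears in the example below. The fix for the second is to drop the tbody step from your XPath expression.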
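For this filing, the direct URL is the value of the `doc` query parameter on the viewer URL, appended to the same host. A minimal sketch of that conversion follows. The helper name `direct_document_url` is hypothetical, and the assumption that other `/ix?doc=` links follow the same pattern is based only on the two URLs in this thread.

```python
from urllib.parse import parse_qs, urlsplit, urlunsplit


def direct_document_url(viewer_url: str) -> str:
    """Turn an /ix?doc=... viewer URL into the URL of the underlying document.

    URLs that do not look like viewer URLs are returned unchanged.
    """
    parts = urlsplit(viewer_url)
    doc_paths = parse_qs(parts.query).get("doc")
    if parts.path != "/ix" or not doc_paths:
        return viewer_url
    # Keep scheme and host, replace the path with the doc value, drop query and fragment.
    return urlunsplit((parts.scheme, parts.netloc, doc_paths[0], "", ""))


viewer = "https://www.sec.gov/ix?doc=/Archives/edgar/data/320193/000032019322000108/aapl-20220924.htm"
print(direct_document_url(viewer))
# https://www.sec.gov/Archives/edgar/data/320193/000032019322000108/aapl-20220924.htm
```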
For example:

import scrapy

class AaplSpider(scrapy.Spider):
    name = "aapl"
    allowed_domains = ["www.sec.gov"]
    # Direct URL of the filing, not the JavaScript-driven /ix?doc= viewer page.
    start_urls = ["https://www.sec.gov/Archives/edgar/data/320193/000032019322000108/aapl-20220924.htm"]

    def parse(self, response):
        # No tbody step: in the served HTML the rows sit directly under <table>.
        for row in response.xpath('(//table)[13]/tr'):
            # .get() returns the first matching text node, or None if there is none.
            yield {
                'Periods': row.xpath('./td[1]//text()').get(),
                'Total Number of Shares Purchased': row.xpath('./td[3]//text()').get(),
                'Average Price Paid Per Share': row.xpath('./td[7]//text()').get(),
            }

Output:

2023-03-20 23:45:19 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2023-03-20 23:45:19 [scrapy.core.engine] INFO: Spider opened
2023-03-20 23:45:19 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2023-03-20 23:45:19 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2023-03-20 23:45:19 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.sec.gov/Archives/edgar/data/320193/000032019322000108/aapl-20220924.htm> (referer: None)
2023-03-20 23:45:19 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.sec.gov/Archives/edgar/data/320193/000032019322000108/aapl-20220924.htm>
{'Periods': None, 'Total Number of Shares Purchased': None, 'Average Price Paid Per Share': None}
2023-03-20 23:45:19 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.sec.gov/Archives/edgar/data/320193/000032019322000108/aapl-20220924.htm>
{'Periods': 'Periods', 'Total Number of Shares Purchased': 'Total Number', 'Average Price Paid Per Share': 'Total Number of Shares'}
2023-03-20 23:45:19 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.sec.gov/Archives/edgar/data/320193/000032019322000108/aapl-20220924.htm>
{'Periods': 'June 26, 2022 to July 30, 2022:', 'Total Number of Shares Purchased': None, 'Average Price Paid Per Share': None}
2023-03-20 23:45:19 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.sec.gov/Archives/edgar/data/320193/000032019322000108/aapl-20220924.htm>
{'Periods': 'Open market and privately negotiated purchases', 'Total Number of Shares Purchased': '41,690\xa0', 'Average Price Paid Per Share': '145.91\xa0'}
2023-03-20 23:45:19 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.sec.gov/Archives/edgar/data/320193/000032019322000108/aapl-20220924.htm>
{'Periods': None, 'Total Number of Shares Purchased': None, 'Average Price Paid Per Share': None}
2023-03-20 23:45:19 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.sec.gov/Archives/edgar/data/320193/000032019322000108/aapl-20220924.htm>
{'Periods': 'July 31, 2022 to August 27, 2022:', 'Total Number of Shares Purchased': None, 'Average Price Paid Per Share': None}
2023-03-20 23:45:19 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.sec.gov/Archives/edgar/data/320193/000032019322000108/aapl-20220924.htm>
{'Periods': 'Open market and privately negotiated purchases', 'Total Number of Shares Purchased': '54,669\xa0', 'Average Price Paid Per Share': '168.29\xa0'}
2023-03-20 23:45:19 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.sec.gov/Archives/edgar/data/320193/000032019322000108/aapl-20220924.htm>
{'Periods': None, 'Total Number of Shares Purchased': None, 'Average Price Paid Per Share': None}
2023-03-20 23:45:19 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.sec.gov/Archives/edgar/data/320193/000032019322000108/aapl-20220924.htm>
{'Periods': 'August 28, 2022 to September 24, 2022:', 'Total Number of Shares Purchased': None, 'Average Price Paid Per Share': None}
2023-03-20 23:45:19 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.sec.gov/Archives/edgar/data/320193/000032019322000108/aapl-20220924.htm>
{'Periods': 'Open market and privately negotiated purchases', 'Total Number of Shares Purchased': '63,813\xa0', 'Average Price Paid Per Share': '155.59\xa0'}
2023-03-20 23:45:19 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.sec.gov/Archives/edgar/data/320193/000032019322000108/aapl-20220924.htm>
{'Periods': 'Total', 'Total Number of Shares Purchased': '160,172\xa0', 'Average Price Paid Per Share': None}
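The scraped items above still contain rows where every field is None and values that end in a non-breaking space (`\xa0`). The following is a minimal sketch of one way to tidy them up afterwards. The helper names `clean_cell` and `is_empty_row` are hypothetical, and the sample items are copied from the log above.

```python
from typing import Optional


def clean_cell(text: Optional[str]) -> Optional[str]:
    """Replace non-breaking spaces, trim whitespace, and map empty strings to None."""
    if text is None:
        return None
    cleaned = text.replace("\xa0", " ").strip()
    return cleaned or None


def is_empty_row(item: dict) -> bool:
    """True when every field of the item is missing or blank."""
    return all(clean_cell(value) is None for value in item.values())


# Two items copied from the log output above.
items = [
    {'Periods': None, 'Total Number of Shares Purchased': None, 'Average Price Paid Per Share': None},
    {'Periods': 'Open market and privately negotiated purchases',
     'Total Number of Shares Purchased': '41,690\xa0',
     'Average Price Paid Per Share': '145.91\xa0'},
]

# Drop the all-None spacer rows and clean the remaining values.
cleaned_items = [
    {key: clean_cell(value) for key, value in item.items()}
    for item in items
    if not is_empty_row(item)
]
print(cleaned_items)
```

The same two helpers could be called inside `parse` before yielding, or from an item pipeline; which one fits better depends on how the rest of the project is organised.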
