Scrapy is missing one record

Asked by kxkpmulp on 2022-11-09

I'm new to Scrapy and have been trying to scrape data from https://www.citypopulation.de/en/southkorea/busan/admin/, but one record is missing from the table. The spider picks up all the other records without a problem, e.g.:

<tbody class="admin1">
<tr class="rname" itemscope="" itemtype="http://schema.org/AdministrativeArea" onclick="javascript:sym('21080')"><td class="rname" id="i21080" data-wiki="Buk District, Busan" data-wd="Q50394" data-area="39.726" data-density="7052.7362"><a href="javascript:sym('21080')"><span itemprop="name">Buk-gu</span></a> [<span itemprop="name">North Distrikt</span>]</td><td class="rstatus">City District</td><td class="rnative"><span itemprop="name">북구</span></td><td class="rpop prio4">329,336</td><td class="rpop prio3">302,141</td><td class="rpop prio2">299,182</td><td class="rpop prio1">280,177</td><td class="sc"><a itemprop="url" href="/en/southkorea/busan/admin/21080__buk_gu/">→</a></td></tr>
</tbody>

<td class="sc">下没有链接时的缺行,例:

<tbody class="admin0">
<tr><td class="rname">Busan</td><td class="rstatus">Metropolitan City</td><td class="rnative"><span itemprop="name">부산광역시</span></td><td class="rpop prio4">3,523,582</td><td class="rpop prio3">3,414,950</td><td class="rpop prio2">3,448,737</td><td class="rpop prio1">3,349,016</td><td class="sc"></td></tr>
</tbody>

Code:

import scrapy

class WebsiteItem(scrapy.Item):
    item_name = scrapy.Field()
    item_status = scrapy.Field()

class WebsiteSpider(scrapy.spiders.CrawlSpider):
    name = "posts"
    start_urls = ["https://www.citypopulation.de/en/southkorea/"]

    rules = (
        scrapy.spiders.Rule(scrapy.linkextractors.LinkExtractor(restrict_css="div#prov_div > ul > li > a"), follow=True),
        scrapy.spiders.Rule(scrapy.linkextractors.LinkExtractor(restrict_css="table#tl > tbody > tr > td"), callback="parse")
    )

    def parse(self, response):
        website_item = WebsiteItem()

        website_item['item_name'] = response.css("td.rname span::text").get()
        website_item['item_status'] = response.css("td.rstatus::text").get()

        return website_item

I assume this is because the rules force the crawl to follow links, but I don't know how to fix that while still looping through every record in the table:

rules = (
    scrapy.spiders.Rule(scrapy.linkextractors.LinkExtractor(restrict_css="div#prov_div > ul > li > a"), follow=True),
    scrapy.spiders.Rule(scrapy.linkextractors.LinkExtractor(restrict_css="table#tl > tbody > tr > td"), callback="parse")
)

I'd appreciate it if someone could point out what I'm missing here.


pgvzfuti1#

Here is one way to get those name/status pairs:

import scrapy
import pandas as pd

class SkSpider(scrapy.Spider):
    name = 'sk'
    allowed_domains = ['citypopulation.de']
    start_urls = ["https://www.citypopulation.de/en/southkorea/busan/admin/"]

    def parse(self, response):
        # Parse the first HTML table on the page with pandas; read_html
        # keeps every row, even ones with no link in the <td class="sc"> cell.
        df = pd.read_html(response.text)[0]
        for i, row in df.iterrows():
            yield {
                'name': row['Name'],
                'status': row['Status']
            }

Run it with scrapy crawl sk -o sk_areas.json and it will generate a JSON file with the following structure:

[
{"name": "Buk-gu [North Distrikt]", "status": "City District"},
{"name": "Deokcheon 1-dong", "status": "Quarter"},
{"name": "Deokcheon 2-dong", "status": "Quarter"},
{"name": "Deokcheon 3-dong", "status": "Quarter"},
{"name": "Geumgok-dong", "status": "Quarter"},
{"name": "Gupo 1-dong", "status": "Quarter"},
{"name": "Gupo 2-dong", "status": "Quarter"},
{"name": "Gupo 3-dong", "status": "Quarter"},
[...]
{"name": "Yeonsan 6-dong", "status": "Quarter"},
{"name": "Yeonsan 8-dong", "status": "Quarter"},
{"name": "Yeonsan 9-dong", "status": "Quarter"},
{"name": "Busan", "status": "Metropolitan City"}
]

As you can see, this also includes Busan itself.
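
If you'd rather not depend on pandas, here's a minimal Scrapy-only sketch along the same lines. The selectors are inferred from the HTML fragments quoted in the question (the tbody classes admin0/admin1 and the rname/rstatus cells), so they may need adjusting against the live page; the spider name sk_rows is made up for this example:

import scrapy

class SkRowsSpider(scrapy.Spider):
    name = 'sk_rows'
    allowed_domains = ['citypopulation.de']
    start_urls = ["https://www.citypopulation.de/en/southkorea/busan/admin/"]

    def parse(self, response):
        # Walk every row in the data tbodies (class "admin0", "admin1", ...).
        # Unlike a LinkExtractor rule, this does not require the row to
        # contain a link, so the summary "Busan" row is picked up as well.
        for row in response.css('tbody[class^="admin"] tr'):
            name_parts = row.css('td.rname ::text').getall()
            status = row.css('td.rstatus::text').get()
            if name_parts and status:
                yield {
                    'name': ''.join(name_parts).strip(),
                    'status': status.strip(),
                }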
