I'm new to Scrapy and have been trying to scrape data from https://www.citypopulation.de/en/southkorea/busan/admin/, but one record from the table is missing.
The remaining records are picked up without any problem, for example:
<tbody class="admin1">
<tr class="rname" itemscope="" itemtype="http://schema.org/AdministrativeArea" onclick="javascript:sym('21080')"><td class="rname" id="i21080" data-wiki="Buk District, Busan" data-wd="Q50394" data-area="39.726" data-density="7052.7362"><a href="javascript:sym('21080')"><span itemprop="name">Buk-gu</span></a> [<span itemprop="name">North Distrikt</span>]</td><td class="rstatus">City District</td><td class="rnative"><span itemprop="name">북구</span></td><td class="rpop prio4">329,336</td><td class="rpop prio3">302,141</td><td class="rpop prio2">299,182</td><td class="rpop prio1">280,177</td><td class="sc"><a itemprop="url" href="/en/southkorea/busan/admin/21080__buk_gu/">→</a></td></tr>
</tbody>
The missing row is the one where there is no link inside <td class="sc">, for example:
<tbody class="admin0">
<tr><td class="rname">Busan</td><td class="rstatus">Metropolitan City</td><td class="rnative"><span itemprop="name">부산광역시</span></td><td class="rpop prio4">3,523,582</td><td class="rpop prio3">3,414,950</td><td class="rpop prio2">3,448,737</td><td class="rpop prio1">3,349,016</td><td class="sc"></td></tr>
</tbody>
The code:
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class WebsiteItem(scrapy.Item):
    item_name = scrapy.Field()
    item_status = scrapy.Field()


class WebsiteSpider(CrawlSpider):
    name = "posts"
    start_urls = ["https://www.citypopulation.de/en/southkorea/"]

    rules = (
        # follow the links to the individual province pages
        Rule(LinkExtractor(restrict_css="div#prov_div > ul > li > a"), follow=True),
        # extract a link from each table cell and parse the linked page
        Rule(LinkExtractor(restrict_css="table#tl > tbody > tr > td"), callback="parse"),
    )

    def parse(self, response):
        website_item = WebsiteItem()
        website_item["item_name"] = response.css("td.rname span::text").get()
        website_item["item_status"] = response.css("td.rstatus::text").get()
        return website_item
I assume this is because the rules force the crawl to be based on links, but I don't know how to get around that while still looping through every record in the table.
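As far as I know, LinkExtractor only ever extracts links from <a> (and <area>) elements, so the empty <td class="sc"> cell in the Busan row contributes nothing. A quick scrapy shell check along these lines (illustrative only; the selector is the one from my second rule) seems to confirm that:

$ scrapy shell "https://www.citypopulation.de/en/southkorea/busan/admin/"
>>> from scrapy.linkextractors import LinkExtractor
>>> le = LinkExtractor(restrict_css="table#tl > tbody > tr > td")
>>> len(le.extract_links(response))  # one link per district row; the
...                                  # link-less Busan row yields nothing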
I'd appreciate it if anyone could point out what I'm missing here.
1 Answer
Here is one way of getting those name/status pairs:
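A minimal sketch of such a spider: instead of extracting links from the cells, it iterates over the table rows directly, so rows without a link are included as well. The spider name sk matches the crawl command below, and the start URL and selectors are taken from the question; everything else is an assumption:

import scrapy


class SouthKoreaSpider(scrapy.Spider):
    # name chosen to match the crawl command below
    name = "sk"
    start_urls = ["https://www.citypopulation.de/en/southkorea/busan/admin/"]

    def parse(self, response):
        # Walk the table rows themselves rather than extracting links,
        # so the Busan row (whose td.sc cell is empty) is not skipped.
        for row in response.css("table#tl > tbody > tr"):
            # A descendant ::text selector matches both <a><span>Name</span></a>
            # and the plain "Busan" text that sits directly in the cell.
            name = row.css("td.rname ::text").get()
            status = row.css("td.rstatus::text").get()
            if name and status:
                yield {"item_name": name.strip(), "item_status": status.strip()}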
Run it with

scrapy crawl sk -o sk_areas.json

and it will generate a JSON file with the structure shown below. As you can see, it also includes Busan.
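An illustrative excerpt of that file, using the values from the HTML snippets in the question (the real file lists every district):

[
  {"item_name": "Buk-gu", "item_status": "City District"},
  ...
  {"item_name": "Busan", "item_status": "Metropolitan City"}
]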