Scrapy: how do I fix data extraction when parsing the contact page does nothing?

35g0bw71  posted on 2022-11-23 in Other
import uuid
from typing import Dict, List

import pycountry
import scrapy
from locations.categories import Code
from locations.items import GeojsonPointItem

Create the metadata:

# Spider class
class TridentSpider(scrapy.Spider):
    name: str = 'trident_dac'
    spider_type: str = 'chain'
    spider_categories: List[str] = [Code.MANUFACTURING]
    spider_countries: List[str] = [pycountry.countries.lookup('in').alpha_3]
    item_attributes: Dict[str, str] = {'brand': 'Trident Group'}
    allowed_domains: List[str] = ['tridentindia.com']

    # Start the crawl from the contact page
    def start_requests(self):
        url: str = "https://www.tridentindia.com/contact"

        yield scrapy.Request(
            url=url,
            callback=self.parse_contacts
        )

Parse data from the website using XPath:

    def parse_contacts(self, response):
        email: List[str] = [
            response.xpath(
                "//*[@id='gatsby-focus-wrapper']/main/div[2]/div[2]/div/div[2]/div/ul/li[1]/a[2]/text()"
            ).get()
        ]

        phone: List[str] = [
            response.xpath(
                "//*[@id='gatsby-focus-wrapper']/main/div[2]/div[2]/div/div[2]/div/ul/li[1]/a[1]/text()"
            ).get(),
        ]

        address: List[str] = [
            response.xpath(
                "//*[@id='gatsby-focus-wrapper']/main/div[2]/div[1]/div/div[2]/div/ul/li[1]/address/text()"
            ).get(),
        ]

        dataUrl: str = 'https://www.tridentindia.com/contact'

        yield scrapy.Request(
            dataUrl,
            callback=self.parse,
            cb_kwargs=dict(email=email, phone=phone, address=address)
        )

Parse the data from the above:

    def parse(self, response, email: List[str], phone: List[str], address: List[str]):
        '''
        @url https://www.tridentindia.com/contact
        @returns items 1 6
        @cb_kwargs {"email": ["corp@tridentindia.com"], "phone": ["0161-5038888 / 5039999"], "address": ["E-212, Kitchlu Nagar Ludhiana - 141001, Punjab, India"]}
        @scrapes ref addr_full website
        '''
        responseData = response.json()

Response from data:

        for row in responseData['data']:
            data = {
                'ref': uuid.uuid4().hex,
                'addr_full': address,
                'website': 'https://www.tridentindia.com',
                'email': email,
                'phone': phone,
            }

            yield GeojsonPointItem(**data)

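I want to extract the addresses (locations) of the 6 offices from the HTML, along with their phone numbers and emails, because I could not find a JSON source that contains the data. At the end of the extraction I want to save the result as JSON so that I can load it onto a map and check whether the extracted addresses match their real locations. I am using Scrapy because I want to learn it, and I am new to web scraping with Scrapy.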

sy5wg1nm

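There are 6 offices, and none of them lists an email address, so it makes no sense to include an email field in the item. The way you are extracting the data is also incorrect. You can try the following example instead.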
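Before the spider itself, here is a minimal sketch of one reason the original `parse` callback cannot work, using only the standard library. The contact page is HTML, and Scrapy's `response.json()` parses the response body as JSON, so it raises an error on an HTML body. The HTML snippet and the helper name `looks_like_json` are made up for illustration. Note also that Scrapy's duplicate filter normally drops a second request to a URL that was already fetched unless `dont_filter=True` is passed, so that callback may never run at all.

```python
import json

# Hypothetical stand-in for the body of an HTML page such as /contact.
html_body = "<html><body><address>E-212, Kitchlu Nagar</address></body></html>"


def looks_like_json(text: str) -> bool:
    """Return True if text parses as JSON, False otherwise."""
    try:
        json.loads(text)
    except json.JSONDecodeError:
        return False
    return True


print(looks_like_json(html_body))       # the HTML body is not valid JSON
print(looks_like_json('{"data": []}'))  # a real JSON payload parses fine
```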

Code:

import scrapy


class TestSpider(scrapy.Spider):
    name = "test"

    def start_requests(self):
        url = 'https://www.tridentindia.com/contact'
        yield scrapy.Request(url, callback=self.parse)

    def parse(self, response):
        # Each <li> under the correspondence block is one office card
        for card in response.xpath('//*[@class="cp-correspondence typ-need-asst"]/ul/li'):
            yield {
                # Join the text nodes, keep what follows the last ':', drop soft hyphens
                'phone': ''.join(card.xpath('.//*[@class="address"]/span[2]//text()').getall()).split(':')[-1].replace('\xad', '').strip(),
                'address': card.xpath('.//*[@class="address"]/span[1]/text()').get(),
                'url': response.url,
            }

Output in JSON format:

[
    {
        "phone": "+91 - 161 - 5039999",
        "address": "E-212, Kitchlu Nagar Ludhiana - 141001, Punjab, India",
        "url": "https://www.tridentindia.com/contact"
    },
    {
        "phone": "1800 180 2999",
        "address": "Trident Group, Sanghera – 148101, India",
        "url": "https://www.tridentindia.com/contact"
    },
    {
        "phone": "0124 - 2350399",
        "address": "25, A, 15 Shahtoot Marg, DLF Phase-1, Sector 26A, Gurugram, Haryana-122002",
        "url": "https://www.tridentindia.com/contact"
    },
    {
        "phone": "0172 - 4602593 / 2742612",
        "address": "SCO 20 - 21, Sector 9D, Madhya Marg, Chandigarh - 160009",
        "url": "https://www.tridentindia.com/contact"
    },
    {
        "phone": "0755 - 2660479",
        "address": "Trident Limited, H.NO. - 3, Nadir Colony, Shyamla Hills, Bhopal - 462013",
        "url": "https://www.tridentindia.com/contact"
    },
    {
        "phone": "01679 - 244700 - 703 - 707",
        "address": "Trident Limited, Sanghera Complex, Raikot Road, Barnala - 148101, Punjab",
        "url": "https://www.tridentindia.com/contact"
    }
]
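The phone expression in the spider above chains several string operations on one line. As a rough sketch, they can be unpacked into a small helper that runs without Scrapy. The name `clean_phone` and the sample fragments are hypothetical; the fragments only imitate what `.getall()` might return for one office card.

```python
def clean_phone(text_nodes):
    """Join text fragments, keep what follows the last ':', drop soft hyphens."""
    joined = "".join(text_nodes)
    after_label = joined.split(":")[-1]  # whole string if there is no ':'
    return after_label.replace("\xad", "").strip()


# Hypothetical fragments: a label, a separator, and a number with a soft hyphen.
sample = ["Phone", ": ", "0124 - 2350\xad399 "]
print(clean_phone(sample))
```

Splitting on ':' and taking the last part means a value without any label passes through unchanged, which is why the same expression works for cards with and without a "Phone:" prefix.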
