Scrapy not scraping all table data

mefy6pfw · published 2022-11-09 · in: Other
Follow (0) | Answers (2) | Views (147)

I have been working on this website and everything works fine, but I cannot get two of the data items: pClose and Diff in the table. Is there any reason they are not being printed? When I try to print the item at index 7 with stock_data[7], I get a list index out of range error. What is the reason behind this? Below is my code:

import scrapy
from ..items import NepalLiveShareItem  # assumed: the Item class lives in your project's items.py

class FloorSheetSpider(scrapy.Spider):
    name = "nepse"
    # allowed_domains = ['nl.indeed.com']

    start_urls = ['https://merolagani.com/LatestMarket.aspx']

    items = []

    def parse(self, response):
        items = NepalLiveShareItem()
        for tr in response.xpath("//table[@class='table table-hover live-trading sortable']//tbody//tr"):
            stock_data = tr.css('td ::text').extract()
            items['symbol'] = stock_data[0]
            items['ltp'] = stock_data[1]
            items['percent_change'] = stock_data[2]
            items['open'] = stock_data[3]
            items['high'] = stock_data[4]
            items['low'] = stock_data[5]
            items['qty'] = stock_data[6]
            yield items
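
A quick way to see what is happening is to count the text cells in one row of the raw HTML, using the same selectors as the spider. A minimal sketch, assuming requests and parsel are installed:

import requests
from parsel import Selector

# fetch the page once, outside of Scrapy, and parse it with the spider's selectors
html = requests.get("https://merolagani.com/LatestMarket.aspx").text
rows = Selector(text=html).xpath(
    "//table[@class='table table-hover live-trading sortable']//tbody//tr"
)
if rows:
    stock_data = rows[0].css("td ::text").extract()
    # if this prints fewer than 9 values, index 7 cannot exist in the list
    print(len(stock_data), stock_data)

If fewer than nine values come back, the pClose and Diff columns are simply not present in the static markup; as both answers below explain, those cells are filled in by JavaScript after the page loads, so the plain HTML response never contains them.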

j0pj023g · Answer 1

They use an API, and you can use the API URL to get all the data items you need:

import scrapy
import json

class FloorSheetSpider(scrapy.Spider):
    name = "nepse"

    def start_requests(self):

        data = 'data=%7B%22H%22%3A%22stocktickerhub%22%2C%22M%22%3A%22GetAllStocks%22%2C%22A%22%3A%5B%5D%2C%22I%22%3A0%7D'
        # URL-decoded, the body reads: data={"H":"stocktickerhub","M":"GetAllStocks","A":[],"I":0},
        # i.e. a SignalR call to the GetAllStocks method of the stocktickerhub hub
        headers = {
            'Content-Type': 'application/x-www-form-urlencoded',
            'x-requested-with': 'XMLHttpRequest',
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36'
        }
        # note: this connectionToken is session-bound and will eventually expire;
        # copy a fresh one from the browser's network tab when the request starts failing
        api_url = 'https://merolagani.com/signalr/send?transport=serverSentEvents&connectionToken=0XQfxBGoFiTLTrkPGVyAlOY5VgH1Z35AyaVUdTFmCaVEN8nlgGSGWKXSG0PW4MRx3tlKjoYDk399QSBjc_73v-c_61jxJZ7tRsUBazU-uLGJCRppya00sqkJBkoYSvEdwVAfkQyxCLRKGdpnN0jDUzd1OASpHccFsFEgQFWgRyqXjrB5JN89lLoYn3sk79Kx0'
        yield scrapy.Request(
            url= api_url,
            method='POST',
            headers=headers,
            body=data,
            callback=self.parse,
            dont_filter=True
            )

    def parse(self, response):

        resp = json.loads(response.body)['R']['Stocks']
        for stock in resp.values():

            items = {}  # or use your Item class instead: items = NepalLiveShareItem()

            items['symbol'] = stock['s']
            items['ltp'] = stock['lp']
            items['percent_change'] = stock['pc']
            items['open'] = stock['op']
            items['high'] = stock['h']
            items['low'] = stock['l']
            items['qty'] = stock['q']
            items['pClose'] = stock['pl']
            items['Diff'] = stock['c']
            yield items

Output:

{'symbol': 'JFL', 'ltp': 372.0, 'percent_change': 0.54, 'open': 363.0, 'high': 375.0, 'low': 363.0, 'qty': 6495.0, 'pClose': 370.0, 'Diff': 2.0}
2022-07-16 21:47:08 [scrapy.core.scraper] DEBUG: Scraped from <200 https://merolagani.com/signalr/send?transport=serverSentEvents&connectionToken=0XQfxBGoFiTLTrkPGVyAlOY5VgH1Z35AyaVUdTFmCaVEN8nlgGSGWKXSG0PW4MRx3tlKjoYDk399QSBjc_73v-c_61jxJZ7tRsUBazU-uLGJCRppya00sqkJBkoYSvEdwVAfkQyxCLRKGdpnN0jDUzd1OASpHccFsFEgQFWgRyqXjrB5JN89lLoYn3sk79Kx0>
{'symbol': 'MMFDB', 'ltp': 1050.0, 'percent_change': 0.48, 'open': 1053.0, 'high': 1053.0, 'low': 1050.0, 'qty': 30.0, 'pClose': 1045.0, 'Diff': 5.0}
2022-07-16 21:47:08 [scrapy.core.scraper] DEBUG: Scraped from <200 https://merolagani.com/signalr/send?transport=serverSentEvents&connectionToken=0XQfxBGoFiTLTrkPGVyAlOY5VgH1Z35AyaVUdTFmCaVEN8nlgGSGWKXSG0PW4MRx3tlKjoYDk399QSBjc_73v-c_61jxJZ7tRsUBazU-uLGJCRppya00sqkJBkoYSvEdwVAfkQyxCLRKGdpnN0jDUzd1OASpHccFsFEgQFWgRyqXjrB5JN89lLoYn3sk79Kx0>
{'symbol': 'IGI', 'ltp': 370.0, 'percent_change': -1.33, 'open': 380.0, 'high': 380.0, 'low': 361.0, 'qty': 19847.0, 'pClose': 375.0, 'Diff': -5.0}
2022-07-16 21:47:08 [scrapy.core.scraper] DEBUG: Scraped from <200 https://merolagani.com/signalr/send?transport=serverSentEvents&connectionToken=0XQfxBGoFiTLTrkPGVyAlOY5VgH1Z35AyaVUdTFmCaVEN8nlgGSGWKXSG0PW4MRx3tlKjoYDk399QSBjc_73v-c_61jxJZ7tRsUBazU-uLGJCRppya00sqkJBkoYSvEdwVAfkQyxCLRKGdpnN0jDUzd1OASpHccFsFEgQFWgRyqXjrB5JN89lLoYn3sk79Kx0>
{'symbol': 'KBLD86', 'ltp': 1006.0, 'percent_change': 5.68, 'open': 970.9, 'high': 1009.5, 'low': 970.9, 'qty': 55.0, 'pClose': 951.9, 'Diff': 54.1}
2022-07-16 21:47:08 [scrapy.core.scraper] DEBUG: Scraped from <200 https://merolagani.com/signalr/send?transport=serverSentEvents&connectionToken=0XQfxBGoFiTLTrkPGVyAlOY5VgH1Z35AyaVUdTFmCaVEN8nlgGSGWKXSG0PW4MRx3tlKjoYDk399QSBjc_73v-c_61jxJZ7tRsUBazU-uLGJCRppya00sqkJBkoYSvEdwVAfkQyxCLRKGdpnN0jDUzd1OASpHccFsFEgQFWgRyqXjrB5JN89lLoYn3sk79Kx0>
{'symbol': 'JALPA', 'ltp': 2318.0, 'percent_change': 1.31, 'open': 2260.5, 'high': 2324.9, 'low': 2260.5, 'qty': 454.0, 'pClose': 2288.0, 'Diff': 30.0}
2022-07-16 21:47:08 [scrapy.core.engine] INFO: Closing spider (finished)

 'downloader/response_status_count/200': 1,
 'item_scraped_count': 231,
...and so on
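
One caveat about the spider above: the connectionToken baked into api_url is tied to a SignalR session and will eventually expire. The URL shape matches classic ASP.NET SignalR, which hands out tokens through a negotiate request, so you can fetch a fresh token first. The sketch below assumes the standard SignalR handshake; the /signalr/negotiate endpoint, its query parameters, and the ConnectionToken response field are protocol assumptions, not confirmed against this site, so verify them in your browser's network tab:

import json
import urllib.parse
import scrapy

class NepseTokenSpider(scrapy.Spider):
    name = "nepse_fresh_token"

    def start_requests(self):
        # SignalR handshake step 1: negotiate a fresh ConnectionToken for the hub
        connection_data = urllib.parse.quote(json.dumps([{"name": "stocktickerhub"}]))
        yield scrapy.Request(
            "https://merolagani.com/signalr/negotiate"
            f"?clientProtocol=1.5&connectionData={connection_data}",
            callback=self.send_hub_call,
        )

    def send_hub_call(self, response):
        # URL-encode the fresh token and plug it into the same send URL used above
        token = urllib.parse.quote(json.loads(response.body)["ConnectionToken"], safe="")
        api_url = ("https://merolagani.com/signalr/send"
                   f"?transport=serverSentEvents&connectionToken={token}")
        self.logger.info("fresh send URL: %s", api_url)

From there the POST body and headers are the same as in the answer above.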


bvk5enib · Answer 2

Import the selenium components properly and use chromedriver; this works even in a Jupyter notebook:

from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.chrome.options import Options
import pandas as pd

chrome_options = Options()
chrome_options.add_argument("--headless")
chrome_options.add_argument("--no-sandbox")
webdriver_service = Service("chromedriver/chromedriver") ## path to where you saved chromedriver binary
browser = webdriver.Chrome(service=webdriver_service, options=chrome_options)

Although it is hard to target the specific <td> that receives the live data, we can inspect the page and find another element that only shows up once the live data has arrived:

browser.get("https://merolagani.com/LatestMarket.aspx")
live_data = WebDriverWait(browser, 20).until(EC.element_to_be_clickable((By.XPATH, '//*[@id="index-slider"]')))
dfs= pd.read_html(browser.page_source)
browser.quit() ## very important, otherwise you end up using all memory
print(dfs[0].iloc[:,:9])

This returns the table with the live values as a DataFrame. All you need to do is wait until the live data has loaded into the page.
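
If you want the same dict-shaped records that the Scrapy answers produce, rename the columns and convert the frame. The column names below are hypothetical, chosen to mirror the fields used above; check them against what read_html actually returns before relying on them:

df = dfs[0].iloc[:, :9].copy()
# hypothetical names mirroring the Scrapy items; verify against df.columns first
df.columns = ["symbol", "ltp", "percent_change", "open", "high", "low", "qty", "pClose", "Diff"]
records = df.to_dict("records")  # one dict per stock, like the yielded items above
print(records[0])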
