Python web scraping with Selenium: prices and hidden elements

jxct1oxe · posted 2022-11-10 in Python
Follow (0) | Answers (2) | Views (244)

On this web page: https://www.centris.ca/en/properties~for-sale~brossard?view=Thumbnail
I am trying to do two things:
1. Get the price of each listing
2. Get the MLS number of each listing

from selenium import webdriver

from webdriver_manager.chrome import ChromeDriverManager
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
import time 

url = 'https://www.centris.ca/en/properties~for-sale~brossard?view=Thumbnail'

def scrap_pages(driver):
    listings = driver.find_elements(By.CLASS_NAME, 'description')

    # Drop the trailing card if it is empty (e.g. an ad/placeholder tile)
    if listings[-1].text.split('\n')[0] == '':
        del listings[-1]

    for listing in listings:
        # The whole description text, one field per line
        fields = listing.text.split('\n')
        print(fields)

        price = fields[0]
        prop_type = fields[1]
        addr = fields[2]
        city = fields[3]
        sector = fields[4]
        bedrooms = fields[5]
        bathrooms = fields[6]

        listing_item = {
            'price': price,
            'address': addr,
            'property type': prop_type,
            'city': city,
            'bedrooms': bedrooms,
            'bathrooms': bathrooms,
            'sector': sector,
        }

        centris_list.append(listing_item)

if __name__ == '__main__':
    chrome_options = Options()
    chrome_options.add_experimental_option("detach", True)
    # chrome_options.add_argument("headless")

    driver = webdriver.Chrome(ChromeDriverManager().install(), options=chrome_options)
    centris_list = []

    driver.get(url)

    # The "current / total" pager text gives the number of result pages
    total_pages = int(driver.find_element(By.CLASS_NAME, 'pager-current').text.split('/')[1].strip())

    for page in range(1, total_pages + 1):
        scrap_pages(driver)
        if page < total_pages:  # the last page has no "next" to click
            driver.find_element(By.CSS_SELECTOR, 'li.next > a').click()
            time.sleep(0.8)

My code above does get the prices, but not the way I want: I don't like having to grab the whole description text and then go through the split/index dance. I tried to get the price with each of the approaches below instead, but none of them worked; they all returned a "no such element" error. If I can get the price working, I can probably adapt the same idea for the other fields.


# price= listing.find_element(By.CLASS_NAME, 'price').text

# price= listing.find_element(By.XPATH, './/*[@id="divMainResult"]/div[1]/div/div[2]/a/div[2]/span[1]').text

# price= listing.find_element(By.XPATH, './/*[@id="divMainResult"]/div[1]/div/div[2]/a/div[2]/meta[2]').text

# price = listing.find_element(By.CSS_SELECTOR, '#divMainResult > div:nth-child(1) > div > div.description > a > div.price').text
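Why the XPath attempts above fail: they are run from the listing element, so .//*[@id="divMainResult"]/... searches inside the card, while divMainResult is one of its ancestors and can never be matched there; the By.CLASS_NAME attempt can also blow up on a trailing placeholder card that has no price element. A relative lookup scoped to the card is the usual fix. A minimal sketch, assuming the div.price / span structure implied by the question's own selectors (verify the class names against the live HTML), with a guard for cards that have no price:

from selenium.common.exceptions import NoSuchElementException

for listing in listings:
    try:
        # relative lookup, scoped to this listing card only
        price = listing.find_element(By.CSS_SELECTOR, 'div.price span').text
    except NoSuchElementException:
        continue  # e.g. a trailing ad/placeholder card without a price
    print(price)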

For the second part of the question, getting the MLS number, I unfortunately never got it to work; every attempt below returned a "no such element" error. But if I look at the page's HTML source, I can see that each listing does have an MLS number: https://imgur.com/a/ZEoTLoO


# mls= listing.find_element(By.TAG_NAME, 'MlsNumberNoStealth').text

# mls = listing.find_element(By.CSS_SELECTOR, '#MlsNumberNoStealth').text

# mls = listing.find_element(By.ID, 'MlsNumberNoStealth').text

# mls = listing.find_element(By.XPATH, './/*[@id="MlsNumberNoStealth"]/p').text

# mls = listing.find_elements(By.TAG_NAME, 'div')

# mls = listing.find_elements(By.ID, 'MlsNumberNoStealth')
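A note that ties into the "hidden elements" part of the title: even once the element is located, Selenium's .text only returns text that is actually rendered, so it comes back empty for anything hidden with display:none, while get_attribute('textContent') reads the DOM text regardless of visibility. A hedged sketch, assuming the div with id MlsNumberNoStealth sits inside each .description card as the screenshot suggests:

# inside the per-listing loop
mls_el = listing.find_element(By.CSS_SELECTOR, 'div[id="MlsNumberNoStealth"] p')
mls = mls_el.get_attribute('textContent').strip()  # works even if the element is hidden
print(mls)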

z5btuh9x1#

You were close to the right approach.
Once you have the list of listing elements from the listings = driver.find_elements(By.CLASS_NAME, 'description') line, you can iterate over it and get the price and MLS of each one like this:

def scrap_pages(driver):
    listings = driver.find_elements(By.CLASS_NAME, 'description')

    for listing in listings:
        # The <meta itemprop="price"> tag holds the value in its "content"
        # attribute; .text would be empty because meta tags render no text.
        price = listing.find_element(By.XPATH, ".//div[@class='price']/meta[@itemprop='price']").get_attribute('content')
        # textContent also works for elements hidden with CSS, where .text
        # would return an empty string.
        mls = listing.find_element(By.XPATH, ".//div[@id='MlsNumberNoStealth']/p").get_attribute('textContent')

All the other details can be fetched in a similar way.
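Putting the two relative lookups together with the question's own text split for the remaining fields, the loop body could look roughly like this (a sketch only; it reuses the centris_list and the field order from the question, which you should re-check against the page):

for listing in listings:
    fields = listing.text.split('\n')
    if not fields[0]:
        continue  # skip empty/placeholder cards

    centris_list.append({
        'price': listing.find_element(
            By.XPATH, ".//div[@class='price']/meta[@itemprop='price']").get_attribute('content'),
        'mls': listing.find_element(
            By.XPATH, ".//div[@id='MlsNumberNoStealth']/p").get_attribute('textContent').strip(),
        'property type': fields[1],
        'address': fields[2],
        'city': fields[3],
        'sector': fields[4],
        'bedrooms': fields[5],
        'bathrooms': fields[6],
    })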


6jygbczu2#

I'm not a great programmer, but I work with HTML, CSS and JavaScript a lot. I believe you could hook in a piece of executable JavaScript that goes through something like this...

<!-- HTML: the element holding the hidden data -->
<body>
  <h1 id="h1">Hidden data</h1>
</body>

/* CSS: the element starts out hidden */
h1 {
  display: none;
}

// JavaScript: un-hide it so its text becomes visible
document.getElementById("h1").style.display = "block";
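In Selenium this idea maps onto driver.execute_script: you can either read the hidden element's textContent directly, or un-hide it first and then use .text as usual. A small sketch, assuming the MlsNumberNoStealth element from the question:

# option 1: read the hidden text directly via JavaScript
mls_el = listing.find_element(By.XPATH, ".//div[@id='MlsNumberNoStealth']/p")
mls = driver.execute_script("return arguments[0].textContent;", mls_el).strip()

# option 2: un-hide the element first, then use .text as usual
driver.execute_script("arguments[0].style.display = 'block';", mls_el)
mls = mls_el.text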
