Python: how do I get all results with BeautifulSoup and Selenium, and disable the automated test page?

Posted by u3r8eeie on 2023-02-15 in Python

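I am trying to scrape a website, but for some reason it only shows 24 results. How can I load all of the results while keeping the automated test page hidden?
My code is below: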

# import libraries
from selenium import webdriver
import pandas as pd
import bs4

# create the result lists
items = []
prices = []
volumes = []

# open the page and parse whatever has been rendered so far
driver = webdriver.Chrome()
driver.get("https://www.fairprice.com.sg/category/milk-powder")
soup = bs4.BeautifulSoup(driver.page_source, 'lxml')
allelem = soup.find_all('div', class_='sc-1plwklf-0 iknXK product-container')

# read the item names
for item in allelem:
    items.append(item.find('span', class_='sc-1bsd7ul-1 eJoyLL').text.strip())

# read the prices
for price in allelem:
    prices.append(price.find('span', class_='sc-1bsd7ul-1 sc-1svix5t-1 gJhHzP biBzHY').text.strip())

# read the volumes
for volume in allelem:
    volumes.append(volume.find('span', class_='sc-1bsd7ul-1 eeyOqy').text.strip())

print(items)
print(volumes)
print(prices)

# build one record per product
final_array = []
for item, price, volume in zip(items, prices, volumes):
    final_array.append({'Item': item, 'Volume': volume, 'Price': price})

# convert to a DataFrame and export to Excel
df = pd.DataFrame(final_array)
print(df)
df.to_excel('ntucv4milk.xlsx', index=False)

End of code.

Answer #1, by 8cdiaqws

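My suggestion is to define three lists (items, prices, volumes) that grow gradually as you scroll down the page. The site only loads more products once the last visible one scrolls into view, which is why a single read of the page source returns just 24 results. If you have a list of web elements called elements, you can scroll to the last one by running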

driver.execute_script('arguments[0].scrollIntoView({block: "center", behavior: "smooth"});', elements[-1])

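Then all you have to do is wait for new items to load and append them to the three lists. If no items load within a given time (max_wait, here 10 seconds), there are probably no more items to load and we can break out of the loop.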
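The timeout logic described above can also be isolated into a small helper. This is only a sketch: the name `wait_for_growth` and its `poll_interval` parameter are hypothetical and not part of the original answer, and the function knows nothing about Selenium. It simply polls any callable that returns a count.

```python
import time


def wait_for_growth(count_fn, previous_count, max_wait=10.0, poll_interval=0.5):
    """Poll count_fn() until it exceeds previous_count or max_wait seconds pass.

    Returns the new count, or None if nothing new appeared in time.
    """
    deadline = time.monotonic() + max_wait
    while time.monotonic() < deadline:
        current = count_fn()
        if current > previous_count:
            return current
        # Sleep between polls so the browser is not queried in a tight loop.
        time.sleep(poll_interval)
    return None
```

With Selenium, `count_fn` could be something like `lambda: len(driver.find_elements(By.CSS_SELECTOR, selector))`, where `selector` is the item selector; a return value of `None` would then be the signal to stop scrolling.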

import time

import pandas as pd
from selenium.webdriver.common.by import By

# assumes `driver` is already open on the category page
items, prices, volumes = [], [], []
c = 0  # number of elements already scraped
max_wait = 10  # seconds to wait for new items before giving up
no_new_items = False

while True:
    items_new = driver.find_elements(By.CSS_SELECTOR, 'span[class="sc-1bsd7ul-1 eJoyLL"]')
    if not items_new:
        break  # nothing matched, so there is nothing to scroll to
    # only read the elements that appeared since the previous pass
    items   += [item.text.strip()  for item  in items_new[c:]]
    prices  += [price.text.strip() for price in driver.find_elements(By.CSS_SELECTOR, 'span[class="sc-1bsd7ul-1 sc-1svix5t-1 gJhHzP biBzHY"]')[c:]]
    volumes += [vol.text.strip()   for vol   in driver.find_elements(By.XPATH, '//span[@class="sc-1bsd7ul-1 eeyOqy"][1]')[c:]]
    c = len(items_new)
    print(len(items), 'items scraped', end='\r')

    # scroll the last item into view to trigger loading of the next batch
    driver.execute_script('arguments[0].scrollIntoView({block: "center", behavior: "smooth"});', items_new[-1])

    items_loaded = items_new.copy()
    start = time.time()
    # wait up to `max_wait` seconds for new elements to be loaded
    while len(items_new) == len(items_loaded):
        items_loaded = driver.find_elements(By.CSS_SELECTOR, 'span[class="sc-1bsd7ul-1 eJoyLL"]')
        if time.time() - start > max_wait:
            no_new_items = True
            break
        time.sleep(0.5)  # pause between polls instead of busy-waiting
    if no_new_items:
        break

df = pd.DataFrame({'item': items, 'price': prices, 'volume': volumes})
print(df)
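The second half of the question (hiding the automated test page) is not covered by the loop above. Below is a minimal sketch, assuming the asker means Chrome's "Chrome is being controlled by automated test software" infobar or, alternatively, wants no browser window at all. The helper name `configure_chrome_options` is hypothetical; it only calls `add_experimental_option` and `add_argument`, which Selenium's `ChromeOptions` provides.

```python
def configure_chrome_options(options, headless=False):
    """Apply settings to a ChromeOptions-like object and return it.

    Assumption: `options` exposes add_experimental_option() and
    add_argument(), as selenium.webdriver.ChromeOptions does.
    """
    # Exclude the switch that makes Chrome display the automation infobar.
    options.add_experimental_option("excludeSwitches", ["enable-automation"])
    if headless:
        # Run Chrome without opening a visible window (recent Chrome versions).
        options.add_argument("--headless=new")
    return options


# Usage with Selenium (requires selenium and a local Chrome install):
#   from selenium import webdriver
#   options = configure_chrome_options(webdriver.ChromeOptions())
#   driver = webdriver.Chrome(options=options)
```

Whether the site behaves the same in headless mode is not something the original thread confirms, so it may be worth trying the visible window first.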

