pandas: How to print/export a series of strings scraped in a loop with Selenium in Python to a CSV file using pandas

1mrurvl1, posted 11 months ago in Python

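I'm a beginner in Python, using Selenium (+ Chrome) and pandas to scrape a series of strings from a website.
The code below first logs in to a web page, builds a list of the subpages listed there (every link is named "Detail"), loops over them, builds a sub-list of pages (from the links that contain a euro sign), and then loops over those to scrape a set of values (building_name, building_code and total_cost).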

Website structure

Main page
    Page Detail 1
        Page €A
            building_name
            building_code
            total_cost
        Page €B
        Page €C
    Page Detail 2
        Page €A
        Page €B
        Page €C
    Page Detail 3
        Page €A
        Page €B
        Page €C

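The problem I now have is getting these values into a single pandas DataFrame and exporting it as CSV.
screenshot of incorrect data
As you can see, the data is oriented incorrectly (the building name should be "Bungalow"), and there is only one entry where there should be many.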

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import time
import re
import pandas as pd

# Start Chrome and navigate to the home page (login steps omitted)
driver = webdriver.Chrome()
driver.get("http://example.com")

# Wait for the new page to load after logins
WebDriverWait(driver, 5).until(EC.presence_of_element_located((By.ID, '__layout')))

# Find all links with the text 'Detail'
detail_links = [x.get_attribute('href') for x in driver.find_elements(By.LINK_TEXT, 'Detail')]

for url in detail_links:
    driver.get(url)
    WebDriverWait(driver, 5).until(EC.presence_of_element_located((By.ID, '__layout')))
    
    # Find all the links that have a euro sign in their accompanying text label
    euro_links = [x.get_attribute('href') for x in driver.find_elements(By.XPATH, "//a[span[contains(text(), '€')]]")]

    for url2 in euro_links:
        driver.get(url2)
        WebDriverWait(driver, 15).until(EC.presence_of_element_located((By.ID, '__layout')))
        
        # Get the building cost string
        total_cost = driver.find_element(By.XPATH, '//*[@id="app"]/div/span').text
        
        # Get the building name
        building_name = driver.find_element(By.XPATH,'//*[@id="app"]/li[7]/a').text

        # Get the building ID code from the url
        type_url = driver.find_element(By.XPATH,'//*[@id="app"]/li[7]/a').get_attribute("href")
        try:
            building_code = re.search('=(.+?)&', type_url).group(1)
        except AttributeError:
            # AAA, ZZZ not found in the original string
            building_code = ''  # apply your error handling

        # Output the building name, code and total cost to a row in a panda dataframe
        df = pd.DataFrame(list(zip(building_name, building_code, total_cost)), columns=['Name', 'Code', 'Total Costs'])

print(df)

driver.quit()


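I tried using the csv module, but I think pandas is easier to work with.
Can anyone help me get these looped text strings into a pandas DataFrame correctly and export them as CSV?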

Answer #1 (u3r8eeie)

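Without the URL it is a bit difficult to give an exact answer, but let me show the issue with the following. The main issue is that you overwrite your DataFrame in each iteration, which is why you only get one result.
It is better to append your DataFrames to a list and concat them after leaving your loops. An alternative approach is to build a list of dicts and create your DataFrame from that: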

...
data = []

for url in detail_links:
    driver.get(url)
    ...    
    for url2 in euro_links:
        driver.get(url2)
        ...    
        # Output the building name, code and total cost to a row in a panda dataframe
        data.append(pd.DataFrame([[building_name, building_code, total_cost]], columns=['Name', 'Code', 'Total Costs']))

print(pd.concat(data, ignore_index=True))
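
Both symptoms from the question can be reproduced without a browser. The sketch below is only an illustration: the `scraped` tuples are made-up stand-ins for whatever the real pages return. It shows that calling zip() on three strings pairs up their characters (which would explain the odd orientation), and that collecting one single-row DataFrame per page and concatenating once keeps every entry.

```python
import pandas as pd

# Hypothetical stand-ins for the values scraped on each "€" page.
scraped = [
    ("Bungalow", "B01", "€ 100"),
    ("Villa", "V02", "€ 250"),
]

# Symptom 1: zip() over three strings pairs up their characters,
# so each row holds one character from each value.
name, code, cost = scraped[0]
by_chars = pd.DataFrame(
    list(zip(name, code, cost)), columns=["Name", "Code", "Total Costs"]
)
print(by_chars)

# Symptom 2: assigning df inside the loop keeps only the last page.
# Fix: build one single-row frame per page, collect them, concat once.
frames = [
    pd.DataFrame([[name, code, cost]], columns=["Name", "Code", "Total Costs"])
    for name, code, cost in scraped
]
df = pd.concat(frames, ignore_index=True)
print(df)
```

Note the double brackets in `[[name, code, cost]]`: the outer list is the list of rows, the inner list is one row.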

Another approach is to store dicts instead of DataFrames:

...
data = []

for url in detail_links:
    driver.get(url)
    ...    
    for url2 in euro_links:
        driver.get(url2)
        ...    
        
        data.append({
            'Name': building_name,
            'Code': building_code,
            'Total': total_cost
        })
print(pd.DataFrame(data))

Answer #2 (kb5ga3dv)

Building on the other answer, this code works:

...
data = []

for url in detail_links:
    driver.get(url)
    ...    
    for url2 in euro_links:
        driver.get(url2)
        ...    
        # Output the building name, code and total cost to a row in a panda dataframe
        data.append([building_name, building_code, total_cost])

df = pd.DataFrame(data, columns=['Name', 'Code', 'Building Cost Total'])
print(df)
df.to_csv('data.csv', index=False, header=True)
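
To check the export step on its own, here is a minimal sketch that writes to an in-memory buffer instead of data.csv and reads the result back. The rows are hypothetical sample values in the same [name, code, cost] shape as `data` above, not real scraped output.

```python
import io

import pandas as pd

# Hypothetical rows in the same [name, code, cost] shape as `data` above.
data = [
    ["Bungalow", "B01", "€ 100"],
    ["Villa", "V02", "€ 250"],
]
df = pd.DataFrame(data, columns=["Name", "Code", "Building Cost Total"])

# Write to an in-memory buffer so the CSV text can be inspected directly.
buffer = io.StringIO()
df.to_csv(buffer, index=False)
csv_text = buffer.getvalue()
print(csv_text)

# Read the CSV text back to confirm rows and columns survive the round trip.
round_trip = pd.read_csv(io.StringIO(csv_text))
print(round_trip)
```

If the file is meant to be opened in Excel, passing encoding='utf-8-sig' to to_csv when writing to disk is a common way to get the euro sign displayed correctly; whether it is needed depends on the program reading the file.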
