pandas: How to print/export a series of strings to CSV via pandas from a Selenium scraping loop in Python

Posted by 1mrurvl1 on 2024-01-04 in Python


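I'm a beginner in Python, using Selenium (+ Chrome) and pandas to scrape a series of strings from a website.
The code below first logs in to a web page, builds a list of the listed sub-pages (all of the links are named "Detail"), iterates through them, builds a sub-list of pages (based on links that have a euro sign in them), and then iterates through those to scrape a series of values (building_name, building_code and total_cost).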

Website structure

Main page
    Page Detail 1
        Page €A
            building_name
            building_code
            total_cost
        Page €B
        Page €C
    Page Detail 2
        Page €A
        Page €B
        Page €C
    Page Detail 3
        Page €A
        Page €B
        Page €C

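The problem I'm having now is using pandas to output these values into a single dataframe and export it as a CSV.
screenshot of incorrect data
As you can see, the data is oriented incorrectly (the building name should be "Bungalow"), and there is only one entry where there should be many.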

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import time
import re
import pandas as pd

# Start Chrome and navigate to the home page
driver = webdriver.Chrome()
driver.get("http://example.com")

# Wait for the new page to load after logins
WebDriverWait(driver, 5).until(EC.presence_of_element_located((By.ID, '__layout')))

# Find all links with the text 'Detail'
detail_links = [x.get_attribute('href') for x in driver.find_elements(By.LINK_TEXT, 'Detail')]

for url in detail_links:
    driver.get(url)
    WebDriverWait(driver, 5).until(EC.presence_of_element_located((By.ID, '__layout')))
    
    # Find all the links that have a euro sign in their accompanying text label
    euro_links = [x.get_attribute('href') for x in driver.find_elements(By.XPATH, "//a[span[contains(text(), '€')]]")]

    for url2 in euro_links:
        driver.get(url2)
        WebDriverWait(driver, 15).until(EC.presence_of_element_located((By.ID, '__layout')))
        
        # Get the building cost string
        total_cost = driver.find_element(By.XPATH, '//*[@id="app"]/div/span').text
        
        # Get the building name
        building_name = driver.find_element(By.XPATH,'//*[@id="app"]/li[7]/a').text

        # Get the building ID code from the url
        type_url = driver.find_element(By.XPATH,'//*[@id="app"]/li[7]/a').get_attribute("href")
        try:
            building_code = re.search('=(.+?)&', type_url).group(1)
        except AttributeError:
            # No '=...&' segment was found in the URL
            building_code = ''  # apply your error handling

        # Output the building name, code and total cost to a row in a panda dataframe
        df = pd.DataFrame(list(zip(building_name, building_code, total_cost)), columns=['Name', 'Code', 'Total Costs'])

print(df)

driver.quit()

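I've tried using csv, but I think pandas is easier to work with.
Can anyone help me get these looped text strings output correctly into a pandas dataframe and export them as CSV?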

pandas

Source: https://stackoverflow.com/questions/77742827/how-to-print-export-a-series-of-strings-to-csv-via-panda-from-a-selenium-scraped

2 answers


u3r8eeie1#

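Without the URL it is a bit hard to give an exact answer, but the main issue is that you overwrite your dataframe in each iteration, which is why you only get one result.
Better to append your dataframes to a list and concat them after leaving your loops. An alternative approach is to create a list of dicts and create your dataframe from there: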

...
data = []

for url in detail_links:
    driver.get(url)
    ...    
    for url2 in euro_links:
        driver.get(url2)
        ...    
        # Output the building name, code and total cost to a row in a panda dataframe
        # list() accepts only one iterable, so the three values go into a nested list (one row)
        data.append(pd.DataFrame([[type_name, type_code, bouwkosten_total]], columns=['Name', 'Code', 'Total Costs']))

print(pd.concat(data, ignore_index=True))

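As a side note on the orientation problem: the question's line `zip(building_name, building_code, total_cost)` pairs the three strings up character by character, which is likely why the frame looked wrong. A minimal sketch, assuming made-up values for one scraped page (the strings below are hypothetical, not taken from the real site):

```python
import pandas as pd

# Hypothetical values standing in for one scraped page
building_name = "Bungalow"
building_code = "B12"
total_cost = "€ 100"

# zip() walks the three strings in parallel, one character at a time,
# and stops as soon as the shortest string is exhausted
char_rows = list(zip(building_name, building_code, total_cost))
print(char_rows[0])

# Wrapping the three values in a nested list makes them one row instead
row = pd.DataFrame([[building_name, building_code, total_cost]],
                   columns=['Name', 'Code', 'Total Costs'])
print(row)
```

Each one-row frame built this way can be appended to `data` and combined with `pd.concat` after the loops.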
Another way is to store dicts instead of dataframes:

...
data = []

for url in detail_links:
    driver.get(url)
    ...    
    for url2 in euro_links:
        driver.get(url2)
        ...    
        
        data.append({
            'Name':type_name,
            'Code':type_code,
            'Total':bouwkosten_total
        })
print(pd.DataFrame(data))

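To try the pattern without a browser, here is a small self-contained sketch. It assumes a hypothetical `fake_site` dict in place of the Selenium calls (the page names and values are invented for illustration); the point is only that `data` is created once before both loops and the dataframe is built once after them.

```python
import pandas as pd

# Hypothetical stand-in for the site: detail page -> list of (name, code, cost)
fake_site = {
    "detail-1": [("Bungalow", "B01", "€ 100.000"), ("Villa", "V02", "€ 250.000")],
    "detail-2": [("Apartment", "A03", "€ 80.000")],
}

data = []  # created once, before both loops
for detail_page, euro_pages in fake_site.items():
    for name, code, cost in euro_pages:
        # one dict per scraped page; the keys become the column names
        data.append({"Name": name, "Code": code, "Total": cost})

df = pd.DataFrame(data)  # built once, after both loops
print(df)
```

Every inner iteration contributes exactly one row, so nothing is overwritten.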

Posted 2024-01-04

kb5ga3dv2#

Building on the other answer, this code works:

...
data = []

for url in detail_links:
    driver.get(url)
    ...    
    for url2 in euro_links:
        driver.get(url2)
        ...    
        # Output the building name, code and total cost to a row in a panda dataframe
        data.append([type_name, type_code, bouwkosten_total])

df = pd.DataFrame(data, columns=['Name', 'Code', 'Building Cost Total'])
print(df)
df.to_csv('data.csv', index=False, header=True)

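To check what the export produces without touching the disk, the same `to_csv` call can write to an in-memory buffer. This is a sketch with invented rows (the values are assumptions, not data from the site):

```python
import io
import pandas as pd

# Hypothetical rows in the same shape as the scraped data
rows = [["Bungalow", "B01", "€ 100.000"], ["Villa", "V02", "€ 250.000"]]
df = pd.DataFrame(rows, columns=['Name', 'Code', 'Building Cost Total'])

# Write the CSV into a text buffer instead of a file
buffer = io.StringIO()
df.to_csv(buffer, index=False)
csv_text = buffer.getvalue()
print(csv_text)

# Read it back to confirm the rows survive the round trip
buffer.seek(0)
restored = pd.read_csv(buffer)
print(restored)
```

When writing to a real file that will be opened in Excel, passing `encoding='utf-8-sig'` to `to_csv` may help the euro sign display correctly, though that depends on the Excel version and locale.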

Posted 2024-01-04
