Parse XML data and write it to a CSV file

huwehgph · published 2023-06-03 · in: Other
Follow (0) | Answers (2) | Views (202)

I am trying to scrape and save data on prison populations and prison population rates from https://www.prisonstudies.org/. The data are reported on each specific country's page, e.g. https://www.prisonstudies.org/country/italy
I have to write the scraped data (for all countries) into a single .csv file. It should contain 4 columns: Country Name, Year, Prison Population Total, Prison Population Date.
I have gotten this far, but I am a bit confused about the rest.
Expected output example:

Country Name, Year, Prison Population Total, Prison Population Date
Algeria,2000,33.992,108
Algeria,2003,39.806,122
Algeria,2004,44.231,134

Here is my code:

import requests
import elementpath
from xml.etree import ElementTree as ET
from bs4 import BeautifulSoup
from os.path import basename, dirname,abspath

url = "https://www.prisonstudies.org/world-prison-brief-data"

def parseCountries(url):
    r = requests.get(url)
    soup = ET.parse(r.text, 'lxml')
    regions = soup.findAll('div', {'class' : 'item-list'})
    out = {}
    for reg in regions:
        items = reg.findAll('a', href=True)
        for i in items:
            if i.text.strip() != '':
                out[i.text.strip()] = i['href']
    return(out)

def yearTableParser(countryUrl, countryName):
    r = requests.get(countryUrl)
    soup = BeautifulSoup(r.text, 'lxml')
    yearTab = soup.find('table', {'id':'views-aggregator-datatable'})
    out = []
    if yearTab is not None:
        rows = yearTab.findAll('tr')
        for r in rows:
            dat = r.findAll('td')
            if dat != []:
                out.append([countryName, dat[0].text.strip(),dat[1].text.replace('c','').replace(',','.').strip(),dat[2].text.replace('c','').replace(',','.').strip()])
    return(out)

qrjkbowd1#

This is a simplified answer that only addresses extracting the required data from the page. You can then write the data to your csv, etc.
The last column name should be Prison Population Rate, not ...Date.

import requests
from bs4 import BeautifulSoup

countryName = 'Morocco'
# country page (URL pattern taken from the question)
r = requests.get('https://www.prisonstudies.org/country/morocco')
soup = BeautifulSoup(r.text, 'lxml') # you don't need ElementTree for this

# use css selectors to extract the data:
table = soup.select('table[id="views-aggregator-datatable"] tbody tr')

for row in table:
    entry = [td.text.strip() for td in row.select('td')]
    entry.insert(0, countryName)
    print(entry)

Using Morocco as an example, the output would be:

['Morocco', '2000', '54,288', '187']
['Morocco', '2002', '54,351', '184']
['Morocco', '2004', '59,069', '195']
['Morocco', '2006', '53,580', '174']

And so on.
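The answer stops at print(entry); to finish the asker's task, the collected rows can be written out with the stdlib csv module. A minimal sketch, with a placeholder file name and sample rows in the shape produced by the loop above, and the last header using Rate as this answer notes:

```python
import csv

# rows in the shape produced by the scraping loop (country name at index 0)
rows = [
    ['Morocco', '2000', '54,288', '187'],
    ['Morocco', '2002', '54,351', '184'],
]

# 'prison_data.csv' is an arbitrary output name
with open('prison_data.csv', 'w', newline='', encoding='utf-8') as f:
    writer = csv.writer(f)
    writer.writerow(['Country Name', 'Year',
                     'Prison Population Total', 'Prison Population Rate'])
    writer.writerows(rows)
```

In the full script you would call writer.writerow(entry) inside the per-country loop instead of print(entry).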


p5fdfcr12#

I have no experience with web scraping, but I played around a bit with this page. Building on @Jack Fleeting's idea, I got this result:
Code:

import bs4 as bs
import urllib3

from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By

def grap_page(page):
    ### Create dict country name : link part ###
    map_link = {}
    map_county = {}
    soup  = bs.BeautifulSoup(page,'html.parser') # lxml, xml, html.parser parser possible
    
    for link in soup.find_all("a"):
        if link.has_attr('href') and "/country/" in link.get("href"):
            map_county[link.string] = link.get('href')
            
    org = 'https://www.prisonstudies.org'        
    for country, uri in map_county.items():
        map_link[country] = org+uri
           
    return map_link

def grap_table(countryName, org_uri):
    ### grab the table data from a country page ###
    data_list = []
    s = Service(ChromeDriverManager().install())
    options = Options()
    options.add_argument('--headless=new')

    driver = webdriver.Chrome(service=s, options=options)
    driver.maximize_window()
    driver.get(org_uri)
    #print(driver.page_source)
    page = driver.page_source
    driver.quit() # close the browser once the page source is captured
    soup  = bs.BeautifulSoup(page,'html.parser')
  
    # From answer of Mr. Jack Fleeting
    table = soup.select('table[id="views-aggregator-datatable"] tbody  tr')

    for row in table:
        entry = [td.text.strip() for td in row.select('td')]
        entry.insert(0, countryName)
        if entry not in data_list:
            data_list.append(entry)
    return data_list
    

if __name__ == "__main__":
    # Step 1: get the country links
    http = urllib3.PoolManager()
    url = 'https://www.prisonstudies.org/world-prison-brief-data'
    r = http.request('GET', url)
    country_dic = grap_page(r.data)
    
    # Step 2: read the table data from country page
    
    # Dummy for development only    
    tab = grap_table('algerien', 'https://www.prisonstudies.org/country/algeria') #link
    for row in tab:
        print(row)
    
    """ For scraping all pages
    for country, link in country_dic.items():
        tab = grap_table(country, link)
        print(tab) ### write it to a csv file
    """
    """

Output for Algeria (I ran only this page, not all of them, but the rest should look the same):

['algerien', '2000', '33,992', '108']
['algerien', '2003', '39,806', '122']
['algerien', '2004', '44,231', '134']
['algerien', '2006', '54,117', '159']
['algerien', '2008', '55,598', '158']
['algerien', '2010', '58,000', '161']
['algerien', '2012', '55,000', '147']
['algerien', '2014', '61,000', '155']
['algerien', '2016', 'c 60,000', 'c 148']
['algerien', '2018', 'c 63,000', 'c 151']
