Parse XML data and write it to a CSV file

huwehgph · published 2023-06-03 · in: Other
Follow (0) | Answers (2) | Views (202)

I am trying to scrape and save data on prison populations and prison population rates from https://www.prisonstudies.org/. The data are reported on each specific country's page, e.g. https://www.prisonstudies.org/country/italy
I have to write the scraped data (for all countries) into a single .csv file. It should contain 4 columns: Country Name, Year, Prison Population Total, Prison Population Date.
I have gotten this far, but I am a bit confused about the rest.
Expected output example:

Country Name, Year, Prison Population Total, Prison Population Date
Algeria,2000,33.992,108
Algeria,2003,39.806,122
Algeria,2004,44.231,134

Here is my code:

import requests
import elementpath
from xml.etree import ElementTree as ET
from bs4 import BeautifulSoup
from os.path import basename, dirname,abspath

url = "https://www.prisonstudies.org/world-prison-brief-data"

def parseCountries(url):
    r = requests.get(url)
    soup = ET.parse(r.text, 'lxml')
    regions = soup.findAll('div', {'class' : 'item-list'})
    out = {}
    for reg in regions:
        items = reg.findAll('a', href=True)
        for i in items:
            if i.text.strip() != '':
                out[i.text.strip()] = i['href']
    return(out)

def yearTableParser(countryUrl, countryName):
    r = requests.get(countryUrl)
    soup = BeautifulSoup(r.text, 'lxml')
    yearTab = soup.find('table', {'id':'views-aggregator-datatable'})
    out = []
    if yearTab is not None:
        rows = yearTab.findAll('tr')
        for r in rows:
            dat = r.findAll('td')
            if dat != []:
                out.append([countryName, dat[0].text.strip(),dat[1].text.replace('c','').replace(',','.').strip(),dat[2].text.replace('c','').replace(',','.').strip()])
    return(out)

qrjkbowd1#

This is a simplified answer that only addresses extracting the required data from the page. You can then write the data to your csv, etc.
The last column name should be Prison Population Rate, not ...Date.

import requests
from bs4 import BeautifulSoup

countryName = 'Morocco'
# country page (URL pattern taken from the question)
r = requests.get('https://www.prisonstudies.org/country/morocco')
soup = BeautifulSoup(r.text, 'lxml') # you don't need ElementTree for this

# use css selectors to extract the data:
table = soup.select('table[id="views-aggregator-datatable"] tbody tr')

for row in table:
    entry = [td.text.strip() for td in row.select('td')]
    entry.insert(0, countryName)
    print(entry)

Using Morocco as an example, the output would be:

['Morocco', '2000', '54,288', '187']
['Morocco', '2002', '54,351', '184']
['Morocco', '2004', '59,069', '195']
['Morocco', '2006', '53,580', '174']

And so on.
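The answer stops at print(entry); to finish the asker's task, the collected rows can be written out with the stdlib csv module. A minimal sketch, with a placeholder file name and sample rows in the shape produced by the loop above, and the last header using Rate as this answer notes:

```python
import csv

# rows in the shape produced by the scraping loop (country name at index 0)
rows = [
    ['Morocco', '2000', '54,288', '187'],
    ['Morocco', '2002', '54,351', '184'],
]

# 'prison_data.csv' is an arbitrary output name
with open('prison_data.csv', 'w', newline='', encoding='utf-8') as f:
    writer = csv.writer(f)
    writer.writerow(['Country Name', 'Year',
                     'Prison Population Total', 'Prison Population Rate'])
    writer.writerows(rows)
```

In the full script you would call writer.writerow(entry) inside the per-country loop instead of print(entry).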


p5fdfcr12#

I have no experience with web scraping, but I played around a bit with this page. Building on @Jack Fleeting's idea, I got this result:
Code:

import bs4 as bs
import urllib3

from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By

def grap_page(page):
    ### Create dict country name : link part ###
    map_link = {}
    map_county = {}
    soup  = bs.BeautifulSoup(page,'html.parser') # lxml, xml, html.parser parser possible
    
    for link in soup.find_all("a"):
        if link.has_attr('href') and "/country/" in link.get("href"):
            map_county[link.string] = link.get('href')
            
    org = 'https://www.prisonstudies.org'        
    for country, uri in map_county.items():
        map_link[country] = org+uri
           
    return map_link

def grap_table(countryName, org_uri):
    ### grab the table data from a country page ###
    data_list = []
    s = Service(ChromeDriverManager().install())
    options = Options()
    options.add_argument('--headless=new')

    driver = webdriver.Chrome(service=s, options=options)
    driver.maximize_window()
    driver.get(org_uri)
    #print(driver.page_source)
    page = driver.page_source
    driver.quit() # close the browser once the page source is captured
    soup  = bs.BeautifulSoup(page,'html.parser')
  
    # From answer of Mr. Jack Fleeting
    table = soup.select('table[id="views-aggregator-datatable"] tbody  tr')

    for row in table:
        entry = [td.text.strip() for td in row.select('td')]
        entry.insert(0, countryName)
        if entry not in data_list:
            data_list.append(entry)
    return data_list
    

if __name__ == "__main__":
    # Step 1: get the country links
    http = urllib3.PoolManager()
    url = 'https://www.prisonstudies.org/world-prison-brief-data'
    r = http.request('GET', url)
    country_dic = grap_page(r.data)
    
    # Step 2: read the table data from country page
    
    # Dummy for development only    
    tab = grap_table('algerien', 'https://www.prisonstudies.org/country/algeria') #link
    for row in tab:
        print(row)
    
    """ For scraping all pages
    for country, link in country_dic.items():
        tab = grap_table(country, link)
        print(tab) ### write it to a csv file
    """
    """

Output for Algeria (I ran only this page, not all of them, but the rest should look the same):

['algerien', '2000', '33,992', '108']
['algerien', '2003', '39,806', '122']
['algerien', '2004', '44,231', '134']
['algerien', '2006', '54,117', '159']
['algerien', '2008', '55,598', '158']
['algerien', '2010', '58,000', '161']
['algerien', '2012', '55,000', '147']
['algerien', '2014', '61,000', '155']
['algerien', '2016', 'c 60,000', 'c 148']
['algerien', '2018', 'c 63,000', 'c 151']
