Parsing a CSV in Python: problem extracting the full text of a column

dced5bon posted 11 months ago in Python

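I am writing a Python function get_data() that takes no arguments. It should load a file named tx_deathrow_full.csv and return a list of dictionaries. Each dictionary should have 18 fields, corresponding to one row of the dataset.
An updated version of the dataset can be found at:
https://www.tdcj.texas.gov/death_row/dr_executed_offenders.html
My attempt:
I have written code that almost does this: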

import csv

def get_data():
    deathrow_data = []
    with open('tx_deathrow_full.csv', 'r') as fil:
        # Skips the first two rows (because of how the csv-file is given)
        next(fil)
        next(fil)

        reader = csv.reader(fil)
        for count, row in enumerate(reader):
            if count < 5:  # Sample of 5 rows
                row_data = row[0].split(',')
                row_dict = {
                    'Execution': int(row_data[0]),
                    'Date of Birth': row_data[1],
                    'Date of Offence': row_data[2],
                    'Highest Education Level': int(row_data[3]),
                    'Last Name': row_data[4],
                    'First Name': row_data[5],
                    'TDCJ Number': int(row_data[6]),
                    'Age at Execution': int(row_data[7]),
                    'Date Received': row_data[8],
                    'Execution Date': row_data[9],
                    'Race': row_data[10],
                    'County': row_data[11],
                    'Eye Color': row_data[12],
                    'Weight': int(row_data[13]),
                    'Height': row_data[14],
                    'Native County': row_data[15],
                    'Native State': row_data[16],
                    'Last Statement': row_data[17].rstrip(';') # To remove ';' in the end of the Last Statement.
                }
                deathrow_data.append(row_dict)
            else:
                break
    return deathrow_data

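However, there is a problem with the 'Last Statement' field. My code captures the text only up to the first comma and cuts off the rest, so the full statement never makes it into the dictionary.
Example:
In the CSV file, row 6 is: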

550,1987-04-04,2008-04-06,11,Davila,Erick Daniel,999545,31,2009-02-27,2018-04-25,Black,Tarrant,Brown,161,"5' 11""",Tarrant,Texas,"Yes, I would like to say nephew it burns huh. You know I might have lost the fight but I'm still a soldier. I still love you all. To my supporters and family y'all hold it down. Ten Toes down right. That's all."



But when I run my code, it returns only the following output:

[... , {'Execution': 550, 'Date of Birth': '1987-04-04', 'Date of Offence': '2008-04-06', 'Highest Education Level': 11, 'Last Name': 'Davila', 'First Name': 'Erick Daniel', 'TDCJ Number': 999545, 'Age at Execution': 31, 'Date Received': '2009-02-27', 'Execution Date': '2018-04-25', 'Race': 'Black', 'County': 'Tarrant', 'Eye Color': 'Brown', 'Weight': 161, 'Height': '"5\' 11"""', 'Native County': 'Tarrant', 'Native State': 'Texas', 'Last Statement': '"Yes'}, ...]


Note that 'Last Statement' is wrongly truncated to '"Yes' instead of containing the full text.
This is the CSV file before downloading.
This is the CSV file after I downloaded it and opened it in Excel.

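**Note:** both snippets include the example shown above.
Question:

Given the structure of the CSV file, how can I modify the code so that the complete 'Last Statement' text is captured and included in the dictionary?

What I have tried:

I tried to solve this with the Pandas library, but without success, because I am not familiar with it.
I also tried the following snippet earlier, but then changed it to the version shared above, because that seemed closer to producing the desired result.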

import csv

def get_data():
    """
    Returnerer deathrow_data opdateret! (Returns the updated deathrow_data.)
    """
    deathrow_data = []  # Tom liste til rækkerne (empty list for the rows)
    with open('tx_deathrow_full.csv', 'r') as fil:
        fieldnames1 = ['Execution', 'Date of Birth', 'Date of Offence', 'Highest Education Level', 'Last Name',
                       'First Name', 'TDCJ\nNumber', 'Age at Execution', 'Date Received', 'Execution Date', 'Race',
                       'County', 'Eye Color', 'Weight', 'Height', 'Native County', 'Native State', 'Last Statement']
        reader = csv.DictReader(fil, fieldnames=fieldnames1, delimiter=';')

        next(reader)  # Springer første linje over (skips the first line, the header)
        next(reader)  # Springer anden linje over (skips the second line)

        count = 0
        for row in reader:
            if count < 4:  # Laver en sample på 3 rækker (takes a sample of 3 rows)
                deathrow_data.append(row)
                count += 1
            else:
                break
    return deathrow_data

result = get_data()
print(result)


Never mind the Danish comments in the code.

webghufk

webghufk1#

You don't need row[0].split(','). csv.reader() already returns each row as a list of fields, so access the row elements directly:

import csv

def get_data():
    deathrow_data = []
    with open('tx_deathrow_full.csv', 'r') as fil:
        # Skips the first two rows (because of how the csv-file is given)
        next(fil)
        next(fil)

        reader = csv.reader(fil)
        for count, row_data in enumerate(reader):
            if count < 5:  # Sample of 5 rows
                row_dict = {
                    'Execution': int(row_data[0]),
                    'Date of Birth': row_data[1],
                    'Date of Offence': row_data[2],
                    'Highest Education Level': int(row_data[3]),
                    'Last Name': row_data[4],
                    'First Name': row_data[5],
                    'TDCJ Number': int(row_data[6]),
                    'Age at Execution': int(row_data[7]),
                    'Date Received': row_data[8],
                    'Execution Date': row_data[9],
                    'Race': row_data[10],
                    'County': row_data[11],
                    'Eye Color': row_data[12],
                    'Weight': int(row_data[13]),
                    'Height': row_data[14],
                    'Native County': row_data[15],
                    'Native State': row_data[16],
                    'Last Statement': row_data[17].rstrip(';') # To remove ';' in the end of the Last Statement.
                }
                deathrow_data.append(row_dict)
            else:
                break
    return deathrow_data

I tested this with row 6 as posted in your question, and it runs without errors:

[{'Execution': 550,
  'Date of Birth': '1987-04-04',
  'Date of Offence': '2008-04-06',
  'Highest Education Level': 11,
  'Last Name': 'Davila',
  'First Name': 'Erick Daniel',
  'TDCJ Number': 999545,
  'Age at Execution': 31,
  'Date Received': '2009-02-27',
  'Execution Date': '2018-04-25',
  'Race': 'Black',
  'County': 'Tarrant',
  'Eye Color': 'Brown',
  'Weight': 161,
  'Height': '5\' 11"',
  'Native County': 'Tarrant',
  'Native State': 'Texas',
  'Last Statement': 'Yes, I would like to say nephew it burns huh. You know I '
                    "might have lost the fight but I'm still a soldier. I "
                    "still love you all. To my supporters and family y'all "
                    "hold it down. Ten Toes down right. That's all."}]

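To see why the hand-rolled split truncates the field while csv.reader does not, here is a minimal sketch. The row is a shortened, made-up example in the same shape as the real data, not a line from the file.

```python
import csv
import io

# Shortened, made-up row: the last field is quoted because it contains a comma.
line = '550,Davila,"Yes, that is all."\n'

# Splitting on every comma ignores the quotes and cuts the last field in two.
naive = line.rstrip("\n").split(",")

# csv.reader honours the quotes and returns the field whole, without them.
parsed = next(csv.reader(io.StringIO(line)))

print(naive)   # four pieces, the third one is '"Yes'
print(parsed)  # three fields, the last one is the full sentence
```
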
os8fio9y

os8fio9y2#

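I can't comment yet, but in addition to Barmar's answer, I would suggest:
1. Open the file with newline='', because the docs suggest doing so.
2. Pass the delimiter to the reader explicitly: reader = csv.reader(fil, delimiter=',')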
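
A minimal sketch of both suggestions together. The function name read_rows, the limit parameter, the utf-8 encoding and the throwaway sample file are assumptions made for illustration; they are not part of the question.

```python
import csv
import os
import tempfile

def read_rows(path, limit=5):
    """Return up to `limit` rows of a comma-separated file as lists of strings."""
    rows = []
    # newline="" leaves line endings to the csv module, so line breaks
    # inside quoted fields survive. The encoding is an assumption.
    with open(path, "r", newline="", encoding="utf-8") as fil:
        reader = csv.reader(fil, delimiter=",")
        for count, row in enumerate(reader):
            if count >= limit:
                break
            rows.append(row)
    return rows

# Demonstration on a throwaway file with a quoted comma and a quoted line break.
with tempfile.TemporaryDirectory() as tmp:
    sample_path = os.path.join(tmp, "sample.csv")
    with open(sample_path, "w", newline="", encoding="utf-8") as out:
        out.write('id,statement\n1,"Yes, I would"\n2,"two\nlines"\n')
    sample_rows = read_rows(sample_path)

print(sample_rows)
```
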
Hope this helps!

8hhllhi2

8hhllhi23#

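Have you looked at the csv module's DictReader? It reads the header, then reads each row into a dict, using the field names from the header as keys. Given a simple CSV like: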

input.csv
---------
Col1,Col2,Col3
r1c1,r1c2,r1c3
r2c1,r2c2,r2c3
r3c1,r3c2,r3c3
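
A minimal sketch of reading that sample with DictReader. The text is inlined through io.StringIO so the snippet runs on its own; with a real file you would pass the object returned by open("input.csv", newline="") instead.

```python
import csv
import io

# The sample CSV from above, inlined so the sketch is self-contained.
sample = (
    "Col1,Col2,Col3\n"
    "r1c1,r1c2,r1c3\n"
    "r2c1,r2c2,r2c3\n"
    "r3c1,r3c2,r3c3\n"
)

# DictReader consumes the first row as the header and uses it for the keys.
reader = csv.DictReader(io.StringIO(sample))
rows = list(reader)

print(rows)
```
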

Reading this file with csv.DictReader and collecting the rows into a list gives:

[
    {"Col1": "r1c1", "Col2": "r1c2", "Col3": "r1c3"},
    {"Col1": "r2c1", "Col2": "r2c2", "Col3": "r2c3"},
    {"Col1": "r3c1", "Col2": "r3c2", "Col3": "r3c3"},
]


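I found the old dataset in pydango/selectstarsql/sql_buddy; it stops in 2018. Does your header look like this, with a line break in it? (I assume so, since you skip the first two lines and comment on how the CSV file is given.)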

Execution,Date of Birth,Date of Offence,Highest Education Level,Last Name,First Name,"TDCJ
Number",Age at Execution,Date Received,Execution Date,Race,County,Eye Color,Weight,Height,Native County,Native State,Last Statement


I suggest that you either edit the CSV by hand and remove the line break after TDCJ, or preprocess the file in Python, like this:

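# Read the raw text; newline="" keeps the original line endings untouched.
with open("input_tx_dr.csv", newline="") as f:
    text = f.read()

# Join the header name that is split across two lines.
text = text.replace("TDCJ\nNumber", "TDCJ Number")

# Write with newline="" as well, so the line endings are not translated again.
with open("input_tx_dr_fixed.csv", "w", newline="") as f:
    f.write(text)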


Now, with a good header, you can let DictReader do most of the work:

import csv
from datetime import datetime

samples = []

with open("input_tx_dr_fixed.csv", newline="") as f:
    reader = csv.DictReader(f)

    for i, row in enumerate(reader):
        if i == 5:
            break

        for k in [
            "Execution",
            "Highest Education Level",
            "TDCJ Number",
            "Age at Execution",
            "Weight",
        ]:
            row[k] = int(row[k])

        for k in [
            "Date of Birth",
            "Date of Offence",
            "Date Received",
            "Execution Date",
        ]:
            row[k] = datetime.strptime(row[k], r"%Y-%m-%d")

        samples.append(row)
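
If you would rather not write a second file, the header can probably be cleaned up in memory instead. This is only a sketch: read_with_clean_header is a hypothetical helper, the sample row is made up, and it assumes the header is the first row of the input (any preamble lines would have to be skipped first).

```python
import csv
import io

def read_with_clean_header(f):
    """Read a CSV file object into dicts, collapsing whitespace in header names."""
    reader = csv.DictReader(f)
    # Accessing fieldnames reads the header row; assigning back replaces the
    # keys, so "TDCJ\nNumber" becomes "TDCJ Number".
    reader.fieldnames = [" ".join(name.split()) for name in reader.fieldnames]
    return list(reader)

# Made-up, shortened sample with the same kind of line break in the header.
sample = 'Execution,"TDCJ\nNumber",Last Statement\n550,999545,"Yes, that is all."\n'
rows = read_with_clean_header(io.StringIO(sample))

print(rows)
```
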
