pandas read_csv shows quotation marks on the first and last item of each row

lf5gs5x2  posted on 2023-07-31  in  Other

My .csv file looks like this:

"col1","col2"
"1","text1"
"2","This a "TEXT". However, I cannot parse it."

That is, a value contains both commas and quotation marks. Using the sep parameter of read_csv() leaves a quotation mark at the start and end of each row:

import pandas as pd

df = pd.read_csv('test.csv', sep = '","', engine = 'python')
df

    "col1   col2"
0   "1      text1"
1   "2      This a "TEXT". However, I cannot parse it."


How can I read my file correctly?

zzoitvuj  #1

Building on your interesting idea, you can also treat the first and last quotation marks as separators and then drop the unwanted columns:

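import io
import pandas as pd

data = io.StringIO('''"col1","col2"
"1","text1"
"2","This a "TEXT". However, I cannot parse it."
''')

# Split on "," and on the quote at the start and end of each line,
# then drop the two empty outer columns that this produces
df = pd.read_csv(data, sep=r'","|^"|"$', engine='python').iloc[:, 1:-1]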

Output:

   col1                                        col2
0     1                                       text1
1     2  This a "TEXT". However, I cannot parse it.
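
To see where the two extra outer columns come from, here is a minimal sketch that applies the same pattern to each line with the standard library's `re.split`. It assumes the python engine splits every line on the regex in roughly this way; the names `pattern`, `lines` and `rows` are only for illustration.

```python
import re

# Same pattern as the sep argument above
pattern = re.compile(r'","|^"|"$')

lines = [
    '"col1","col2"',
    '"1","text1"',
    '"2","This a "TEXT". However, I cannot parse it."',
]

# The pattern also matches the quotation mark at the start and at the end
# of a line, so every row begins and ends with an empty string.
rows = [pattern.split(line) for line in lines]
for row in rows:
    print(row)
```

Those empty first and last fields are the columns that `.iloc[:, 1:-1]` removes.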


The advantage is that you get the correct dtypes directly (if you need them):

df.dtypes

col1     int64
col2    object
dtype: object



iszxjhcz  #2

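The problem is that neither the commas nor the quotation marks in your CSV are escaped. Using "," as the delimiter is a clever approach, but it leaves a quotation mark at the start and end of each row. You can remove them afterwards: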

df.columns = ['col1', 'col2']
df['col1'] = df['col1'].str[1:].astype(int)
df['col2'] = df['col2'].str[:-1]

   col1                                        col2
0     1                                       text1
1     2  This a "TEXT". However, I cannot parse it.
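
For comparison with the unescaped file described above, this sketch uses the standard library `csv` module to show what the same value looks like when the inner quotation marks are escaped by doubling them. The in-memory buffer and the variable names are assumptions made for illustration; the sketch does not change how the original file was produced.

```python
import csv
import io

buffer = io.StringIO()
writer = csv.writer(buffer, quoting=csv.QUOTE_ALL, lineterminator="\n")
writer.writerow(["col1", "col2"])
writer.writerow(["2", 'This a "TEXT". However, I cannot parse it.'])

# The inner quotation marks are written as "" so the field stays unambiguous
escaped = buffer.getvalue()
print(escaped)

# A standard CSV reader turns the doubled quotes back into single ones
rows = list(csv.reader(io.StringIO(escaped)))
print(rows)
```

If you control how the file is written, escaping it this way should let a plain `pd.read_csv` call read it without a custom separator.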

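Here is another way: instead of matching "," as a whole, split on the comma alone and use a lookbehind and a lookahead to require a quotation mark on each side of it: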

df = pd.read_csv('test.csv', sep = r'(?<=\"),(?=\")', engine = 'python')

(df.applymap(lambda x: x.strip('"')) # remove quotation marks from the start and end of all values
    .rename(columns = lambda x: x.strip('"')) # same with column names
    .assign(col1 = lambda x: x.col1.astype(int))) # change col1 to be a column of ints
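
The lookaround pattern can be checked on a single line with the standard library's `re` module. This is only a sketch of the splitting step, under the assumption that the python engine applies the regex line by line; `fields` and `cleaned` are illustrative names.

```python
import re

# Split on a comma only when it sits between two quotation marks.
# The lookbehind and lookahead do not consume the quotes themselves.
pattern = re.compile(r'(?<="),(?=")')

line = '"2","This a "TEXT". However, I cannot parse it."'

fields = pattern.split(line)
print(fields)

# The outer quotes are still attached to each field, so strip them afterwards
cleaned = [field.strip('"') for field in fields]
print(cleaned)
```

Note that `strip('"')` removes every leading and trailing quotation mark, so a value that itself starts or ends with a quoted word would lose that quote too. Slicing with `field[1:-1]` is one way to avoid that.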
