regex: Reading a file with multiple separators in Pandas

fbcarpbf  posted on 2023-03-04  in Other

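I can't read a file that uses both ; and : as separators and has a different number of columns on each line.
I tried pd.read_csv(PATH, sep = '"\s+|;|:"', engine='python'), but only the parts delimited by ; were split; the parts delimited by : were not.
Sample text: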

S;540356274820466;0;NS;2.077867e+01;5:1.23552:4.04445e+12:454.462:0.998828

S;540356274820466;0;SN;2.077867e+01;10.1184:3.19213:1.80215:1.23552:5:2:642.45:601.13:245.744:-450.649:-312.861

B;540356274820466;0;BSN;2.077867e+01;0:1.012e+01:3.192e+00:1.802e+00:6:0:1:1.009e+01:0.000e+00:0.000e+00:1:0:1929.84:0.045349:nan:nan:nan:nan

S;540356274820466;1;NS;2.343362e+01;5:1.23552:8.12127e+12:171.825:0.984511

S;540356274820466;1;SN;2.343362e+01;8.90999:2.75048:1.63983:1.23552:5:2:295.479:238.863:-27.2251:-127.144:200.371

B;471288698479673;1;RLO_BEGIN;5.939171e+00;0:2.580e+01:9.689e+00:0.000e+00:3:0:1.883740e+02:1.804118e+02:1:2.527e+01:0.000e+00:0.000e+00:1:0:2.477457e+01:1.787137e+02:1.02091:1000:473.878:0.0780887

B;471288698479673;1;CIRC;5.939171e+00;473.878:0.0780887:436.873:0

B;471288698479673;1;CE;5.943525e+00;0:2.500e+01:9.851e+00:0.000e+00:3:0:1:2.565e+01:0.000e+00:0.000e+00:1:0:4.269463e+02:3.430075e+02:0

j5fpnvbx  #1

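You don't have multiple separators. The file is delimited by ; and the last field is a colon-separated "list" with a variable number of items: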

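import pandas as pd

# Split on ';' only, then expand the colon-separated last field into its own columns
df = pd.read_csv('data.csv', sep=';', header=None).add_prefix('Col')
df = df.join(df.pop('Col5').str.split(':', expand=True))
print(df)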

# Output
  Col0             Col1  Col2       Col3       Col4  ...            15       16    17       18         19
0    S  540356274820466     0         NS  20.778670  ...          None     None  None     None       None
1    S  540356274820466     0         SN  20.778670  ...          None     None  None     None       None
2    B  540356274820466     0        BSN  20.778670  ...           nan      nan   nan     None       None
3    S  540356274820466     1         NS  23.433620  ...          None     None  None     None       None
4    S  540356274820466     1         SN  23.433620  ...          None     None  None     None       None
5    B  471288698479673     1  RLO_BEGIN   5.939171  ...  1.787137e+02  1.02091  1000  473.878  0.0780887
6    B  471288698479673     1       CIRC   5.939171  ...          None     None  None     None       None
7    B  471288698479673     1         CE   5.943525  ...          None     None  None     None       None

[8 rows x 25 columns]
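
The pieces produced by str.split are strings, so the expanded columns come back as text rather than numbers. A minimal sketch of converting them, assuming every item is numeric or the literal nan (as in the sample); the two-line inline sample and the names parts and numeric are illustrative only and not part of the original answer:

```python
import io

import pandas as pd

# Two rows taken from the question's sample, inlined so the sketch runs on its own.
sample = (
    "S;540356274820466;0;NS;2.077867e+01;5:1.23552:4.04445e+12:454.462:0.998828\n"
    "B;471288698479673;1;CIRC;5.939171e+00;473.878:0.0780887:436.873:0\n"
)

df = pd.read_csv(io.StringIO(sample), sep=';', header=None).add_prefix('Col')

# Expand the colon-separated field; positions missing in shorter rows stay empty.
parts = df.pop('Col5').str.split(':', expand=True)

# Convert each expanded column from strings to floats (illustrative post-processing step).
numeric = parts.apply(pd.to_numeric, errors='coerce')

df = df.join(numeric)
print(df.dtypes)
```

errors='coerce' turns anything that cannot be parsed into NaN, which also covers the empty positions of the shorter rows.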

If you want to keep the last column as a list, omit expand=True:

df = pd.read_csv('data.csv', sep=';', header=None).add_prefix('Col')
df = df.join(df.pop('Col5').str.split(':'))
print(df)

# Output
  Col0             Col1  Col2       Col3       Col4                                               Col5
0    S  540356274820466     0         NS  20.778670       [5, 1.23552, 4.04445e+12, 454.462, 0.998828]
1    S  540356274820466     0         SN  20.778670  [10.1184, 3.19213, 1.80215, 1.23552, 5, 2, 642...
2    B  540356274820466     0        BSN  20.778670  [0, 1.012e+01, 3.192e+00, 1.802e+00, 6, 0, 1, ...
3    S  540356274820466     1         NS  23.433620       [5, 1.23552, 8.12127e+12, 171.825, 0.984511]
4    S  540356274820466     1         SN  23.433620  [8.90999, 2.75048, 1.63983, 1.23552, 5, 2, 295...
5    B  471288698479673     1  RLO_BEGIN   5.939171  [0, 2.580e+01, 9.689e+00, 0.000e+00, 3, 0, 1.8...
6    B  471288698479673     1       CIRC   5.939171                   [473.878, 0.0780887, 436.873, 0]
7    B  471288698479673     1         CE   5.943525  [0, 2.500e+01, 9.851e+00, 0.000e+00, 3, 0, 1, ...
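
In the list form, each cell holds a Python list of strings. A small sketch, under the same assumption that every item parses as a float, of converting the items and recording how many each row has; the column name n_items and the inline sample are made up for illustration:

```python
import io

import pandas as pd

# Two rows taken from the question's sample, inlined so the sketch runs on its own.
sample = (
    "S;540356274820466;0;NS;2.077867e+01;5:1.23552:4.04445e+12:454.462:0.998828\n"
    "B;471288698479673;1;CIRC;5.939171e+00;473.878:0.0780887:436.873:0\n"
)

df = pd.read_csv(io.StringIO(sample), sep=';', header=None).add_prefix('Col')

# Keep the last field as one list per row, converting each item to float.
# float() also accepts the literal 'nan' that appears in the sample data.
df['Col5'] = df['Col5'].str.split(':').apply(lambda items: [float(x) for x in items])

# Hypothetical helper column: number of items in each row's list.
df['n_items'] = df['Col5'].apply(len)
print(df[['Col3', 'n_items']])
```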

fykwrbwg  #2

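You must specify the Python engine.
The C engine cannot auto-detect the separator, but the Python engine can. Regular-expression separators such as ;|: are likewise handled only by the Python engine.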
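
To illustrate the two engine features mentioned here, a short sketch on made-up inline data (the sample strings and variable names are assumptions for illustration, not the asker's file): sep=None lets the Python engine sniff a single delimiter, and a regex separator splits on either character.

```python
import io

import pandas as pd

# Tiny made-up sample with a single, consistent delimiter.
text = "name;value;unit\nlength;2.5;m\nmass;7.1;kg\n"

# sep=None asks pandas to detect the delimiter; the sniffing is done by the Python engine.
sniffed = pd.read_csv(io.StringIO(text), sep=None, engine='python')
print(sniffed.columns.tolist())

# A regular-expression separator such as ';|:' is also parsed by the Python engine.
both = pd.read_csv(io.StringIO("a;b:c\n1;2:3\n"), sep=';|:', engine='python')
print(both.columns.tolist())
```

Sniffing can only settle on one delimiter, so for a file that mixes ; and : the regex form is the relevant one.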

So try this:

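import pandas as pd

test_path = '/home/.../test.csv'
sep = '|'.join([':', ';'])  # regex ':|;' matches either separator
header = 0                  # the first line of the file becomes the column names
index_col = False           # do not use the first column as the index

tst = pd.read_csv(filepath_or_buffer=test_path, sep=sep, header=header,
                  index_col=index_col, engine='python')
print(tst)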

Result:

S  540356274820466  0  ...   4.04445e+12    454.462  0.998828
0  S  540356274820466  0  ...  1.802150e+00    1.23552  5.000000
1  B  540356274820466  0  ...  3.192000e+00    1.80200  6.000000
2  S  540356274820466  1  ...  8.121270e+12  171.82500  0.984511
3  S  540356274820466  1  ...  1.639830e+00    1.23552  5.000000
4  B  471288698479673  1  ...  9.689000e+00    0.00000  3.000000
5  B  471288698479673  1  ...  4.368730e+02    0.00000       NaN
6  B  471288698479673  1  ...  9.851000e+00    0.00000  3.000000

[7 rows x 10 columns]

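Taking both separators into account, you get 10 columns. Note that with header=0 the first line of the file is used as the column names, and the longer rows appear in this output cut down to those 10 columns.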
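
For comparison with the attempt in the question: the pattern there contains literal double quotes, so its alternatives are "\s+ and ; and :" and a bare colon never matches. A rough sketch showing this with re.split, plus one possible way to keep every field of every ragged row by splitting the lines yourself before building the DataFrame (the inline sample and variable names are illustrative, not taken from the answers above):

```python
import re

import pandas as pd

# Two rows taken from the question's sample.
lines = [
    "S;540356274820466;0;NS;2.077867e+01;5:1.23552:4.04445e+12:454.462:0.998828",
    "B;471288698479673;1;CIRC;5.939171e+00;473.878:0.0780887:436.873:0",
]

# Pattern from the question, including the stray double quotes inside the string.
quoted = re.compile(r'"\s+|;|:"')
# Same idea without the quotes: split on ';' or ':'.
plain = re.compile(r';|:')

print(len(quoted.split(lines[0])))  # only ';' splits; colons are left alone
print(len(plain.split(lines[0])))   # both separators are used

# Split every line manually; DataFrame pads the shorter rows with missing values.
rows = [plain.split(line) for line in lines]
df = pd.DataFrame(rows)
print(df.shape)
```

All values are strings at this point, so a numeric conversion step would still be needed afterwards.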
