CSV: match a pattern and put the following lines into a data structure

Posted by x33g5p2x on 2022-12-06 in Other

I have a data source that I periodically download to a CSV. It looks like this:

TABLE # 196712 / 9000_
>= 10   : 0.002
>= 5    : 0.001
>= 2    : 0.0005
>= 1    : 0.0002
>= 0.5  : 0.0001
>= 0.2  : 0.0001
>= 0.1  : 0.0001
>= 0.0001   : 0.0001
TABLE # 196714 / Dark
>= 0.0001   : 5e-05
TABLE # 196715 / GBD
>= 25   : 0.01
>= 10   : 0.005
>= 5    : 0.0025
>= 0.1  : 0.001
>= 0.0005   : 0.005

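I would like to parse the file and sort the data into a dictionary, where the number after the hash is the unique id (the new dict key) and the lines that follow (starting with >=) are the volume plus the associated penalty value.
Something like this would work: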

{196712: [(10,0.002),(5,0.001),(2,0.0005),(1,0.0002),(0.5,0.0001),(0.2,0.0001),(0.1,0.0001),(0.0001, 0.0001)], 
 196714: [(0.0001,5e-05)], 
 196715: [(25,0.01),(10,0.005),(5,0.0025),(0.1,0.001),(0.0005,0.005)]}

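Outside of Python, my approach to filtering this would be to grep for the header and take the lines that follow it, but the varying number of lines between ids makes that more complicated. I am also open to any other, more convenient data structure you would suggest.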

x4shl7ld1#

Try this:

s = """\
TABLE # 196712 / 9000_
>= 10   : 0.002
>= 5    : 0.001
>= 2    : 0.0005
>= 1    : 0.0002
>= 0.5  : 0.0001
>= 0.2  : 0.0001
>= 0.1  : 0.0001
>= 0.0001   : 0.0001
TABLE # 196714 / Dark
>= 0.0001   : 5e-05
TABLE # 196715 / GBD
>= 25   : 0.01
>= 10   : 0.005
>= 5    : 0.0025
>= 0.1  : 0.001
>= 0.0005   : 0.005"""

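import re

out = {}
# Each match is a table id plus the block of text up to the next
# "TABLE" header (or the end of the string).
for table, data in re.findall(
    r"^TABLE # (\d+).*?\n(.*?)(?=^TABLE|\Z)", s, flags=re.M | re.S
):
    table = int(table)
    # Within the block, capture each "volume : penalty" pair.
    for a, b in re.findall(r"([\de.+-]+)\s*:\s*([\de.+-]+)", data):
        out.setdefault(table, []).append((float(a), float(b)))

print(out)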

Prints:

{
    196712: [
        (10.0, 0.002),
        (5.0, 0.001),
        (2.0, 0.0005),
        (1.0, 0.0002),
        (0.5, 0.0001),
        (0.2, 0.0001),
        (0.1, 0.0001),
        (0.0001, 0.0001),
    ],
    196714: [(0.0001, 5e-05)],
    196715: [
        (25.0, 0.01),
        (10.0, 0.005),
        (5.0, 0.0025),
        (0.1, 0.001),
        (0.0005, 0.005),
    ],
}
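
Since the question is also open to how the data might be used, here is a minimal sketch of a lookup on top of the resulting dict. It assumes the rows are ordered from the largest threshold down (as in the sample) and that the first row whose threshold the volume meets is the one that applies. Neither assumption is stated in the question, and lookup_penalty is a hypothetical helper name.

```python
# Parsed result from above, written out so the sketch is self-contained.
tables = {
    196712: [(10.0, 0.002), (5.0, 0.001), (2.0, 0.0005), (1.0, 0.0002),
             (0.5, 0.0001), (0.2, 0.0001), (0.1, 0.0001), (0.0001, 0.0001)],
    196714: [(0.0001, 5e-05)],
    196715: [(25.0, 0.01), (10.0, 0.005), (5.0, 0.0025), (0.1, 0.001),
             (0.0005, 0.005)],
}


def lookup_penalty(tables, table_id, volume):
    """Return the penalty of the first row with volume >= threshold, else None.

    Assumption: rows are checked top to bottom and the first match wins.
    """
    for threshold, penalty in tables[table_id]:
        if volume >= threshold:
            return penalty
    return None  # volume is below every threshold in this table


print(lookup_penalty(tables, 196712, 3))        # 0.0005
print(lookup_penalty(tables, 196715, 0.05))     # 0.005
print(lookup_penalty(tables, 196714, 0.00001))  # None
```

If the real semantics differ (for example, the smallest matching threshold should win), only the loop order would need to change.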

jmp7cifd2#

import fileinput
import re
from collections import defaultdict

# Requires Python 3.10+ (structural pattern matching).

def parse_records(lines):
    # Yield a table id (str) for each header line and a
    # (volume, penalty) tuple of strings for each data line.
    for line in lines:
        if m := re.match(r'TABLE # (\d+) /.*', line):
            yield m.group(1)
        elif m := re.match(r'>= (\S+)\s+: (.*)', line):
            yield m.groups()

result = defaultdict(list)
record_id = None
# fileinput.input() reads the files named on the command line, or stdin.
for record in parse_records(fileinput.input()):
    match record:
        case (volume, penalty):
            result[record_id].append((float(volume), float(penalty)))
        case table_id:
            record_id = int(table_id)

print("{")
for key, value in result.items():
    print(f" {key}: {value}")
print("}")

Running it:

% python3 t2.py < input.txt
{
 196712: [(10.0, 0.002), (5.0, 0.001), (2.0, 0.0005), (1.0, 0.0002), (0.5, 0.0001), (0.2, 0.0001), (0.1, 0.0001), (0.0001, 0.0001)]
 196714: [(0.0001, 5e-05)]
 196715: [(25.0, 0.01), (10.0, 0.005), (5.0, 0.0025), (0.1, 0.001), (0.0005, 0.005)]
}
%
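
For interpreters older than Python 3.10, or where a regex feels like overkill, the same idea can be sketched with plain string methods. This version also keeps the label after the slash (9000_, Dark, GBD), which may or may not be what the asker wants. The function name parse_tables and the "label"/"rows" keys are hypothetical choices for illustration.

```python
def parse_tables(lines):
    """Parse 'TABLE # <id> / <label>' headers and '>= <volume> : <penalty>' rows."""
    tables = {}
    current = None  # entry for the table currently being filled
    for raw in lines:
        line = raw.strip()
        if line.startswith("TABLE #"):
            # 'TABLE # 196712 / 9000_' -> id 196712, label '9000_'
            id_part, _, label = line[len("TABLE #"):].partition("/")
            current = {"label": label.strip(), "rows": []}
            tables[int(id_part)] = current
        elif line.startswith(">=") and current is not None:
            # '>= 10   : 0.002' -> volume 10.0, penalty 0.002
            volume, _, penalty = line[2:].partition(":")
            current["rows"].append((float(volume), float(penalty)))
    return tables


sample = """\
TABLE # 196712 / 9000_
>= 10   : 0.002
>= 5    : 0.001
TABLE # 196714 / Dark
>= 0.0001   : 5e-05
"""

tables = parse_tables(sample.splitlines())
print(tables[196714])  # {'label': 'Dark', 'rows': [(0.0001, 5e-05)]}
```

Because parse_tables accepts any iterable of lines, an open file object can be passed directly, for example `with open("input.txt") as f: tables = parse_tables(f)`.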
