Reading a complex table (task-spooler output) with pandas

o7jaxewo  posted 9 months ago  in Other

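I have the following table, which is the output of task-spooler.
It is easy for a human to parse, but I am having trouble reading it into a pandas DataFrame.
Any ideas?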

ID   State      Output               E-Level  Times(r/u/s)   Command [run=1/2]
6    running    /tmp/ts-out.FzVneG                           [l1]python infloop.py
0    finished   /tmp/ts-out.ixWHm2   0        0.00/0.00/0.00 bash -c echo 1
1    finished   /tmp/ts-out.ZzwS11   0        0.00/0.00/0.00 bash -c echo 1
2    finished   /tmp/ts-out.GJlyge   2        0.00/0.00/0.00 bash -c
4    finished   /tmp/ts-out.lIVMYH   2        0.00/0.00/0.00 bash -c -h
5    finished   /tmp/ts-out.8EKHy1   -1       141.23/0.00/0.00 python infloop.py
3    finished   /tmp/ts-out.lBr4Wy   -1       2545.36/0.00/0.02 bash -c python infloop.py
7    finished   /tmp/ts-out.kxCczi   2        0.01/0.00/0.00 bash -c
8    finished   /tmp/ts-out.3VkfNh   0        0.00/0.00/0.00 echo
9    finished   /tmp/ts-out.8ewxzl   0        0.01/0.00/0.00 echo
10   finished   /tmp/ts-out.ahSLaY   0        0.00/0.00/0.00 bash -c echo $GPUID
11   finished   /a/home/cc/cs/yuvval/tmp/ts-out.3dpaBO 0        0.00/0.00/0.00 bash -c ls
12   finished   /tmp/ts-out.ADWkve   0        0.00/0.00/0.00 bash -c ls
13   finished   /a/home/cc/cs/yuvval/tmp/ts-out.xm0jtn -1       130.67/0.00/0.02 bash -c python infloop.py
14   finished   /tmp/ts-out.HxBqkm   0        0.00/0.00/0.00 bash -c echo 11
15   finished   /tmp/ts-out.ERNuaE   0        0.00/0.00/0.00 bash -c echo 
16   finished   /tmp/ts-out.9j6hkS   0        0.00/0.00/0.00 bash -c echo $GPUID
17   finished   /tmp/ts-out.Y5QDNa   0        0.00/0.00/0.00 bash -c echo $GPUID
18   finished   /tmp/ts-out.EIHhoX   -1       0.00/0.00/0.00 %s
19   finished   /tmp/ts-out.LLw2Wl   -1       0.00/0.00/0.00 
20   finished   /tmp/ts-out.deWAJR   -1       0.01/0.00/0.00 echo $GPUID
21   finished   /tmp/ts-out.AdZFIf   -1       0.00/0.00/0.00 echo 12
22   finished   /tmp/ts-out.NBOCVv   0        0.00/0.00/0.00 echo 12
23   finished   /tmp/ts-out.5WpfPu   0        0.00/0.00/0.00 echo
24   finished   /tmp/ts-out.1lw4bS   -1       0.00/0.00/0.00 echo 
25   finished   /tmp/ts-out.7MNGLQ   0        0.00/0.00/0.00 bash -c echo $GPUID
26   finished   /tmp/ts-out.8FZ3on   0        0.00/0.00/0.00 bash -c echo $GPUID

My best attempt was:

from io import StringIO as sIO  # Python 3; on Python 2 this was `from StringIO import StringIO`
import pandas as pd
std = ... # the table text
pd.read_table(sIO(std), sep=r'\s+', engine='python')

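The error:
ValueError: Expected 7 fields in line 2, saw 9
The source code that generates the table is available. Below is the code that produces each row. Could this help with reading the table into a DataFrame?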

if (p->label)
    snprintf(line, maxlen, "%-4i %-10s %-20s %-8i %0.2f/%0.2f/%0.2f %s[%s]"
            "%s\n",
            p->jobid,
            jobstate,
            output_filename,
            p->result.errorlevel,
            p->result.real_ms,
            p->result.user_ms,
            p->result.system_ms,
            dependstr,
            p->label,
            p->command);
else
    snprintf(line, maxlen, "%-4i %-10s %-20s %-8i %0.2f/%0.2f/%0.2f %s%s\n",
            p->jobid,
            jobstate,
            output_filename,
            p->result.errorlevel,
            p->result.real_ms,
            p->result.user_ms,
            p->result.system_ms,
            dependstr,
            p->command);

qoefvg9y  #1

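It's a bit annoying, but because the delimiters in the output log are inconsistent (sometimes several spaces, sometimes tabs, and the last column is usually preceded by a single space), the file is hard to parse with pandas without some extra logic beforehand. Personally, I don't like opening the file in Python to fix it and then loading it with pandas, so I just add a short sed command to my pipeline before loading the file in Python (this is very simple if you are on Linux and the log text is loaded from a file). You can add: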

cat logfile.log | sed -r 's/\s\s+/,/g' | sed -e 's/\([[:digit:]]\.[[:digit:]]\{2\}\) /\1,/' > logfile.csv

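This replaces every run of two or more whitespace characters with a comma, and then replaces the one remaining problematic single space (the one after the times column). The text goes from: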

ID   State      Output               E-Level  Times(r/u/s)   Command [run=1/2]
6    running    /tmp/ts-out.FzVneG                           [l1]python infloop.py
0    finished   /tmp/ts-out.ixWHm2   0        0.00/0.00/0.00 bash -c echo 1
1    finished   /tmp/ts-out.ZzwS11   0        0.00/0.00/0.00 bash -c echo 1
2    finished   /tmp/ts-out.GJlyge   2        0.00/0.00/0.00 bash -c
4    finished   /tmp/ts-out.lIVMYH   2        0.00/0.00/0.00 bash -c -h
5    finished   /tmp/ts-out.8EKHy1   -1       141.23/0.00/0.00 python infloop.py
3    finished   /tmp/ts-out.lBr4Wy   -1       2545.36/0.00/0.02 bash -c python infloop.py
7    finished   /tmp/ts-out.kxCczi   2        0.01/0.00/0.00 bash -c
8    finished   /tmp/ts-out.3VkfNh   0        0.00/0.00/0.00 echo

to this:

ID,State,Output,E-Level,Times(r/u/s),Command [run=1/2]
6,running,/tmp/ts-out.FzVneG,[l1]python infloop.py
0,finished,/tmp/ts-out.ixWHm2,0,0.00/0.00/0.00,bash -c echo 1
1,finished,/tmp/ts-out.ZzwS11,0,0.00/0.00/0.00,bash -c echo 1
2,finished,/tmp/ts-out.GJlyge,2,0.00/0.00/0.00,bash -c
4,finished,/tmp/ts-out.lIVMYH,2,0.00/0.00/0.00,bash -c -h
5,finished,/tmp/ts-out.8EKHy1,-1,141.23/0.00/0.00,python infloop.py
3,finished,/tmp/ts-out.lBr4Wy,-1,2545.36/0.00/0.02,bash -c python infloop.py
7,finished,/tmp/ts-out.kxCczi,2,0.01/0.00/0.00,bash -c
8,finished,/tmp/ts-out.3VkfNh,0,0.00/0.00/0.00,echo

Then load it into pandas as a CSV:

import pandas as pd
my_df = pd.read_csv("logfile.csv")  # the file written by the sed pipeline above

Sorry that this isn't a fun pure-Python solution, but in my opinion the bash part makes the Python part much easier.
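For anyone who does want a pure-Python route, here is a minimal sketch that follows the snprintf format shown in the question instead of splitting on whitespace. The names `ROW_RE` and `parse_ts_rows`, and the `real`/`user`/`system` column names, are my own choices for illustration. It assumes every row looks like the sample (the output path contains no whitespace, and the E-Level and times fields are either both present or both missing, as in the running job), so treat it as a starting point rather than a complete parser for every task-spooler state.

```python
import re

# Mirrors "%-4i %-10s %-20s %-8i %0.2f/%0.2f/%0.2f %s%s" from the question.
# The E-Level/times pair is optional because the running job in the sample has neither.
ROW_RE = re.compile(
    r"^(?P<id>\d+)\s+"
    r"(?P<state>\S+)\s+"
    r"(?P<output>\S+)"  # assumption: no whitespace in the output path
    r"(?:\s+(?P<elevel>-?\d+)\s+(?P<times>[\d.]+/[\d.]+/[\d.]+))?"
    r"(?:\s+(?P<command>.*))?$"
)


def parse_ts_rows(text):
    """Return one dict per job row; the header and blank lines are skipped."""
    rows = []
    for line in text.splitlines():
        m = ROW_RE.match(line)
        if m is None:
            continue  # header line or anything that does not start with a job ID
        if m["times"] is None:
            real = user = system = None
        else:
            real, user, system = (float(t) for t in m["times"].split("/"))
        rows.append({
            "ID": int(m["id"]),
            "State": m["state"],
            "Output": m["output"],
            "E-Level": None if m["elevel"] is None else int(m["elevel"]),
            "real": real,
            "user": user,
            "system": system,
            # The command is everything left on the line, spaces included.
            "Command": (m["command"] or "").strip(),
        })
    return rows
```

With pandas installed, passing the result to `pd.DataFrame(parse_ts_rows(std))` should give one row per job, with the missing E-Level and times of the running job left empty. Because the command is captured as the rest of the line, commands containing spaces or commas do not shift the other columns.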
