How do I parse numeric tables from a text file using templates in Python?

Posted by h5qlskok on 2022-12-10 in Python


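I want to extract a series of tables from a text file. The file looks like the sample below. The table headers follow a regular pattern, and each table ends with a blank line. Ultimately I want the tables in NumPy arrays, but once I can isolate the rows of numeric data, converting them to arrays is easy.
Contents of example.txt: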

lines to ignore

Table AAA

   -  ----
   1  3.5
   3  6.8
  55  9.933

more lines to ignore
more lines to ignore

Table BBB

   -  ----
   2  5.0
   5  6.8
  99  9.933

even more lines to ignore

(Edit: added leading spaces before the rows in the columns)
From this, I want a list similar to:

[ 
   { 'id' : 'AAA', 'data' : [[1,3.5],[3,6.8],[5,9.933]]},
   { 'id' : 'BBB', 'data' : [[2,5.0],[5,6.8],[99,9.933]]},
]

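I have written plenty of one-off parsers for this kind of thing, but I would like to do it with templates, based on what I have seen in the ttp Python package. Unfortunately for me, that package seems to focus on network configuration files, so none of its examples come close to what I want to do.
If there is a better Python package to use, I am open to suggestions.
Here is my first attempt: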

import ttp

template = """
<group name="table data" method="table">

Table {{ tab_name }}
{{ x1 | ROW }}

</group>
"""

# Read the whole file and close it when done
with open('example.txt') as f:
    lines = f.read()

parser = ttp.ttp(data=lines, template=template)
parser.parse()

res = parser.result()
print(res)

But this neither separates the tables nor ignores the scattered lines of text.

In [11]: res
Out[11]:
[[{'table data': [{'x1': 'lines to ignore'},
    {'tab_name': 'AAA'},
    {'x1': '-  ----'},
    {'x1': '1  3.5'},
    {'x1': '3  6.8'},
    {'x1': '5  9.933'},
    {'x1': 'more lines to ignore'},
    {'x1': 'more lines to ignore'},
    {'tab_name': 'BBB'},
    {'x1': '-  ----'},
    {'x1': '2  5.0'},
    {'x1': '5  6.8'},
    {'x1': '99  9.933'},
    {'x1': 'even more lines to ignore'}]}]]


Source: https://stackoverflow.com/questions/74732334/how-do-i-parse-numeric-tables-from-a-text-file-using-templates-in-python

2 Answers


uurity8g1#

You don't need to find a package for this; you can use regular expressions:

import re

def isolate_tables(text: str) -> list:
    tables = []

    lines = iter(line.strip() for line in text.split("\n"))

    while True:
        try:
            match_table_name = None
            while match_table_name is None:
                match_table_name = re.match(r"Table\s+(.+)$", next(lines))

            table_name, = match_table_name.groups()
            table_data = []

            tables.append((table_name, table_data))

            match_header = None
            while match_header is None:
                match_header = re.match(r"^[-\s]+$", next(lines))

            match_data_line = True
            while match_data_line:
                match_data_line = re.split(r"\s+", next(lines))
                if len(match_data_line) > 1:
                    table_data.append(match_data_line)
                else:
                    match_data_line = False
        
        except StopIteration:
            break

    return tables

# example holds the text of example.txt
with open("example.txt") as f:
    example = f.read()

isolate_tables(example)
# [('AAA', [['1', '3.5'], ['3', '6.8'], ['5', '9.933']]), ('BBB', [['2', '5.0'], ['5', '6.8'], ['99', '9.933']])]

I'll leave it to you to adapt the output to your needs.
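For what it's worth, here is a minimal sketch of that last step, assuming the first column is always an integer and the second a float, as in the sample file. The helper name to_records is made up for illustration, and the sample input is shortened; it only mirrors the shape that isolate_tables returns.

```python
def to_records(tables):
    """Turn [(name, [[str, str], ...]), ...] into a list of {'id', 'data'} dicts."""
    records = []
    for name, rows in tables:
        # Assumption: column 1 is an integer, column 2 is a float.
        data = [[int(first), float(second)] for first, second in rows]
        records.append({"id": name, "data": data})
    return records


# Shortened sample in the same shape as the isolate_tables result
tables = [
    ("AAA", [["1", "3.5"], ["3", "6.8"]]),
    ("BBB", [["2", "5.0"], ["99", "9.933"]]),
]
records = to_records(tables)
print(records)
```

Each entry's 'data' list can then be passed to numpy.array if an array is what you need in the end.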

Answered 2022-12-10

zqdjd7g92#

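Hope this helps a little.
Here I wrote a Python function as a TTP macro. It checks the first character of the line: if it is a digit, the line is kept and split into fields; otherwise the line can be ignored. Returning False from the macro made 'AAA' disappear as well, so the macro returns '' instead, and exclude_val then drops the value when it is empty.
For further cleanup, you can also use TTP's built-in functions.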
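The row filter can also be tried on its own, outside TTP. This is only a sketch of the same check_number logic in plain Python; it assumes the matched text reaches the macro with leading whitespace already stripped, which is how the x1 values appear in the question's output. The samples list is made up for illustration.

```python
def check_number(data):
    # Keep rows whose first character is a digit and split them into fields.
    if data[:1].isdigit():
        return data.split()
    # Anything else (header rules, prose, empty strings) becomes ''.
    return ""


samples = ["1  3.5", "99  9.933", "-  ----", "lines to ignore", ""]
for line in samples:
    print(repr(line), "->", repr(check_number(line)))
```

Note that a row that still carries leading spaces would be rejected, because its first character is a space rather than a digit.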

Updated

import ttp

template = """
<macro>
def check_number(data):
    if data[:1].isdigit():
        return data.split()
    else:
        return ''

</macro>

<vars>
   if_D_empty=''
</vars>

<group name="TABLE.{{id}}" method="table" exclude_val="Data, if_D_empty">

Table {{ id}}

{{Data | ROW | macro("check_number")}}

</group>
"""

# Read the whole file and close it when done
with open('t1.txt') as f:
    lines = f.read()

parser = ttp.ttp(data=lines, template=template)
parser.parse()

res = parser.result(format='json')[0]
print(res)

Output:

Also, to get the output you are after, I think you will need to find a way to give your tables tags.