Improve speed when parsing numeric data from big files using Python

Posted by jogvjijk on 2023-04-10 in Python

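I am reading a view factor file generated in Ansys (using VFOPT) and converting it to a 2D array in Python. I know that the final view factor matrix must be a 6982*6982 array.
The view factor file (viewfactor.db, ranging from a few MB to a few GB) is formatted as follows: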

Ansys Release 2020 R2          Build 20.2  Update 20200601  Format     0
RS3D

Number of Enclosures =        1

Enclosure Number =        1 Number of Surfaces =     6982
Element number =    28868 Face Number     2 TOTAL= 1.0000
  0.0000 0.0000 0.0000 0.0029 0.0056 0.0000 0.0000 0.0105 0.0000 0.0000
  0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000
  0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000
  (numbers numbers numbers)
  0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000
  0.0000 0.0000

Element number =    28869 Face Number     2 TOTAL= 1.0000
  0.0000 0.0000 0.0000 0.0029 0.0056 0.0000 0.0000 0.0105 0.0000 0.0000
  0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000
  etc etc

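So, for the first element (element 28868), the view factors are given (10 per line, the last line possibly fewer), followed by two lines that can be ignored, then the numbers for the next element, and so on.
I currently read the file as follows: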

import numpy as np

# the file reading part here is rather quick, it takes a couple seconds
with open('viewfactor.db', 'r') as f:
    lines = f.read().splitlines()

nfaces = 6982 # I already had this number from before
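# build each row as its own list; [[None] * nfaces] * nfaces would alias one row nfaces times
vf = [[None] * nfaces for _ in range(nfaces)]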
row = 0
col = 0

# this is the part that needs optimizing
for line in lines[7:]:
    if line == '': continue
    elif line.startswith('Element number'): # reset counters
        row += 1
        col = 0
    else: # save numbers
        nums = list(map(float, line.split()))
        vf[row][col:col+len(nums)] = nums
        col += len(nums)
vf = np.array(vf)

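I tested this with a 350 MB viewfactor file and it takes about 50 seconds (I can provide the file if needed).
Is there any way to reduce the execution time? Could I use some parallel-computing magic? Use C? Avoid loops altogether?
Thanks for any suggestions!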

python

Source: https://stackoverflow.com/questions/75964883/improve-speed-when-parsing-numeric-data-from-big-files-using-python

1 Answer

jjjwad0x1#

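I have something you can incorporate into your code. The function below takes any string of numbers (separated by spaces " " or newlines "\n") and converts it to a NumPy array. Don't feed it single lines; find all consecutive lines with numbers and pass them to it together.
In my benchmark, converting 100 lines of 10 numbers each with this function takes only about 6 times as long as converting a single line with np.array(list(map(float, test_string.split()))), so the speedup should be at least 10x.
Note 1: You need to convert the lines to bytes first, as in the example I provide. Maybe you can optimize this and work with bytes directly.
Note 2: The first run of the function will be slow because of compilation; I couldn't find the right signature for a bytes array. Maybe @JeromeRichard can help with that :) (Edit: added the signature and the fixes that Jerome suggested in the comments.)
Note 3: You know more about this particular file and may find other things to optimize (I think counting the spaces and newlines is unnecessary; you could supply the count to the function as an input).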
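Before the function itself, here is a minimal sketch of the "consecutive lines" advice combined with Note 1. It is not part of the original answer: the helper name `iter_numeric_blocks` is hypothetical, and it assumes the layout shown in the question (a few header lines, then one "Element number" line per element followed by that element's numeric lines, with single spaces between numbers). Reading the file in binary mode yields bytes directly, so no separate conversion step is needed.

```python
def iter_numeric_blocks(raw: bytes):
    """Yield one bytes block per 'Element number' section.

    Each block holds that element's numeric lines, stripped of leading and
    trailing whitespace, joined by newlines and terminated by a newline.
    """
    block = []
    in_data = False  # becomes True once the first element header is seen
    for line in raw.splitlines():
        line = line.strip()
        if line.startswith(b"Element number"):
            if block:
                yield b"\n".join(block) + b"\n"
            block = []
            in_data = True
        elif line and in_data:
            # non-empty line after an element header: assumed to be numbers
            block.append(line)
    if block:
        yield b"\n".join(block) + b"\n"


# Tiny stand-in for viewfactor.db (3 surfaces instead of 6982).
sample = (
    b"Ansys Release 2020 R2\n"
    b"RS3D\n"
    b"\n"
    b"Number of Enclosures =        1\n"
    b"\n"
    b"Enclosure Number =        1 Number of Surfaces =        3\n"
    b"Element number =    28868 Face Number     2 TOTAL= 1.0000\n"
    b"  0.5000 0.2500\n"
    b"  0.2500\n"
    b"\n"
    b"Element number =    28869 Face Number     2 TOTAL= 1.0000\n"
    b"  0.0000 1.0000\n"
    b"  0.0000\n"
)
blocks = list(iter_numeric_blocks(sample))

# For the real file, the bytes would come from binary mode instead:
# with open('viewfactor.db', 'rb') as f:
#     raw = f.read()
```

Stripping each line matters if the blocks are later handed to a parser that treats every space as the end of a number, because the numeric lines in the sample start with two spaces.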

import numba as nb
import numpy as np

@nb.jit('(Bytes(uint8, 1, "C"),)', nopython=True)
def numba_str_multiline(txt):
    # zero digit code = 48
    # space code = 32
    # new line code = 10
    # dot code = 46
    n_spaces_and_new_lines = 0
    for i in range(len(txt)):
        char = txt[i]
        if char == 32 or char == 10:
            n_spaces_and_new_lines += 1
    result = np.zeros(n_spaces_and_new_lines)

    current_number = 0.0
    idx_of_current_number = 0
    before_dot = True
    divisor = 10
    for i in range(len(txt)):
        char = txt[i]
        if char == 32 or char == 10:
            result[idx_of_current_number] = current_number
            idx_of_current_number += 1
            divisor = 10
            current_number = 0.0
            before_dot = True  # for cases if there is no dot at all
        elif char == 46:
            before_dot = False
        else:
            if before_dot:
                current_number = current_number * 10.0 + (char - 48)
            else:
                current_number += (char - 48) / divisor
                divisor *= 10

    return result

test_string = "1.2345 678.123400 0.0000 0.0029 0.0056 0.0000 0.0000 0.0105 0.0000 0.0000"
test_string_multiline = """1.2345 678.123400 0.0000 0.0029 0.0056 0.0000 0.0000 0.0105 0.0000 0.0000
""" * 100
test_bytes_multiline = bytes(test_string_multiline, 'utf-8')
result = numba_str_multiline(test_bytes_multiline)
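To connect this back to the question, the sketch below fills a preallocated nfaces x nfaces array, one row per element, which is one way to use the known size mentioned in Note 3. It is an illustration under assumptions, not the answerer's code: `fill_matrix` and `parse_block_numpy` are hypothetical names, the input is assumed to be already split into one bytes block per element, and a plain Python/NumPy parser stands in for the Numba function so the example runs without Numba installed.

```python
import numpy as np


def parse_block_numpy(block: bytes) -> np.ndarray:
    """Stand-in parser: split on whitespace and convert each token to float."""
    tokens = block.decode("ascii").split()
    return np.array([float(tok) for tok in tokens], dtype=np.float64)


def fill_matrix(blocks, nfaces, parse_block=parse_block_numpy):
    """Fill an (nfaces, nfaces) array with one parsed block per row."""
    vf = np.zeros((nfaces, nfaces), dtype=np.float64)
    n_rows = 0
    for block in blocks:
        if n_rows >= nfaces:
            raise ValueError(f"more than {nfaces} element blocks found")
        values = parse_block(block)
        if values.size != nfaces:
            raise ValueError(
                f"row {n_rows}: expected {nfaces} values, got {values.size}"
            )
        vf[n_rows, :] = values
        n_rows += 1
    if n_rows != nfaces:
        raise ValueError(f"expected {nfaces} element blocks, got {n_rows}")
    return vf


# Hand-made blocks for a 3-surface enclosure (one block per element).
example_blocks = [
    b"0.5000 0.2500\n0.2500\n",
    b"0.0000 1.0000\n0.0000\n",
    b"0.2500 0.2500\n0.5000\n",
]
vf = fill_matrix(example_blocks, 3)
```

Passing `parse_block=numba_str_multiline` should swap in the compiled parser. As written, that function counts every space or newline as the end of a number, so each block would need single-space separators, no leading spaces, and a trailing newline.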

Answered 2023-04-10
