Efficiently reading a digit matrix without delimiters in Python

Posted by kcugc4gi on 2021-07-13 in Java


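I have a file containing a matrix of digits [0-9] with no delimiter, of shape (n, m), where n and m are each about 50,000. For example, a small version of the matrix file, mat.txt, is:
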
0012230012000
0012230002300
0012230004200

Right now I am using the code below, but I am not very happy with its speed.

import numpy as np

def read_int_mat(path):
    """
    Read a matrix of integer with [0-9], and with no delimiter.
    """
    with open(path) as f:
        # Convert each character of each line to an int, one row per line.
        mat = np.array(
            [np.array([int(c) for c in line.strip()]) for line in f.readlines()],
            dtype=np.int8,
        )
    return mat

Edit: here is a small benchmark

import numpy as np

def read_int_mat(path):
    """
    Read a matrix of integer with [0-9], and with no delimiter.
    """
    with open(path) as f:
        mat = np.array(
            [np.array([int(c) for c in line.strip()]) for line in f.readlines()],
            dtype=np.int8,
        )
    return mat

%timeit read_int_mat("mat.txt")
%timeit np.genfromtxt("mat.txt", delimiter=1, dtype="int8")

print(read_int_mat("mat.txt"))
print(np.genfromtxt("mat.txt", delimiter=1, dtype="int8"))

The output is:

61.6 µs ± 1.32 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
327 µs ± 4.95 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
[[0 0 1 2 2 3 0 0 1 2 0 0 0]
 [0 0 1 2 2 3 0 0 0 2 3 0 0]
 [0 0 1 2 2 3 0 0 0 4 2 0 0]]
[[0 0 1 2 2 3 0 0 1 2 0 0 0]
 [0 0 1 2 2 3 0 0 0 2 3 0 0]
 [0 0 1 2 2 3 0 0 0 4 2 0 0]]

Is there anything I can try to speed this up? Would Cython help? Thanks a lot.

python performance Matrix

Source: https://stackoverflow.com/questions/67289780/efficiently-reading-digit-matrix-without-delimiter-in-python

1 Answer


jw5wzhpr1#

You can use np.genfromtxt, for example:
File (13 columns):

0012230012000
0012230002300
0012230004200

Then:

x = np.genfromtxt("file.txt", delimiter=1, dtype="int8")
print(x)

Prints:

[[0 0 1 2 2 3 0 0 1 2 0 0 0]
 [0 0 1 2 2 3 0 0 0 2 3 0 0]
 [0 0 1 2 2 3 0 0 0 4 2 0 0]]
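
The call above works because np.genfromtxt treats an integer delimiter as a fixed field width rather than as a separator character, so delimiter=1 makes every single character its own column. A minimal sketch of that behavior, using an in-memory string in place of a file (the sample data here is made up for illustration):

```python
import io

import numpy as np

# Made-up sample: two rows of four digits, no separators.
sample = "0012\n0034\n"

# delimiter=1 is interpreted as "each field is one character wide".
mat = np.genfromtxt(io.StringIO(sample), delimiter=1, dtype="int8")

print(mat.shape, mat.dtype)
print(mat)
```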

Edit: a version using np.fromiter that opens the file in binary mode:

def read_npfromiter(path):
    with open(path, "rb") as f:
        return np.array(
            [np.fromiter((chr(c) for c in l.strip()), dtype="int8") for l in f],
        )
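
In binary mode each line is a bytes object, and iterating over it yields the ASCII code of each character (48 for "0" through 57 for "9"). That suggests a variation that is not part of the original answer: skip the per-character Python loop and let NumPy view the bytes directly. The sketch below assumes every row has the same length, the file contains only ASCII digits and line breaks, and the whole file fits in memory. The helper name read_digits_frombuffer is hypothetical, and no timing is claimed for it.

```python
import numpy as np

def read_digits_frombuffer(path):
    """Hypothetical helper: read a delimiter-free digit matrix from raw bytes."""
    with open(path, "rb") as f:
        raw = f.read()
    # Split into rows and drop empty lines (for example a trailing blank line).
    rows = [line for line in raw.splitlines() if line]
    n_cols = len(rows[0])
    # View the concatenated digit bytes as unsigned 8-bit integers.
    flat = np.frombuffer(b"".join(rows), dtype=np.uint8)
    # ASCII "0" is 48, so subtracting it maps b"0".."9" to 0..9.
    # reshape raises ValueError if the rows do not all have n_cols characters.
    return (flat - ord("0")).astype(np.int8).reshape(len(rows), n_cols)
```

It is called the same way as the other readers, for example read_digits_frombuffer("file.txt"), and should be timed on the real data before being preferred over them.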

Benchmark with a file of shape (168, 9360):

from timeit import timeit

import numpy as np

def read_int_mat(path):
    """
    Read a matrix of integer with [0-9], and with no delimiter.
    """
    with open(path, "r") as f:
        mat = np.array(
            [
                np.array([int(c) for c in line.strip()])
                for line in f.readlines()
            ],
            dtype=np.int8,
        )
    return mat

def read_npfromiter(path):
    with open(path, "rb") as f:
        return np.array(
            [np.fromiter((chr(c) for c in l.strip()), dtype="int8") for l in f],
        )

def f1(f):
    return np.genfromtxt(
        f, delimiter=1, dtype="int8", autostrip=False, encoding="ascii"
    )

def f2(f):
    return read_int_mat(f)

def f3(f):
    return read_npfromiter(f)

t1 = timeit(lambda: f1("file.txt"), number=1)
t2 = timeit(lambda: f2("file.txt"), number=1)
t3 = timeit(lambda: f3("file.txt"), number=1)

print(t1)
print(t2)
print(t3)

Results (seconds, in the order f1, f2, f3):

1.0680423599551432
0.28135157003998756
0.19099885696778074
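
The benchmark file itself is not included in the answer. To rerun the comparison, a file of the same shape can be generated with random digits. This is only a sketch: the helper name write_digit_matrix and the seed are assumptions, and timings on random data or on another machine will differ from the numbers above.

```python
import numpy as np

def write_digit_matrix(path, n_rows, n_cols, seed=0):
    """Hypothetical helper: write random digits 0-9, one row per line, no delimiter."""
    rng = np.random.default_rng(seed)
    digits = rng.integers(0, 10, size=(n_rows, n_cols), dtype=np.uint8)
    with open(path, "wb") as f:
        # Adding ord("0") turns the values 0..9 into the ASCII codes of "0".."9".
        for row in digits + ord("0"):
            f.write(row.tobytes() + b"\n")
    return digits
```

Calling write_digit_matrix("file.txt", 168, 9360) produces a file with the shape used in the benchmark. The returned array can be compared against each reader's output to check correctness.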

