tensorflow: Big SparseTensor constant and Datasets either blow up into a big graph or execute very slowly

Posted by yv5phkfx 5 months ago in Other

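System information

Have I written custom code (as opposed to using a stock example script provided in TensorFlow): Yes
OS platform and distribution (e.g., Linux Ubuntu 16.04): Debian Testing/Sid
TensorFlow installed from (source or binary): binary
TensorFlow version (use command below): 2.3.0
Python version: 3.8.5
Bazel version (if compiling from source):
GCC/Compiler version (if compiling from source):
CUDA/cuDNN version: 10.1.243/7.6.5
GPU model and memory: Quadro T2000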

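I loaded a sparse matrix via numpy.load, with shape [16777216 5416537] and 335548396 values (this is DrQA's TF-IDF matrix). I want to use this matrix in a computation over a dataset. I have tried different approaches: one is very slow, and the other hits the 2GB limit (my guess is that the optimizer is unfolding the constant matrix and overfilling the graph, but I am not sure, and I had no success testing with Grappler configurations either).
The slow version (I have 32 GB of RAM and 16 cores available) looks like this: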

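import tensorflow as tf

# load_parts_of_sparse_matrix_as_np_arrays, do_something and load_ds are
# stand-ins for code defined elsewhere.
indices, values, dense_shape = load_parts_of_sparse_matrix_as_np_arrays()

def f(x, indices, values, dense_shape):
  # The SparseTensor is rebuilt from its components for every element.
  m = tf.SparseTensor(indices, values, dense_shape=dense_shape)
  return do_something(x, m)

ds = load_ds()

# Pair every element with the (repeated) matrix components.
ds = tf.data.Dataset.zip((
  ds,
  tf.data.Dataset.from_tensors(indices).repeat(),
  tf.data.Dataset.from_tensors(values).repeat(),
  tf.data.Dataset.from_tensors(dense_shape).repeat(),
))

ds = ds.map(f)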

My guess is that memory is being copied on every iteration.
The following modification hits the 2GB limit:

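import tensorflow as tf

indices, values, dense_shape = load_parts_of_sparse_matrix_as_np_arrays()

# The SparseTensor is built once, outside the mapped function.
m = tf.SparseTensor(indices, values, dense_shape=dense_shape)

def f(x):
  # m is captured from the enclosing scope.
  return do_something(x, m)

ds = load_ds()

ds = ds.map(f)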

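This is a bit hard for me to understand. My only guess is that TF wraps the function passed to map in a tf.function, traces it, detects the constant, and somehow overfills the graph. I turned off constant_folding, without success.
So far, the only approach that has given me good results is to disable eager execution and use placeholders for the parts of the sparse tensor; any other way hits the 2GB limit. Unfortunately, I cannot control this when I use datasets.
I first tried asking for help on Stack Overflow, but have not received an answer so far.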
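For reference, the placeholder-based workaround mentioned above might look roughly like the sketch below. It is an illustration under assumptions, not the code actually used in the question: it builds an explicit tf.Graph instead of disabling eager execution globally, a tiny 3x3 matrix stands in for the real data, and tf.sparse.sparse_dense_matmul stands in for do_something.

```python
import numpy as np
import tensorflow as tf

# Tiny stand-in data (assumption); the real components come from the loaded matrix.
indices = np.array([[0, 0], [1, 2], [2, 1]], dtype=np.int64)
values = np.array([1.0, 2.0, 3.0], dtype=np.float32)
dense_shape = np.array([3, 3], dtype=np.int64)

graph = tf.Graph()
with graph.as_default():
    # Placeholders keep the matrix data out of the graph definition;
    # the arrays are supplied at run time through feed_dict.
    idx_ph = tf.compat.v1.placeholder(tf.int64, shape=[None, 2])
    val_ph = tf.compat.v1.placeholder(tf.float32, shape=[None])
    shape_ph = tf.compat.v1.placeholder(tf.int64, shape=[2])
    m = tf.SparseTensor(idx_ph, val_ph, dense_shape=shape_ph)

    x = tf.compat.v1.placeholder(tf.float32, shape=[3, None])
    y = tf.sparse.sparse_dense_matmul(m, x)

with tf.compat.v1.Session(graph=graph) as sess:
    result = sess.run(y, feed_dict={
        idx_ph: indices,
        val_ph: values,
        shape_ph: dense_shape,
        x: np.eye(3, dtype=np.float32),
    })

print(result)
```

Multiplying by the identity simply returns the dense form of the matrix, which makes the sketch easy to check by eye.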

tensorflow

Source: https://github.com/tensorflow/tensorflow/issues/46089

8 Answers

afdcj2ne1#

Just to add, I am using tf.sparse.sparse_dense_matmul to multiply a dense tensor by a sparse one. I was previously using a py_function, with the sparse representation of the matrix kept on the Python side with scipy.sparse. This was fast (between 10 and 20 iterations per second). I was expecting similar results in TF, but moving the constant matrix into the graph makes it roughly 80-160x slower (between 4 and 8 seconds per iteration).
I can still use the py_function in my application, but I guess there has to be a way to make TF faster than scipy for this task.
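A minimal sketch of what such a py_function baseline could look like, for readers who want to reproduce the comparison. The tiny 2x3 matrix, the helper name scipy_matmul, and the input dataset are assumptions made for illustration; the actual code from the thread is not shown.

```python
import numpy as np
import scipy.sparse as sp
import tensorflow as tf

# Tiny stand-in for the TF-IDF matrix (assumption), kept on the Python side as CSR.
matrix = sp.csr_matrix(np.array([[1.0, 0.0, 2.0],
                                 [0.0, 3.0, 0.0]], dtype=np.float32))

def _multiply(x):
    # Runs eagerly in Python: x is an EagerTensor, the result is a dense ndarray.
    return matrix.dot(x.numpy())

def scipy_matmul(x):
    # Hypothetical helper: wraps the scipy product so it can be used in Dataset.map.
    y = tf.py_function(_multiply, inp=[x], Tout=tf.float32)
    y.set_shape([matrix.shape[0]])  # py_function loses static shape information
    return y

ds = tf.data.Dataset.from_tensor_slices(np.eye(3, dtype=np.float32))
ds = ds.map(scipy_matmul)

results = [y.numpy().tolist() for y in ds]
print(results)
```

Because the matrix never becomes a tensor, nothing about it has to be serialized into a graph; the trade-off is that the multiplication runs in the Python interpreter.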

5 months ago

ifsvaxew2#

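Extra thoughts: I realized the slow version was even slower because I was running ds.map with the parameter num_parallel_calls=32. This caused a memory bottleneck. If I reduce 32 to 4, it runs faster (no obvious memory bottleneck) but is still very slow by comparison.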

5 months ago

fzsnzjdm3#

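Hi @jorgeecardona,

The second approach should work efficiently as long as the sparse indices/values tensors fit within the graph size limit. The size of the dense matrix represented by the SparseTensor should not matter, since only the indices/values/dense shape are serialized into the graph.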
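Whether the components fit can be estimated with simple arithmetic. The sketch below uses the number of values reported in the question; the float32 value dtype is an assumption, and the figures ignore any serialization overhead.

```python
# Back-of-the-envelope size of the SparseTensor components from the question.
NNZ = 335_548_396        # number of stored values reported in the question
PROTO_LIMIT = 2**31 - 1  # protobuf's 2GB cap in bytes (kint32max)

INDEX_BYTES = 2 * 8      # one [row, col] pair of int64 per stored value
VALUE_BYTES = 4          # assumption: float32 values

indices_bytes = NNZ * INDEX_BYTES
values_bytes = NNZ * VALUE_BYTES

def gib(num_bytes):
    """Convert a byte count to GiB."""
    return num_bytes / 2**30

print(f"indices: {gib(indices_bytes):.2f} GiB")
print(f"values:  {gib(values_bytes):.2f} GiB")
print("indices fit under the limit:", indices_bytes <= PROTO_LIMIT)
print("values fit under the limit: ", values_bytes <= PROTO_LIMIT)
```

Under these assumptions the int64 indices alone are well above 2GB, so embedding them in the graph as a constant would exceed the limit regardless of the value dtype.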

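If the indices/values are too large and exceed the 2GB graph size limit, I think you can work around it by supplying the matrix through a generator dataset: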

import tensorflow as tf

SIZE = 1000000
indices, values, dense_shape = [[i, i] for i in range(SIZE)], list(range(SIZE)), [SIZE, SIZE]
m = tf.SparseTensor(indices, values, dense_shape=dense_shape)

def gen():
  # Yield the matrix from Python rather than capturing it in the map function.
  yield m

def do_something(x, m):
  return m

ds = tf.data.Dataset.range(10).repeat()
# cache() comes before repeat() so that the generator output is reused.
ds_m = tf.data.Dataset.from_generator(gen, output_signature=tf.SparseTensorSpec.from_value(m)).cache().repeat()

ds = tf.data.Dataset.zip((ds, ds_m))
ds = ds.map(do_something)

5 months ago

4dbbbstv4#

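Hi @aaudiber, thanks for your answer. I updated to TF 2.4 (I could not use the output_signature argument in 2.3), but performance is still poor. The numbers are very similar to the first implementation, and with num_parallel_calls set to 16 the memory bottleneck feels similar too.
Could this be a poor implementation of the sparse multiplication (sparse_csr_matrix_ops.sparse_matrix_sparse_mat_mul)? Right now it is basically 10x slower than using py_function with scipy.sparse.
Best.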

5 months ago

bzzcjhmw5#

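Hmm, each repeat() of a generator dataset calls the generator function again, which could cause the sparse tensor to be copied. Could you try adding a .cache() transformation to the from_generator dataset before .repeat()? I have just edited the example above to insert .cache() in the right place.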
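The effect of placing .cache() before .repeat() can be observed by counting how often the generator function is invoked. This is a small sketch under assumptions: a short dense tensor stands in for the sparse matrix, and the counter and dataset names are made up for illustration. It needs TF 2.4 or later for output_signature.

```python
import tensorflow as tf

calls = {"n": 0}  # counts invocations of the generator function

def gen():
    calls["n"] += 1
    yield [1, 2, 3]  # small dense stand-in for the matrix (assumption)

spec = tf.TensorSpec(shape=[3], dtype=tf.int64)

# Without cache(): every repetition starts the generator again.
uncached = tf.data.Dataset.from_generator(gen, output_signature=spec).repeat()
for _ in uncached.take(5):
    pass
uncached_calls = calls["n"]

# With cache() before repeat(): later repetitions are served from the cache.
calls["n"] = 0
cached = tf.data.Dataset.from_generator(gen, output_signature=spec).cache().repeat()
for _ in cached.take(5):
    pass
cached_calls = calls["n"]

print("generator calls without cache:", uncached_calls)
print("generator calls with cache:   ", cached_calls)
```

This only shows how often the Python generator runs; it says nothing about how the cached element is stored or copied afterwards.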

5 months ago

q8l4jmvw6#

Hi, I did not see an improvement with caching. Here is part of the code, in case it helps:

import tensorflow as tf
from tensorflow.python.ops.linalg.sparse import sparse_csr_matrix_ops

def closest_docs_2(ds, indices, values, dense_shape, deterministic=True):

    b = tf.SparseTensor(indices, values, dense_shape=dense_shape)

    def b_gen():
        yield b

    def f(x, b):

        # Treat the query as a single-row sparse matrix.
        x = tf.sparse.reshape(x, [1, -1])

        x_sm = sparse_csr_matrix_ops.sparse_tensor_to_csr_sparse_matrix(
            x.indices, x.values, x.dense_shape)

        b_sm = sparse_csr_matrix_ops.sparse_tensor_to_csr_sparse_matrix(
            b.indices, b.values, b.dense_shape)

        c_sm = sparse_csr_matrix_ops.sparse_matrix_sparse_mat_mul(
            a=x_sm, b=b_sm, type=tf.float32)

        c = sparse_csr_matrix_ops.csr_sparse_matrix_to_sparse_tensor(
            c_sm, tf.float32
        )

        # Column indices of the product, ordered by descending value.
        return tf.gather(
            c.indices[:,1], tf.argsort(c.values, direction='DESCENDING')
        )

    ds = tf.data.Dataset.zip((
        ds,
        tf.data.Dataset.from_generator(
            b_gen, output_signature=tf.SparseTensorSpec.from_value(b)
        ).cache().repeat()
    ))

    # Find the closest documents.
    return ds.map(f, num_parallel_calls=8, deterministic=deterministic)

The code using scipy is not different in spirit.

5 months ago

s1ag04yj7#

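Hi,

I am still running into this problem. I was able to multiply some sparse matrices with fewer values (like the one tested before) and got better performance in TensorFlow (CPU and GPU) than with scipy. But when I try a similar matrix (with the same number of values as the one originally tested), I hit the protobuf limit: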

[libprotobuf FATAL external/com_google_protobuf/src/google/protobuf/wire_format_lite.cc:504] CHECK failed: (value.size()) <= (kint32max): 
Segmentation fault

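I am now using TensorFlow 2.4.1.
It still works if the SparseTensor with the matrix is returned on every iteration of dataset.map, but that is very slow. If I create the sparse tensor before dataset.map, I hit the limit (I think map traces the function and injects the eager sparse tensor into the graph).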

5 months ago

wqsoz72f8#

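Hi,
Thank you for opening this issue. Since it has been open for a long time, the code and debug information in it may no longer be relevant to the current state of the code base.
The TensorFlow team is constantly improving the framework by fixing bugs and adding new features. We suggest you try the latest TensorFlow version with the latest compatible hardware configuration, which could potentially resolve the issue. If you are still facing the issue, please create a new GitHub issue with your latest findings and all the debugging information that could help us investigate.
Please follow the release notes to stay up to date with the latest developments in the TensorFlow space.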

5 months ago
