Issue type
Bug
Have you reproduced the bug with TensorFlow Nightly?
Yes
Source
binary
TensorFlow version
TF 2.14
Custom code
No
OS platform and distribution
Linux Ubuntu 22.04
Mobile device
No response
Python version
3.10
Bazel version
No response
GCC/compiler version
No response
CUDA/cuDNN version
No response
GPU model and memory
No response
Current behavior?
On a machine with roughly 32 GB of RAM (e.g., AWS c7g.4xl), MLPerf ResNet50 offline inference fails with an out-of-memory kill on the TF 2.14 and nightly wheels. The same benchmark runs fine on TF 2.13. I have traced the root cause of this issue to the commit below, which introduced an inter-op scheduler to improve the performance of models with parallel ops. Although it improves MLPerf ResNet50 batch-mode performance on r7g.16xl by 15%, it also increases the memory footprint 2.5x (from 25 GB to 67 GB).
commit d0cb12441747ef9fb14137cb99f0b6a17e22b5e4
Author: David Svantesson <david.svantesson@arm.com>
Date: Tue Jul 25 09:33:40 2023 -0700
PR #61235: Add inter scheduler support on AArch64
Imported from GitHub PR https://github.com/tensorflow/tensorflow/pull/61235
This PR adds support for inter op scheduler in the oneDNN + ACL build. It enables the creation of more than 1 scheduler inside ACL to increase performance of models with parallel ops.
For the benchmarked NLP models the average performance increase is 9%; for CV classification models it's around 2%.
The benchmarks below were done with the following PRs applied as patches:
#60026, #60723, #61110, #61114, #61093, #61123
We need to either reduce the memory footprint or make the maximum limit settable at runtime, similar to the LRU cache capacity.
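One possible interim mitigation, under an unverified assumption on my part (that the extra footprint scales with the number of ACL scheduler instances created), is to cap inter-op parallelism through TensorFlow's standard threading environment variable before launching the benchmark:

```shell
# Assumption: memory grows with the number of inter-op scheduler instances,
# so capping inter-op parallelism may also bound the footprint.
# TF_NUM_INTEROP_THREADS is read once at TensorFlow startup, so it must be
# exported before the benchmark process starts.
export TF_NUM_INTEROP_THREADS=1
echo "TF inter-op threads capped at: $TF_NUM_INTEROP_THREADS"
```

If the footprint drops back toward the TF 2.13 numbers with this set, that would support the scaling hypothesis; it would also likely give up the 15% batch-mode gain.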
Standalone code to reproduce the issue
# install the MLCommons inference repo
cd $HOME
git clone https://github.com/mlcommons/inference.git
cd inference
git checkout v2.0
cd inference/loadgen
CFLAGS="-std=c++14" python3 setup.py bdist_wheel
pip3 install dist/*.whl
# download the resnet50 model and the dataset
wget https://zenodo.org/record/2535873/files/resnet50_v1.pb
ck pull repo:ck-env
echo 0 | ck install package --tags=image-classification,dataset,imagenet,aux
echo 1 | ck install package --tags=image-classification,dataset,imagenet,val
cp /CK-TOOLS/dataset-imagenet-ilsvrc2012-aux-from.berkeley/val.txt \
/CK-TOOLS/dataset-imagenet-ilsvrc2012-val-min/val_map.txt
# Run resnet50 inference in offline mode
export DATA_DIR=/CK-TOOLS/dataset-imagenet-ilsvrc2012-val-min
export MODEL_DIR=$HOME/
cd $HOME/inference/vision/classification_and_detection
./run_local.sh tf resnet50 cpu --scenario=Offline
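To quantify the 25 GB vs 67 GB footprint difference between wheels, the run can be wrapped with a peak-RSS probe. A minimal sketch (the bytearray allocation is only a stand-in; substitute the actual python/main.py invocation from run_local.sh):

```shell
# Sketch: report peak resident set size of a Python workload, useful for
# comparing the TF 2.13 vs TF 2.14 footprint on the same machine.
python3 - <<'EOF'
import resource
buf = bytearray(200 * 1024 * 1024)  # stand-in workload: allocate ~200 MB
# ru_maxrss is reported in kilobytes on Linux
peak_kb = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
print(f"peak RSS: {peak_kb // 1024} MB")
EOF
```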
Relevant log output
INFO:main:starting TestScenario.Offline
./run_local.sh: line 13: 50519 Killed python python/main.py --profile $profile $common_opt --model $model_path $dataset --output $OUTPUT_DIR $EXTRA_OPS $@
1 answer
AWS c7g is an Arm-based CPU, so it may not be using oneDNN (TensorFlow-MKL).