I am trying to run a deep learning program on an NVIDIA board (Jetson TX2). The machine has about 8 GB of RAM and a 128 GB swap partition.
It runs Ubuntu 18.04.
$ lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description: Ubuntu 18.04.5 LTS
Release: 18.04
Codename: bionic
$ uname -a
Linux nvidia 4.9.201-tegra #1 SMP PREEMPT Fri Jan 15 14:54:23 PST 2021 aarch64 aarch64 aarch64 GNU/Linux
About 8 + 128 GB of memory (RAM plus swap) is available on the machine.
$ free -h
total used free shared buff/cache available
Mem: 7.7G 1.7G 5.7G 3.6M 266M 6.7G
Swap: 125G 384M 125G
The virtual memory limit is also set to unlimited.
$ ulimit -a
core file size (blocks, -c) 0
data seg size (kbytes, -d) unlimited
scheduling priority (-e) 0
file size (blocks, -f) unlimited
pending signals (-i) 28396
max locked memory (kbytes, -l) 65536
max memory size (kbytes, -m) unlimited
open files (-n) 1024
pipe size (512 bytes, -p) 8
POSIX message queues (bytes, -q) 819200
real-time priority (-r) 0
stack size (kbytes, -s) 8192
cpu time (seconds, -t) unlimited
max user processes (-u) 28396
virtual memory (kbytes, -v) unlimited
file locks (-x) unlimited
However, when I run the deep learning program, it triggers an OOM error and gets killed.
$ ./build/testbed --mode nerf --scene data/nerf/fox/
15:28:56 INFO Loading NeRF dataset from
15:28:56 INFO data/nerf/fox/transforms.json
15:28:57 SUCCESS Loaded 50 images after 0s
15:28:57 INFO cam_aabb=[min=[1.0229,-1.33309,-0.378748], max=[2.46175,1.00721,1.41295]]
15:28:59 INFO Loading network config from: configs/nerf/base.json
15:28:59 INFO GridEncoding: Nmin=16 b=1.51572 F=2 T=2^19 L=16
Warning: FullyFusedMLP is not supported for the selected architecture 62. Falling back to CutlassMLP. For maximum performance, raise the target GPU architecture to 75+.
Warning: FullyFusedMLP is not supported for the selected architecture 62. Falling back to CutlassMLP. For maximum performance, raise the target GPU architecture to 75+.
Warning: FullyFusedMLP is not supported for the selected architecture 62. Falling back to CutlassMLP. For maximum performance, raise the target GPU architecture to 75+.
15:28:59 INFO Density model: 3--[HashGrid]-->32--[FullyFusedMLP(neurons=64,layers=3)]-->1
15:28:59 INFO Color model: 3--[Composite]-->16+16--[FullyFusedMLP(neurons=64,layers=4)]-->3
15:28:59 INFO total_encoding_params=13074912 total_network_params=9728
GPUMemoryArena: Warning: GPU 0 does not support virtual memory. Falling back to regular allocations, which will be larger and can cause occasional stutter.
Killed
Here are the system log messages:
[20539.993902] testbed invoked oom-killer: gfp_mask=0x24082c2(GFP_KERNEL|__GFP_HIGHMEM|__GFP_NOWARN|__GFP_ZERO), nodemask=0, order=0, oom_score_adj=0
[20540.007325] testbed cpuset=/ mems_allowed=0
[20540.007336] CPU: 0 PID: 10005 Comm: testbed Not tainted 4.9.201-tegra #1
[20540.007338] Hardware name: quill (DT)
[20540.007340] Call trace:
[20540.007348] [<ffffff800808b9f8>] dump_backtrace+0x0/0x198
[20540.007352] [<ffffff800808bfbc>] show_stack+0x24/0x30
[20540.007356] [<ffffff800845abe8>] dump_stack+0xa0/0xc8
[20540.007361] [<ffffff8008257c54>] dump_header+0x6c/0x1b8
[20540.007366] [<ffffff80081c843c>] oom_kill_process+0x29c/0x4c8
[20540.007369] [<ffffff80081c8b14>] out_of_memory+0x1e4/0x308
[20540.007372] [<ffffff80081ce608>] __alloc_pages_nodemask+0x810/0xcb8
[20540.007376] [<ffffff800857d884>] nvmap_alloc_pages_exact+0x54/0xe8
[20540.007378] [<ffffff800857ecd8>] nvmap_alloc_handle+0xad0/0xfd0
[20540.007382] [<ffffff800858b104>] nvmap_ioctl_alloc+0xdc/0x118
[20540.007384] [<ffffff80085854d4>] nvmap_ioctl+0xc4/0x5f0
[20540.007387] [<ffffff8008271f58>] do_vfs_ioctl+0xb0/0x8d8
[20540.007389] [<ffffff800827280c>] SyS_ioctl+0x8c/0xa8
[20540.007392] [<ffffff8008083900>] el0_svc_naked+0x34/0x38
[20540.007418] Mem-Info:
[20540.007426] active_anon:669 inactive_anon:671 isolated_anon:0
active_file:458 inactive_file:353 isolated_file:0
unevictable:3880 dirty:0 writeback:0 unstable:0
slab_reclaimable:10453 slab_unreclaimable:18311
mapped:304 shmem:1 pagetables:2384 bounce:0
free:165587 free_pcp:211 free_cma:150521
[20540.007431] Node 0 active_anon:2676kB inactive_anon:2684kB active_file:1832kB inactive_file:1412kB unevictable:15520kB isolated(anon):0kB isolated(file):0kB mapped:1216kB dirty:0kB writeback:0kB shmem:4kB shmem_thp: 0kB shmem_pmdmapped: 0kB anon_thp: 0kB writeback_tmp:0kB unstable:0kB pages_scanned:524 all_unreclaimable? no
[20540.007438] DMA free:628456kB min:11484kB low:14352kB high:17220kB active_anon:1324kB inactive_anon:292kB active_file:1624kB inactive_file:1236kB unevictable:0kB writepending:0kB present:2086900kB managed:2060924kB mlocked:0kB slab_reclaimable:4kB slab_unreclaimable:240kB kernel_stack:96kB pagetables:12kB bounce:0kB free_pcp:704kB local_pcp:0kB free_cma:602084kB
[20540.007439] lowmem_reserve[]: 0 5837 5837 5837
[20540.007450] Normal free:33892kB min:33568kB low:41960kB high:50352kB active_anon:1428kB inactive_anon:2288kB active_file:236kB inactive_file:224kB unevictable:15520kB writepending:0kB present:6125568kB managed:5977956kB mlocked:16kB slab_reclaimable:41808kB slab_unreclaimable:73004kB kernel_stack:8048kB pagetables:9524kB bounce:0kB free_pcp:140kB local_pcp:0kB free_cma:0kB
[20540.007451] lowmem_reserve[]: 0 0 0 0
[20540.007458] DMA: 154*4kB (UC) 94*8kB (UMC) 71*16kB (MC) 37*32kB (UMC) 11*64kB (MC) 5*128kB (UMC) 4*256kB (C) 2*512kB (UC) 1*1024kB (C) 1*2048kB (C) 151*4096kB (UC) = 628648kB
[20540.007488] Normal: 279*4kB (UME) 167*8kB (UME) 153*16kB (UM) 121*32kB (UM) 68*64kB (UM) 21*128kB (M) 72*256kB (UM) 0*512kB 0*1024kB 0*2048kB 0*4096kB = 34244kB
[20540.007513] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
[20540.007514] 5320 total pagecache pages
[20540.007517] 571 pages in swap cache
[20540.007519] Swap cache stats: add 291106, delete 290668, find 14642/77423
[20540.007521] Free swap = 131534904kB
[20540.007522] Total swap = 132019420kB
[20540.007524] 2053117 pages RAM
[20540.007525] 0 pages HighMem/MovableOnly
[20540.007527] 43397 pages reserved
[20540.007528] 188416 pages cma reserved
[20540.007529] [ pid ] uid tgid total_vm rss nr_ptes nr_pmds swapents oom_score_adj name
[20540.007555] [ 2204] 0 2204 4576 8 12 4 235 0 systemd-journal
[20540.007562] [ 2833] 0 2833 3909 9 7 3 406 -1000 systemd-udevd
[20540.007569] [ 4637] 62583 4637 20992 0 11 4 170 0 systemd-timesyn
[20540.007572] [ 4662] 0 4662 1920 2 8 3 798 0 haveged
[20540.007576] [ 4664] 0 4664 1602 0 7 3 131 0 rpcbind
[20540.007579] [ 4670] 102 4670 2612 0 9 4 173 0 systemd-resolve
[20540.007582] [ 4871] 0 4871 59300 0 18 5 319 0 accounts-daemon
[20540.007585] [ 4902] 0 4902 78223 0 20 4 428 0 ModemManager
[20540.007589] [ 4939] 114 4939 1541 0 7 3 146 0 avahi-daemon
[20540.007594] [ 4968] 109 4968 54993 8 12 3 863 0 rsyslogd
[20540.007599] [ 5005] 103 5005 2065 6 7 3 451 -900 dbus-daemon
[20540.007603] [ 5144] 114 5144 1491 0 7 3 84 0 avahi-daemon
[20540.007607] [ 5176] 0 5176 2418 0 8 4 244 0 wpa_supplicant
[20540.007611] [ 5177] 0 5177 101135 0 31 5 1004 0 NetworkManager
[20540.007615] [ 5191] 0 5191 97825 0 23 4 1180 0 udisksd
[20540.007620] [ 5207] 0 5207 1660 2 8 4 70 0 cron
[20540.007624] [ 5215] 0 5215 2638 0 10 4 208 0 systemd-logind
[20540.007628] [ 5367] 0 5367 59266 2 18 3 753 0 polkitd
[20540.007635] [ 5632] 111 5632 62056 0 23 5 517 0 whoopsie
[20540.007640] [ 5682] 0 5682 269941 0 58 7 5052 0 containerd
[20540.007643] [ 5715] 105 5715 2372 0 8 4 118 0 kerneloops
[20540.007647] [ 5755] 0 5755 1318 0 6 3 33 0 agetty
[20540.007651] [ 5763] 105 5763 2372 0 8 4 117 0 kerneloops
[20540.007654] [ 5792] 0 5792 2604 0 10 4 192 -1000 sshd
[20540.007657] [ 5828] 0 5828 1318 0 6 4 36 0 agetty
[20540.007662] [ 5829] 0 5829 1654 2 8 3 77 0 nvmemwarning.sh
[20540.007666] [ 5834] 0 5834 5375 0 7 3 68 0 nvs-service
[20540.007671] [ 5929] 0 5929 21197 0 29 3 1253 0 nvargus-daemon
[20540.007674] [ 6027] 0 6027 58903 0 18 5 376 0 gdm3
[20540.007679] [ 6240] 0 6240 786 0 4 3 37 0 nvphsd
[20540.007683] [ 6266] 0 6266 3045 0 6 3 88 0 nvphsd
[20540.007687] [ 6306] 0 6306 40899 0 17 3 376 0 gdm-session-wor
[20540.007691] [ 6362] 1000 6362 3344 2 10 4 525 0 systemd
[20540.007695] [ 6381] 1000 6381 4100 0 12 3 758 0 (sd-pam)
[20540.007699] [ 6403] 1000 6403 58867 0 16 4 279 0 gnome-keyring-d
[20540.007704] [ 6407] 1000 6407 39901 2 13 3 168 0 gdm-x-session
[20540.007707] [ 6409] 1000 6409 6350378 530 48 5 3650 0 Xorg
[20540.007711] [ 6415] 1000 6415 1903 5 10 3 363 0 dbus-daemon
[20540.007715] [ 6418] 1000 6418 568 1 5 3 34 0 run-systemd-ses
[20540.007719] [ 6495] 1000 6495 58783 0 16 3 783 0 gvfsd
[20540.007723] [ 6500] 1000 6500 94998 0 19 5 795 0 gvfsd-fuse
[20540.007728] [ 6505] 1000 6505 39575 0 12 3 159 0 gvfsd-metadata
[20540.007732] [ 6556] 1000 6556 1045 3 6 4 80 0 ssh-agent
[20540.007735] [ 6585] 0 6585 2089 2 7 3 317 0 dhclient
[20540.007739] [ 6669] 1000 6669 3206 0 10 4 173 0 systemctl
[20540.007742] [ 6680] 1000 6680 203418 208 59 4 2639 0 unity-settings-
[20540.007745] [ 6681] 1000 6681 135849 0 54 4 2882 0 indicator-keybo
[20540.007748] [ 6682] 1000 6682 72068 0 44 4 2588 0 bamfdaemon
[20540.007753] [ 6683] 1000 6683 77439 0 18 4 308 0 indicator-power
[20540.007757] [ 6684] 1000 6684 62377 0 22 4 900 0 indicator-appli
[20540.007762] [ 6685] 1000 6685 266628 0 60 5 1204 0 indicator-datet
[20540.007766] [ 6689] 1000 6689 114566 0 24 4 896 0 indicator-sessi
[20540.007769] [ 6690] 1000 6690 77338 0 18 3 266 0 indicator-bluet
[20540.007773] [ 6691] 1000 6691 78207 0 21 5 440 0 indicator-messa
[20540.007777] [ 6692] 1000 6692 228933 0 26 4 530 0 indicator-sound
[20540.007781] [ 6694] 1000 6694 119581 0 32 4 733 0 gnome-session-b
[20540.007785] [ 6727] 0 6727 1777 0 7 4 105 0 bluetoothd
[20540.007789] [ 6773] 1000 6773 302887 0 36 4 2074 0 pulseaudio
[20540.007793] [ 6777] 106 6777 38092 0 9 4 102 0 rtkit-daemon
[20540.007796] [ 6811] 1000 6811 250673 0 58 4 1217 0 evolution-sourc
[20540.007799] [ 6816] 1000 6816 77007 0 19 3 769 0 at-spi-bus-laun
[20540.007802] [ 6852] 1000 6852 1708 0 8 4 142 0 dbus-daemon
[20540.007805] [ 6861] 1000 6861 39831 0 13 4 735 0 at-spi2-registr
[20540.007808] [ 6868] 1000 6868 38763 0 11 3 210 0 dconf-service
[20540.007811] [ 6876] 0 6876 61833 0 18 3 329 0 upowerd
[20540.007814] [ 6926] 1000 6926 211543 1978 99 5 7600 0 compiz
[20540.007817] [ 6931] 1000 6931 129181 0 51 3 2071 0 goa-daemon
[20540.007820] [ 6939] 1000 6939 110468 287 38 4 1561 0 unity-panel-ser
[20540.007824] [ 6946] 1000 6946 144056 2 70 5 10038 0 evolution-calen
[20540.007827] [ 6961] 1000 6961 59960 0 19 4 361 0 goa-identity-se
[20540.007830] [ 6973] 116 6973 60995 0 21 4 1310 0 colord
[20540.007833] [ 7055] 1000 7055 240313 0 69 5 3818 0 nautilus-deskto
[20540.007836] [ 7056] 1000 7056 64823 0 29 4 1228 0 polkit-gnome-au
[20540.007840] [ 7057] 1000 7057 101651 0 33 4 1232 0 unity-fallback-
[20540.007843] [ 7118] 1000 7118 113478 0 43 5 1928 0 nm-applet
[20540.007846] [ 7198] 1000 7198 78150 0 21 3 807 0 gvfs-udisks2-vo
[20540.007850] [ 7254] 1000 7254 78059 0 19 4 405 0 gvfsd-trash
[20540.007853] [ 7267] 1000 7267 77594 0 18 3 413 0 ibus-daemon
[20540.007856] [ 7300] 1000 7300 57984 2 14 4 218 0 gvfs-goa-volume
[20540.007859] [ 7308] 1000 7308 58938 0 16 4 275 0 ibus-dconf
[20540.007862] [ 7332] 1000 7332 64878 0 30 4 1171 0 ibus-x11
[20540.007866] [ 7345] 1000 7345 58915 0 16 4 249 0 ibus-portal
[20540.007869] [ 7358] 1000 7358 40468 0 15 4 267 0 ibus-engine-sim
[20540.007872] [ 7359] 1000 7359 57953 0 14 3 225 0 gvfs-mtp-volume
[20540.007875] [ 7428] 1000 7428 78010 0 20 4 316 0 gvfs-afc-volume
[20540.007879] [ 7448] 1000 7448 58419 0 15 5 770 0 gvfs-gphoto2-vo
[20540.007882] [ 7512] 1000 7512 233505 0 75 4 10519 0 evolution-calen
[20540.007885] [ 7527] 1000 7527 130059 1 42 4 898 0 evolution-addre
[20540.007888] [ 7540] 1000 7540 203798 0 55 3 1566 0 evolution-addre
[20540.007891] [ 7667] 1000 7667 193125 0 43 5 1137 0 zeitgeist-datah
[20540.007894] [ 7676] 1000 7676 95412 0 19 3 316 0 zeitgeist-daemo
[20540.007898] [ 7682] 1000 7682 66618 0 19 4 833 0 zeitgeist-fts
[20540.007901] [ 7726] 1000 7726 102446 5 35 4 1271 0 update-notifier
[20540.007904] [ 7843] 0 7843 87227 0 35 5 1484 0 packagekitd
[20540.007907] [ 7928] 0 7928 59439 0 18 4 823 0 boltd
[20540.007910] [ 8259] 1000 8259 129503 0 50 3 1614 0 deja-dup-monito
[20540.007913] [ 8328] 1000 8328 91556 2 34 4 1530 0 unity-panel-ser
[20540.007917] [ 9281] 0 9281 3033 2 10 3 244 0 sshd
[20540.007920] [ 9362] 1000 9362 3106 0 10 3 264 0 sshd
[20540.007923] [ 9363] 1000 9363 2019 62 8 4 373 0 bash
[20540.007928] [ 9751] 0 9751 1190 0 6 4 20 0 sleep
[20540.007931] [10005] 1000 10005 4253560 2245 68 6 12484 0 testbed
[20540.007934] Out of memory: Kill process 10005 (testbed) score 0 or sacrifice child
[20540.015806] Killed process 10005 (testbed) total-vm:17014240kB, anon-rss:0kB, file-rss:8980kB, shmem-rss:0kB
[20540.059021] oom_reaper: reaped process 10005 (testbed), now anon-rss:0kB, file-rss:8948kB, shmem-rss:0kB
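According to the last two lines of this log, the program needed a total of 17 GB of memory, but the operating system could not provide it. I don't understand why that memory was not available even though there is plenty of swap space. Am I missing something?
Thanks for your attention!
**P.S.** If you know CUDA programming, the following information may help (I'm not sure). Device query output: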
$ /usr/local/cuda-10.2/samples/bin/aarch64/linux/release/deviceQuery -h
/usr/local/cuda-10.2/samples/bin/aarch64/linux/release/deviceQuery Starting...
CUDA Device Query (Runtime API) version (CUDART static linking)
Detected 1 CUDA Capable device(s)
Device 0: "NVIDIA Tegra X2"
CUDA Driver Version / Runtime Version 10.2 / 10.2
CUDA Capability Major/Minor version number: 6.2
Total amount of global memory: 7850 MBytes (8231813120 bytes)
( 2) Multiprocessors, (128) CUDA Cores/MP: 256 CUDA Cores
GPU Max Clock rate: 1300 MHz (1.30 GHz)
Memory Clock rate: 1300 Mhz
Memory Bus Width: 128-bit
L2 Cache Size: 524288 bytes
Maximum Texture Dimension Size (x,y,z) 1D=(131072), 2D=(131072, 65536), 3D=(16384, 16384, 16384)
Maximum Layered 1D Texture Size, (num) layers 1D=(32768), 2048 layers
Maximum Layered 2D Texture Size, (num) layers 2D=(32768, 32768), 2048 layers
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 49152 bytes
Total number of registers available per block: 32768
Warp size: 32
Maximum number of threads per multiprocessor: 2048
Maximum number of threads per block: 1024
Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
Max dimension size of a grid size (x,y,z): (2147483647, 65535, 65535)
Maximum memory pitch: 2147483647 bytes
Texture alignment: 512 bytes
Concurrent copy and kernel execution: Yes with 1 copy engine(s)
Run time limit on kernels: No
Integrated GPU sharing Host Memory: Yes
Support host page-locked memory mapping: Yes
Alignment requirement for Surfaces: Yes
Device has ECC support: Disabled
Device supports Unified Addressing (UVA): Yes
Device supports Compute Preemption: Yes
Supports Cooperative Kernel Launch: Yes
Supports MultiDevice Co-op Kernel Launch: Yes
Device PCI Domain ID / Bus ID / location ID: 0 / 0 / 0
Compute Mode:
< Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >
deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 10.2, CUDA Runtime Version = 10.2, NumDevs = 1
Result = PASS
Running the deep learning program under cuda-memcheck:
$ cuda-memcheck ./build/testbed --mode nerf --scene data/nerf/fox/
========= CUDA-MEMCHECK
========= Program hit cudaErrorDevicesUnavailable (error 46) due to "all CUDA-capable devices are busy or unavailable" on CUDA API call to cudaMalloc.
========= Saved host backtrace up to driver entry point at error
========= Host Frame:/usr/lib/aarch64-linux-gnu/tegra/libcuda.so.1 [0x2fdb04]
========= Host Frame:/usr/local/cuda-10.2/lib64/libcudart.so.10.2 (cudaMalloc + 0x144) [0x3b68c]
=========
15:32:40 ERROR Uncaught exception: Could not allocate memory: /home/nvidia/instant-ngp/dependencies/tiny-cuda-nn/include/tiny-cuda-nn/gpu_memory.h:124 cudaMalloc(&rawptr, n_bytes+DEBUG_GUARD_SIZE*2) failed with error all CUDA-capable devices are busy or unavailable
========= ERROR SUMMARY: 1 error
1 Answer
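After further inspection, I concluded that this was caused by a shortage of GPU memory, as indicated by the warning (GPUMemoryArena: Warning: GPU 0 does not support virtual memory. Falling back to regular allocations, which will be larger and can cause occasional stutter.).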
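The kernel log in the question points the same way. As a rough check, the sketch below redoes the arithmetic with figures copied from that log (the 4 kB page size is read off the order-0 entries in the buddy lists). Two points are my assumptions rather than something the log proves: that buffers allocated through nvmap (the allocator in the call trace) must stay in physical RAM and cannot be paged out to swap, and that the free CMA pages in the DMA zone were not usable for this GFP_KERNEL request.

```python
PAGE_KB = 4  # order-0 block size shown in the buddy lists of the log

# Figures copied from the OOM report in the question.
total_vm_pages = 4253560    # total_vm column for testbed (pid 10005)
normal_free_kb = 33892      # Normal zone: free
normal_min_kb = 33568       # Normal zone: min watermark
dma_free_kb = 628456        # DMA zone: free
dma_free_cma_kb = 602084    # DMA zone: free_cma (included in free)
free_swap_kb = 131534904    # "Free swap" line

# total-vm is the size of the virtual address space, not memory in use.
total_vm_kb = total_vm_pages * PAGE_KB

# Physical memory left above the Normal zone's minimum watermark.
normal_headroom_kb = normal_free_kb - normal_min_kb

# Free DMA-zone memory outside the CMA region (assumed to be the only
# part of that zone this allocation could have used).
dma_non_cma_free_kb = dma_free_kb - dma_free_cma_kb

print(f"total-vm:            {total_vm_kb} kB")
print(f"Normal zone headroom: {normal_headroom_kb} kB")
print(f"DMA free outside CMA: {dma_non_cma_free_kb} kB")
print(f"Free swap:            {free_swap_kb / 1024 / 1024:.1f} GiB")
```

The product reproduces the total-vm value the kernel printed, so the 17 GB is reserved address space rather than resident memory (the same line reports anon-rss:0kB). What ran out was physical RAM: the Normal zone was sitting almost exactly at its minimum watermark while nearly all of the swap was still free.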
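I found and ran the tegrastats program, which is provided with the Tegra version of the CUDA toolkit. It showed that the deep learning program was killed at about the point where GPU memory became full.
I need to modify the program so that it loads less data into GPU memory.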
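For anyone who wants to watch this happen, here is a minimal sketch of a parser for the memory fields of a tegrastats line. It assumes the usual `RAM used/totalMB` and `SWAP used/totalMB` fields; the sample line is made up for illustration and was not captured on the machine in the question, and `parse_tegrastats_line` is a hypothetical helper name.

```python
import re

# Assumed field layout: "RAM <used>/<total>MB ... SWAP <used>/<total>MB ..."
RAM_RE = re.compile(r"\bRAM (\d+)/(\d+)MB")
SWAP_RE = re.compile(r"\bSWAP (\d+)/(\d+)MB")


def parse_tegrastats_line(line):
    """Return (ram_used, ram_total, swap_used, swap_total) in MB, or None."""
    ram = RAM_RE.search(line)
    swap = SWAP_RE.search(line)
    if ram is None or swap is None:
        return None
    return tuple(int(value) for value in ram.groups() + swap.groups())


# Hypothetical sample line, NOT real output from the machine in the question.
sample = "RAM 7612/7850MB (lfb 2x1MB) SWAP 391/128925MB (cached 12MB) CPU [...]"

ram_used, ram_total, swap_used, swap_total = parse_tegrastats_line(sample)
print(f"RAM  {100 * ram_used / ram_total:.0f}% used")
print(f"SWAP {100 * swap_used / swap_total:.1f}% used")
```

Because the deviceQuery output in the question reports "Integrated GPU sharing Host Memory: Yes", the RAM field covers GPU allocations as well. Feeding the tool's output through this function line by line while testbed runs should show RAM approaching its total while swap stays nearly empty.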