vllm: error when running HIPGraph with TP 8

n7taea2i · asked 2 months ago · in: Other

Command run:

python benchmark_throughput.py -tp 8 --model meta-llama_Llama-2-70b-chat-hf --dataset ShareGPT_V3_unfiltered_cleaned_split.json

Error log:

...
ensor_parallel_size=8, quantization=None, enforce_eager=False, seed=0)                                                                                                                                      
(RayWorkerVllm pid=343834) WARNING[XFORMERS]: xFormers can't load C++/CUDA extensions. xFormers was built for:                                                                                               
(RayWorkerVllm pid=343834)     PyTorch 2.1.1+cu121 with CUDA 1201 (you have 2.1.1+rocm5.6)                                                                                                                   
(RayWorkerVllm pid=343834)     Python  3.10.13 (you have 3.10.13)                                                                                                                                            
(RayWorkerVllm pid=343834)   Please reinstall xformers (see https://github.com/facebookresearch/xformers#installing-xformers)                                                                                
(RayWorkerVllm pid=343834)   Memory-efficient attention, SwiGLU, sparse and more won't be available.                                                                                                         
(RayWorkerVllm pid=343834)   Set XFORMERS_MORE_DETAILS=1 for more details                                                                                                                                    
INFO 12-20 08:54:51 llm_engine.py:223] # GPU blocks: 64920, # CPU blocks: 6553                                                                                                                               
(RayWorkerVllm pid=343834) INFO 12-20 08:54:51 model_runner.py:394] Capturing the model for CUDA graphs. This may lead to unexpected consequences if the model is not static. To run the model in eager mode,
 set 'enforce_eager=True' or use '--enforce-eager' in the CLI.                                                                                                                                               
(RayWorkerVllm pid=343834) [W HIPGraph.cpp:146] Warning: Waiting for pending NCCL work to finish before starting graph capture. (function operator())                                                        
(RayWorkerVllm pid=343839) WARNING[XFORMERS]: xFormers can't load C++/CUDA extensions. xFormers was built for: [repeated 7x across cluster] (Ray deduplicates logs by default. Set RAY_DEDUP_LOGS=0 to disabl
e log deduplication, or see https://docs.ray.io/en/master/ray-observability/ray-logging.html#log-deduplication for more options.)                                                                            
(RayWorkerVllm pid=343839)     PyTorch 2.1.1+cu121 with CUDA 1201 (you have 2.1.1+rocm5.6) [repeated 7x across cluster]                                                                                      
(RayWorkerVllm pid=343839)     Python  3.10.13 (you have 3.10.13) [repeated 7x across cluster]                                                                                                               
(RayWorkerVllm pid=343839)   Please reinstall xformers (see https://github.com/facebookresearch/xformers#installing-xformers) [repeated 7x across cluster]                                                   
(RayWorkerVllm pid=343839)   Memory-efficient attention, SwiGLU, sparse and more won't be available. [repeated 7x across cluster]                                                                            
(RayWorkerVllm pid=343839)   Set XFORMERS_MORE_DETAILS=1 for more details [repeated 7x across cluster]                                                                                                       
(RayWorkerVllm pid=343834) INFO 12-20 08:55:26 model_runner.py:437] Graph capturing finished in 35 secs.                                                                                                     
(RayWorkerVllm pid=343837) INFO 12-20 08:54:51 model_runner.py:394] Capturing the model for CUDA graphs. This may lead to unexpected consequences if the model is not static. To run the model in eager mode,
 set 'enforce_eager=True' or use '--enforce-eager' in the CLI. [repeated 7x across cluster]                                                                                                                  
Processed prompts:  48%|█████████████████████████████████████████████████████████████████████▉                                                                            | 479/1000 [03:28<03:53,  2.23it/s]
(RayWorkerVllm pid=343836) /pytorch/aten/src/ATen/native/hip/IndexKernel.hip:94: operator(): Device-side assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"' failed.                 
... (the same device-side assertion is repeated dozens of times) ...
(RayWorkerVllm pid=343836) :0:rocdevice.cpp            :2778: 8142196379168 us: 343836: [tid:0x7ed990ff9640] Callback: Queue 0x7ed86a000000 aborting with error : HSA_STATUS_ERROR_EXCEPTION: An HSAIL operat
ion resulted in a hardware exception. code: 0x1016                                                                                                                                                           
(RayWorkerVllm pid=343836) *** SIGABRT received at time=1703062736 on cpu 75 ***                                                                                                                             
(RayWorkerVllm pid=343836) PC: @     0x7f0a6505aa7c  (unknown)  pthread_kill                                                                                                                                 
(RayWorkerVllm pid=343836)     @     0x7f0a65006520  (unknown)  (unknown)                                                                                                                                    
(RayWorkerVllm pid=343836) [2023-12-20 08:58:56,718 E 343836 345039] logging.cc:361: *** SIGABRT received at time=1703062736 on cpu 75 ***                                                                   
(RayWorkerVllm pid=343836) [2023-12-20 08:58:56,718 E 343836 345039] logging.cc:361: PC: @     0x7f0a6505aa7c  (unknown)  pthread_kill                                                                       
(RayWorkerVllm pid=343836) [2023-12-20 08:58:56,719 E 343836 345039] logging.cc:361:     @     0x7f0a65006520  (unknown)  (unknown)                                                                          
(RayWorkerVllm pid=343836) Fatal Python error: Aborted                                                                                                                                                       
(RayWorkerVllm pid=343836)                                                                                                                                                                                   
(RayWorkerVllm pid=343836)                                                                                                                                                                                   
(RayWorkerVllm pid=343836) Extension modules: msgpack._cmsgpack, google._upb._message, psutil._psutil_linux, psutil._psutil_posix, setproctitle, yaml._yaml, uvloop.loop, ray._raylet, numpy.core._multiarray
_umath, numpy.core._multiarray_tests, numpy.linalg._umath_linalg, numpy.fft._pocketfft_internal, numpy.random._common, numpy.random.bit_generator, numpy.random._bounded_integers, numpy.random._mt19937, num
py.random.mtrand, numpy.random._philox, numpy.random._pcg64, numpy.random._sfc64, numpy.random._generator, torch._C, torch._C._fft, torch._C._linalg, torch._C._nested, torch._C._nn, torch._C._sparse, torch
._C._special, pydantic.typing, pydantic.errors, pydantic.version, pydantic.utils, pydantic.class_validators, pydantic.config, pydantic.color, pydantic.datetime_parse, pydantic.validators, pydantic.networks
, pydantic.types, pydantic.json, pydantic.error_wrappers, pydantic.fields, pydantic.parse, pydantic.schema, pydantic.main, pydantic.dataclasses, pydantic.annotated_types, pydantic.decorator, pydantic.env_s
ettings, pydantic.tools, pydantic, sentencepiece._sentencepiece, pyarrow.lib, pyarrow._hdfsio, pyarrow._fs, pyarrow._hdfs, pyarrow._gcsfs, pyarrow._s3fs, pandas._libs.tslibs.np_datetime, pandas._libs.tslib
s.dtypes, pandas._libs.tslibs.base, pandas._libs.tslibs.nattype, pandas._libs.tslibs.timezones, pandas._libs.tslibs.ccalendar, pandas._libs.tslibs.fields, pandas._libs.tslibs.timedeltas, pandas._libs.tslib
s.tzconversion, pandas._libs.tslibs.timestamps, pandas._libs.properties, pandas._libs.tslibs.offsets, pandas._libs.tslibs.strptime, pandas._libs.tslibs.parsing, pandas._libs.tslibs.conversion, pandas._libs
.tslibs.period, pandas._libs.tslibs.vectorized, pandas._libs.ops_dispatch, pandas._libs.missing, pandas._libs.hashtable, pandas._libs.algos, pandas._libs.interval, pandas._libs.lib, pandas._libs.ops, pyarr
ow._compute, pandas._libs.arrays, pandas._libs.tslib, pandas._libs.sparse, pandas._libs.indexing, pandas._libs.index, pandas._libs.internals, pandas._libs.join, pandas._libs.writers, pandas._libs.window.ag
gregations, pandas._libs.window.indexers, pandas._libs.reshape, pandas._libs.groupby, pandas._libs.json, pandas._libs.parsers, pandas._libs.testing, pyarrow._parquet, pyarrow._json (total: 100)
(RayWorkerVllm pid=343839) [W HIPGraph.cpp:146] Warning: Waiting for pending NCCL work to finish before starting graph capture. (function operator()) [repeated 7x across cluster]                           
2023-12-20 08:58:56,879 WARNING worker.py:2074 -- A worker died or was killed while executing a task by an unexpected system error. To troubleshoot the problem, check the logs for the dead worker. RayTask 
ID: ffffffffffffffff1eb792b3342f34ed845a43fa01000000 Worker ID: e9bb8aa7fda24dd067e8ad733d4e24189714c3bed733752ec28abd7b Node ID: d66648ef8fe477f6e2f72c34993065a8f40a43fd53938766db8f6c7a Worker IP address:
 172.17.0.2 Worker port: 33347 Worker PID: 343836 Worker exit type: SYSTEM_ERROR Worker exit detail: Worker unexpectedly exits with a connection error code 2. End of file. There are some potential root cau
ses. (1) The process is killed by SIGKILL by OOM killer due to high memory usage. (2) ray stop --force is called. (3) The worker is crashed unexpectedly due to SIGSEGV or other unexpected errors.          
Traceback (most recent call last):                                                                                                                                                                           
  File "/home/aac/apps/vllm-rocm/vllm-rocm/benchmarks/benchmark_throughput.py", line 318, in <module>                                                                                                        
    main(args)                                                                                                                                                                                               
  File "/home/aac/apps/vllm-rocm/vllm-rocm/benchmarks/benchmark_throughput.py", line 205, in main                                                                                                            
    elapsed_time = run_vllm(requests, args.model, args.tokenizer,                                                                                                                                            
  File "/home/aac/apps/vllm-rocm/vllm-rocm/benchmarks/benchmark_throughput.py", line 107, in run_vllm                                                                                                        
    llm._run_engine(use_tqdm=True)                                                                                                                                                                           
  File "/home/aac/libs/anaconda3/envs/vllm-rocm/lib/python3.10/site-packages/vllm-0.2.6+rocm563-py3.10-linux-x86_64.egg/vllm/entrypoints/llm.py", line 185, in _run_engine                                   
    step_outputs = self.llm_engine.step()                                                                                                                                                                    
  File "/home/aac/libs/anaconda3/envs/vllm-rocm/lib/python3.10/site-packages/vllm-0.2.6+rocm563-py3.10-linux-x86_64.egg/vllm/engine/llm_engine.py", line 589, in step                                        
    output = self._run_workers(                                                                                                                                                                              
  File "/home/aac/libs/anaconda3/envs/vllm-rocm/lib/python3.10/site-packages/vllm-0.2.6+rocm563-py3.10-linux-x86_64.egg/vllm/engine/llm_engine.py", line 763, in _run_workers                                
    self._run_workers_in_batch(workers, method, *args, **kwargs))                                                                                                                                            
  File "/home/aac/libs/anaconda3/envs/vllm-rocm/lib/python3.10/site-packages/vllm-0.2.6+rocm563-py3.10-linux-x86_64.egg/vllm/engine/llm_engine.py", line 740, in _run_workers_in_batch                       
    all_outputs = ray.get(all_outputs)                                                                                                                                                                       
  File "/home/aac/libs/anaconda3/envs/vllm-rocm/lib/python3.10/site-packages/ray-2.8.1-py3.10-linux-x86_64.egg/ray/_private/auto_init_hook.py", line 24, in auto_init_wrapper                                
    return fn(*args, **kwargs)                                                                                                                                                                               
  File "/home/aac/libs/anaconda3/envs/vllm-rocm/lib/python3.10/site-packages/ray-2.8.1-py3.10-linux-x86_64.egg/ray/_private/client_mode_hook.py", line 103, in wrapper                                       
    return func(*args, **kwargs)                                                                                                                                                                             
  File "/home/aac/libs/anaconda3/envs/vllm-rocm/lib/python3.10/site-packages/ray-2.8.1-py3.10-linux-x86_64.egg/ray/_private/worker.py", line 2565, in get                                                    
    raise value                                                                                                                                                                                              
ray.exceptions.RayActorError: The actor died unexpectedly before finishing this task.                                                                                                                        
        class_name: RayWorkerVllm                                                                                                                                                                            
        actor_id: 1eb792b3342f34ed845a43fa01000000                                                                                                                                                           
        pid: 343836                                                                                                                                                                                          
        namespace: 83333617-d010-403c-97cf-94e07ad5000f                                                                                                                                                      
        ip: 172.17.0.2                                                                                                                                                                                       
The actor is dead because its worker process has died. Worker exit type: SYSTEM_ERROR Worker exit detail: Worker unexpectedly exits with a connection error code 2. End of file. There are some potential roo
t causes. (1) The process is killed by SIGKILL by OOM killer due to high memory usage. (2) ray stop --force is called. (3) The worker is crashed unexpectedly due to SIGSEGV or other unexpected errors.     
2023-12-20 08:58:56,988 WARNING worker.py:2074 -- A worker died or was killed while executing a task by an unexpected system error. To troubleshoot the problem, check the logs for the dead worker. RayTask 
ID: ffffffffffffffff356fd56fbe6a8304644da78601000000 Worker ID: 127d225aef840be899f32c0664b19be0d2112ace50eb50e6c61ecab6 Node ID: d66648ef8fe477f6e2f72c34993065a8f40a43fd53938766db8f6c7a Worker IP address:
 172.17.0.2 Worker port: 45847 Worker PID: 343840 Worker exit type: SYSTEM_ERROR Worker exit detail: Worker unexpectedly exits with a connection error code 2. End of file. There are some potential root cau
ses. (1) The process is killed by SIGKILL by OOM killer due to high memory usage. (2) ray stop --force is called. (3) The worker is crashed unexpectedly due to SIGSEGV or other unexpected errors.
2023-12-20 08:58:56,989 WARNING worker.py:2074 -- A worker died or was killed while executing a task by an unexpected system error. To troubleshoot the problem, check the logs for the dead worker. RayTask 
ID: ffffffffffffffff6c3fb285bcc356e35f7883ed01000000 Worker ID: 92a75c925225ae7fec52fa4c4974f0841e3f8fd3a0482e82defd9b56 Node ID: d66648ef8fe477f6e2f72c34993065a8f40a43fd53938766db8f6c7a Worker IP address:
 172.17.0.2 Worker port: 38807 Worker PID: 343834 Worker exit type: SYSTEM_ERROR Worker exit detail: Worker unexpectedly exits with a connection error code 2. End of file. There are some potential root cau
ses. (1) The process is killed by SIGKILL by OOM killer due to high memory usage. (2) ray stop --force is called. (3) The worker is crashed unexpectedly due to SIGSEGV or other unexpected errors.
2023-12-20 08:58:57,007 WARNING worker.py:2074 -- A worker died or was killed while executing a task by an unexpected system error. To troubleshoot the problem, check the logs for the dead worker. RayTask 
ID: ffffffffffffffff6c3fb285bcc356e35f7883ed01000000 Worker ID: 92a75c925225ae7fec52fa4c4974f0841e3f8fd3a0482e82defd9b56 Node ID: d66648ef8fe477f6e2f72c34993065a8f40a43fd53938766db8f6c7a Worker IP address: 172.17.0.2 Worker port: 38807 Worker PID: 343834 Worker exit type: SYSTEM_ERROR Worker exit detail: Worker unexpectedly exits with a connection error code 2. End of file. There are some potential root causes. (1) The process is killed by SIGKILL by OOM killer due to high memory usage. (2) ray stop --force is called. (3) The worker is crashed unexpectedly due to SIGSEGV or other unexpected errors.
2023-12-20 08:58:57,007 WARNING worker.py:2074 -- A worker died or was killed while executing a task by an unexpected system error. To troubleshoot the problem, check the logs for the dead worker. RayTask ID: fffffffffffffffff0750ef9d99ba21c366a30a501000000 Worker ID: 1eba0f212abc9d3f83cd39cf4a6d1da0a6e300e882473d216998def4 Node ID: d66648ef8fe477f6e2f72c34993065a8f40a43fd53938766db8f6c7a Worker IP address: 172.17.0.2 Worker port: 44109 Worker PID: 343835 Worker exit type: SYSTEM_ERROR Worker exit detail: Worker unexpectedly exits with a connection error code 2. End of file. There are some potential root causes. (1) The process is killed by SIGKILL by OOM killer due to high memory usage. (2) ray stop --force is called. (3) The worker is crashed unexpectedly due to SIGSEGV or other unexpected errors.
2023-12-20 08:58:57,012 WARNING worker.py:2074 -- A worker died or was killed while executing a task by an unexpected system error. To troubleshoot the problem, check the logs for the dead worker. RayTask ID: ffffffffffffffff972d660ef7ea553cff93b2eb01000000 Worker ID: 6027b8e1cf656e3b569024c7dcb6e4854793942bb642f22ad919a820 Node ID: d66648ef8fe477f6e2f72c34993065a8f40a43fd53938766db8f6c7a Worker IP address: 172.17.0.2 Worker port: 42809 Worker PID: 343833 Worker exit type: SYSTEM_ERROR Worker exit detail: Worker unexpectedly exits with a connection error code 2. End of file. There are some potential root causes. (1) The process is killed by SIGKILL by OOM killer due to high memory usage. (2) ray stop --force is called. (3) The worker is crashed unexpectedly due to SIGSEGV or other unexpected errors.
2023-12-20 08:58:57,077 WARNING worker.py:2074 -- A worker died or was killed while executing a task by an unexpected system error. To troubleshoot the problem, check the logs for the dead worker. RayTask ID: ffffffffffffffff3c787e9c74cdae7903f4c7bb01000000 Worker ID: c38325cc5acd845557107b4255482cba6c65f1180a9ff73161b5bfcd Node ID: d66648ef8fe477f6e2f72c34993065a8f40a43fd53938766db8f6c7a Worker IP address: 172.17.0.2 Worker port: 36369 Worker PID: 343839 Worker exit type: SYSTEM_ERROR Worker exit detail: Worker unexpectedly exits with a connection error code 2. End of file. There are some potential root causes. (1) The process is killed by SIGKILL by OOM killer due to high memory usage. (2) ray stop --force is called. (3) The worker is crashed unexpectedly due to SIGSEGV or other unexpected errors.
2023-12-20 08:58:57,134 WARNING worker.py:2074 -- A worker died or was killed while executing a task by an unexpected system error. To troubleshoot the problem, check the logs for the dead worker. RayTask ID: ffffffffffffffff8253f59aff950e98e64eb3bc01000000 Worker ID: 39144eca4ab118e32a72ac2791cdb101ca37b4d17cc871859213e80f Node ID: d66648ef8fe477f6e2f72c34993065a8f40a43fd53938766db8f6c7a Worker IP address: 172.17.0.2 Worker port: 44531 Worker PID: 343837 Worker exit type: SYSTEM_ERROR Worker exit detail: Worker unexpectedly exits with a connection error code 2. End of file. There are some potential root causes. (1) The process is killed by SIGKILL by OOM killer due to high memory usage. (2) ray stop --force is called. (3) The worker is crashed unexpectedly due to SIGSEGV or other unexpected errors.
2023-12-20 08:58:57,156 WARNING worker.py:2074 -- A worker died or was killed while executing a task by an unexpected system error. To troubleshoot the problem, check the logs for the dead worker. RayTask ID: ffffffffffffffffa6d1af73a1a334451a12b33201000000 Worker ID: bede58021b8753552e961ab328ca01afb3b0c38eca8ca7f95efe729b Node ID: d66648ef8fe477f6e2f72c34993065a8f40a43fd53938766db8f6c7a Worker IP address: 172.17.0.2 Worker port: 33089 Worker PID: 343838 Worker exit type: SYSTEM_ERROR Worker exit detail: Worker unexpectedly exits with a connection error code 2. End of file. There are some potential root causes. (1) The process is killed by SIGKILL by OOM killer due to high memory usage. (2) ray stop --force is called. (3) The worker is crashed unexpectedly due to SIGSEGV or other unexpected errors.
(RayWorkerVllm pid=343837) INFO 12-20 08:55:26 model_runner.py:437] Graph capturing finished in 35 secs. [repeated 7x across cluster]
(RayWorkerVllm pid=343837) /pytorch/aten/src/ATen/native/hip/IndexKernel.hip:94: operator(): Device-side assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"' failed. [repeated 448x across cluster]
(RayWorkerVllm pid=343837) :0:rocdevice.cpp            :2778: 8142196625371 us: 343837: [tid:0x7f9f777fe640] Callback: Queue 0x7f9f75200000 aborting with error : HSA_STATUS_ERROR_EXCEPTION: An HSAIL operation resulted in a hardware exception. code: 0x1016 [repeated 7x across cluster]
(RayWorkerVllm pid=343837) *** SIGABRT received at time=1703062736 on cpu 66 *** [repeated 7x across cluster]
(RayWorkerVllm pid=343837) PC: @     0x7fd067f97a7c  (unknown)  pthread_kill [repeated 7x across cluster]
(RayWorkerVllm pid=343837)     @     0x7fd067f43520  (unknown)  (unknown) [repeated 7x across cluster]
(RayWorkerVllm pid=343837) [2023-12-20 08:58:56,965 E 343837 345036] logging.cc:361: *** SIGABRT received at time=1703062736 on cpu 66 *** [repeated 7x across cluster]
(RayWorkerVllm pid=343837) [2023-12-20 08:58:56,965 E 343837 345036] logging.cc:361: PC: @     0x7fd067f97a7c  (unknown)  pthread_kill [repeated 7x across cluster]
(RayWorkerVllm pid=343837) [2023-12-20 08:58:56,965 E 343837 345036] logging.cc:361:     @     0x7fd067f43520  (unknown)  (unknown) [repeated 7x across cluster]
(RayWorkerVllm pid=343837) Fatal Python error: Aborted [repeated 7x across cluster]
(RayWorkerVllm pid=343837)  [repeated 14x across cluster]
(RayWorkerVllm pid=343837) Extension modules: msgpack._cmsgpack, google._upb._message, psutil._psutil_linux, psutil._psutil_posix, setproctitle, yaml._yaml, uvloop.loop, ray._raylet, numpy.core._multiarray_umath, numpy.core._multiarray_tests, numpy.linalg._umath_linalg, numpy.fft._pocketfft_internal, numpy.random._common, numpy.random.bit_generator, numpy.random._bounded_integers, numpy.random._mt19937, numpy.random.mtrand, numpy.random._philox, numpy.random._pcg64, numpy.random._sfc64, numpy.random._generator, torch._C, torch._C._fft, torch._C._linalg, torch._C._nested, torch._C._nn, torch._C._sparse, torch._C._special, pydantic.typing, pydantic.errors, pydantic.version, pydantic.utils, pydantic.class_validators, pydantic.config, pydantic.color, pydantic.datetime_parse, pydantic.validators, pydantic.networks, pydantic.types, pydantic.json, pydantic.error_wrappers, pydantic.fields, pydantic.parse, pydantic.schema, pydantic.main, pydantic.dataclasses, pydantic.annotated_types, pydantic.decorator, pydantic.env_settings, pydantic.tools, pydantic, sentencepiece._sentencepiece, pyarrow.lib, pyarrow._hdfsio, pyarrow._fs, pyarrow._hdfs, pyarrow._gcsfs, pyarrow._s3fs, pandas._libs.tslibs.np_datetime, pandas._libs.tslibs.dtypes, pandas._libs.tslibs.base, pandas._libs.tslibs.nattype, pandas._libs.tslibs.timezones, pandas._libs.tslibs.ccalendar, pandas._libs.tslibs.fields, pandas._libs.tslibs.timedeltas, pandas._libs.tslibs.tzconversion, pandas._libs.tslibs.timestamps, pandas._libs.properties, pandas._libs.tslibs.offsets, pandas._libs.tslibs.strptime, pandas._libs.tslibs.parsing, pandas._libs.tslibs.conversion, pandas._libs.tslibs.period, pandas._libs.tslibs.vectorized, pandas._libs.ops_dispatch, pandas._libs.missing, pandas._libs.hashtable, pandas._libs.algos, pandas._libs.interval, pandas._libs.lib, pandas._libs.ops, pyarrow._compute, pandas._libs.arrays, pandas._libs.tslib, pandas._libs.sparse, pandas._libs.indexing, pandas._libs.index, pandas._libs.internals, pandas._libs.join, pandas._libs.writers, pandas._libs.window.aggregations, pandas._libs.window.indexers, pandas._libs.reshape, pandas._libs.groupby, pandas._libs.json, pandas._libs.parsers, pandas._libs.testing, pyarrow._parquet, pyarrow._json (total: 100) [repeated 7x across cluster]
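
Per the model_runner.py warning in the log above, graph capture can be skipped by forcing eager mode. The following is only a minimal Python sketch of that workaround, reusing the local model path and TP=8 from the command above; the prompt and sampling values are placeholders, and whether eager mode actually avoids this crash has not been verified here.

# Hedged workaround sketch: disable HIP/CUDA graph capture by forcing eager mode,
# as suggested by the model_runner.py:394 warning in the log above.
# The prompt and sampling values are placeholders; adjust to your setup.
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama_Llama-2-70b-chat-hf",  # same local model path as in the command above
    tensor_parallel_size=8,                  # TP=8, matching the failing run
    enforce_eager=True,                      # skip CUDA/HIP graph capture entirely
)

sampling_params = SamplingParams(temperature=0.8, max_tokens=128)
outputs = llm.generate(["Hello, my name is"], sampling_params)
for output in outputs:
    print(output.outputs[0].text)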
e4eetjau 1#

Same error when using Mixtral 8x7B and TP=4.

cgh8pdjw 2#

Same error with CUDA graphs when using Qwen-72B and TP=8.

gz5pxeao 3#

Same error when using Mixtral 8x7B and TP=4.
