Your current environment
root@cy-ah85026:/vllm-workspace# ray status
======== Autoscaler status: 2024-08-02 02:04:32.248220 ========
Node status
---------------------------------------------------------------
Active:
1 node_a689bb663bd4154a54614777bcc40f9c44b7dbd5c83f6cd7862b09cd
1 node_b07cdf56b532f9dfd446b08e9ddb2f1c9f4c46ab600779999807e482
Pending:
(no pending nodes)
Recent failures:
(no failures)
Resources
---------------------------------------------------------------
Usage:
0.0/256.0 CPU
0.0/16.0 GPU
0B/695.32GiB memory
0B/301.99GiB object_store_memory
Demands:
(no resource demands)
How would you like to use vllm
root@cy-ah85026:/vllm-workspace# vllm serve /mnt/cpn-pod/models/meta-llama/Meta-Llama-3.1-405B-Instruct-FP8 --served-model-name meta-llama/Meta-Llama-3.1-405B-Instruct-FP8 -tp 16 --distributed-executor-backend=ray --max-model-len 4096
INFO 08-02 01:42:35 api_server.py:219] vLLM API server version 0.5.3.post1
INFO 08-02 01:42:35 api_server.py:220] args: Namespace(model_tag='/mnt/cpn-pod/models/meta-llama/Meta-Llama-3.1-405B-Instruct-FP8', host=None, port=8000, uvicorn_log_level='info', allow_credentials=False, allowed_origins=['*'], allowed_methods=['*'], allowed_headers=['*'], api_key=None, lora_modules=None, prompt_adapters=None, chat_template=None, response_role='assistant', ssl_keyfile=None, ssl_certfile=None, ssl_ca_certs=None, ssl_cert_reqs=0, root_path=None, middleware=[], model='/mnt/cpn-pod/models/meta-llama/Meta-Llama-3.1-405B-Instruct-FP8', tokenizer=None, skip_tokenizer_init=False, revision=None, code_revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=False, download_dir=None, load_format='auto', dtype='auto', kv_cache_dtype='auto', quantization_param_path=None, max_model_len=4096, guided_decoding_backend='outlines', distributed_executor_backend='ray', worker_use_ray=False, pipeline_parallel_size=1, tensor_parallel_size=16, max_parallel_loading_workers=None, ray_workers_use_nsight=False, block_size=16, enable_prefix_caching=False, disable_sliding_window=False, use_v2_block_manager=False, num_lookahead_slots=0, seed=0, swap_space=4, cpu_offload_gb=0, gpu_memory_utilization=0.9, num_gpu_blocks_override=None, max_num_batched_tokens=None, max_num_seqs=256, max_logprobs=20, disable_log_stats=False, quantization=None, rope_scaling=None, rope_theta=None, enforce_eager=False, max_context_len_to_capture=None, max_seq_len_to_capture=8192, disable_custom_all_reduce=False, tokenizer_pool_size=0, tokenizer_pool_type='ray', tokenizer_pool_extra_config=None, enable_lora=False, max_loras=1, max_lora_rank=16, lora_extra_vocab_size=256, lora_dtype='auto', long_lora_scaling_factors=None, max_cpu_loras=None, fully_sharded_loras=False, enable_prompt_adapter=False, max_prompt_adapters=1, max_prompt_adapter_token=0, device='auto', scheduler_delay_factor=0.0, enable_chunked_prefill=None, speculative_model=None, num_speculative_tokens=None, speculative_draft_tensor_parallel_size=None, speculative_max_model_len=None, speculative_disable_by_batch_size=None, ngram_prompt_lookup_max=None, ngram_prompt_lookup_min=None, spec_decoding_acceptance_method='rejection_sampler', typical_acceptance_sampler_posterior_threshold=None, typical_acceptance_sampler_posterior_alpha=None, disable_logprobs_during_spec_decoding=None, model_loader_extra_config=None, ignore_patterns=[], preemption_mode=None, served_model_name=['meta-llama/Meta-Llama-3.1-405B-Instruct-FP8'], qlora_adapter_name_or_path=None, otlp_traces_endpoint=None, engine_use_ray=False, disable_log_requests=False, max_log_len=None, dispatch_function=<function serve at 0x7f470f28cd30>)
2024-08-02 01:42:35,385 INFO worker.py:1603 -- Connecting to existing Ray cluster at address: 192.168.200.178:6379...
2024-08-02 01:42:35,394 INFO worker.py:1788 -- Connected to Ray cluster.
INFO 08-02 01:42:35 llm_engine.py:176] Initializing an LLM engine (v0.5.3.post1) with config: model='/mnt/cpn-pod/models/meta-llama/Meta-Llama-3.1-405B-Instruct-FP8', speculative_config=None, tokenizer='/mnt/cpn-pod/models/meta-llama/Meta-Llama-3.1-405B-Instruct-FP8', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, rope_scaling=None, rope_theta=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=4096, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=16, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=fbgemm_fp8, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None), seed=0, served_model_name=meta-llama/Meta-Llama-3.1-405B-Instruct-FP8, use_v2_block_manager=False, enable_prefix_caching=False)
INFO 08-02 01:43:14 utils.py:784] Found nccl from library libnccl.so.2
INFO 08-02 01:43:14 pynccl.py:63] vLLM is using nccl==2.20.5
(RayWorkerWrapper pid=3886, ip=192.168.200.191) INFO 08-02 01:43:14 utils.py:784] Found nccl from library libnccl.so.2
(RayWorkerWrapper pid=3886, ip=192.168.200.191) INFO 08-02 01:43:14 pynccl.py:63] vLLM is using nccl==2.20.5
WARNING 08-02 01:43:14 custom_all_reduce.py:69] Custom allreduce is disabled because this process group spans across nodes.
INFO 08-02 01:43:14 shm_broadcast.py:241] vLLM message queue communication handle: Handle(connect_ip='192.168.200.178', local_reader_ranks=[1, 2, 3, 4, 5, 6, 7], buffer=<vllm.distributed.device_communicators.shm_broadcast.ShmRingBuffer object at 0x7f46f42f3dc0>, local_subscribe_port=48177, local_sync_port=39553, remote_subscribe_port=52373, remote_sync_port=39723)
(RayWorkerWrapper pid=16311) WARNING 08-02 01:43:14 custom_all_reduce.py:69] Custom allreduce is disabled because this process group spans across nodes.
INFO 08-02 01:43:14 model_runner.py:680] Starting to load model /mnt/cpn-pod/models/meta-llama/Meta-Llama-3.1-405B-Instruct-FP8...
(RayWorkerWrapper pid=16311) INFO 08-02 01:43:14 model_runner.py:680] Starting to load model /mnt/cpn-pod/models/meta-llama/Meta-Llama-3.1-405B-Instruct-FP8...
Loading safetensors checkpoint shards: 0% Completed | 0/109 [00:00<?, ?it/s]
[... truncated for brevity ...]
Loading safetensors checkpoint shards: 95% Completed | 104/109 [13:55<00:41, 8.40s/it]
(RayWorkerWrapper pid=4418, ip=192.168.200.191) INFO 08-02 01:57:11 model_runner.py:692] Loading model weights took 28.8984 GB
(RayWorkerWrapper pid=16845) INFO 08-02 01:43:14 utils.py:784] Found nccl from library libnccl.so.2 [repeated 14x across cluster] (Ray deduplicates logs by default. Set RAY_DEDUP_LOGS=0 to disable log deduplication, or see https://docs.ray.io/en/master/ray-observability/user-guides/configure-logging.html#log-deduplication for more options.)
(RayWorkerWrapper pid=16845) INFO 08-02 01:43:14 pynccl.py:63] vLLM is using nccl==2.20.5 [repeated 14x across cluster]
(RayWorkerWrapper pid=4507, ip=192.168.200.191) WARNING 08-02 01:43:14 custom_all_reduce.py:69] Custom allreduce is disabled because this process group spans across nodes. [repeated 14x across cluster]
(RayWorkerWrapper pid=4507, ip=192.168.200.191) INFO 08-02 01:43:14 model_runner.py:680] Starting to load model /mnt/cpn-pod/models/meta-llama/Meta-Llama-3.1-405B-Instruct-FP8... [repeated 14x across cluster]
Loading safetensors checkpoint shards: 96% Completed | 105/109 [14:03<00:32, 8.18s/it]
Loading safetensors checkpoint shards: 97% Completed | 106/109 [14:05<00:19, 6.36s/it]
Loading safetensors checkpoint shards: 98% Completed | 107/109 [14:06<00:09, 4.80s/it]
Loading safetensors checkpoint shards: 99% Completed | 108/109 [14:18<00:06, 6.87s/it]
Loading safetensors checkpoint shards: 100% Completed | 109/109 [14:25<00:00, 7.09s/it]
Loading safetensors checkpoint shards: 100% Completed | 109/109 [14:25<00:00, 7.94s/it]
INFO 08-02 01:57:41 model_runner.py:692] Loading model weights took 28.8984 GB
(RayWorkerWrapper pid=16845) INFO 08-02 01:57:41 model_runner.py:692] Loading model weights took 28.8984 GB [repeated 8x across cluster]
(RayWorkerWrapper pid=4329, ip=192.168.200.191) void cutlass::arch::Mma<cutlass::gemm::GemmShape<16, 8, 32>, 32, cutlass::float_e4m3_t, cutlass::layout::RowMajor, cutlass::float_e4m3_t, cutlass::layout::ColumnMajor, float, cutlass::layout::RowMajor, Operator_>::operator()(cutlass::Array<float, 4, true> &, const cutlass::Array<cutlass::float_e4m3_t, 16, false> &, const cutlass::Array<cutlass::float_e4m3_t, 8, false> &, const cutlass::Array<float, 4, true> &) const [with Operator_ = cutlass::arch::OpMultiplyAdd] not implemented
(RayWorkerWrapper pid=4329, ip=192.168.200.191) void cutlass::arch::Mma<cutlass::gemm::GemmShape<16, 8, 32>, 32,
[... truncated for brevity ...]
(RayWorkerWrapper pid=4329, ip=192.168.200.191) void cutlass::arch::Mma<cutlass::gemm::GemmShape<16, 8, 32>, 32, cutlass::float_e4m3_t, cutlass::layout::RowMajor, cutlass::float_e4m3_t, cutlass::layout::ColumnMajor, float, cutlass::layout::RowMajor, Operator_>::operator()(cutlass::Array<float, 4, true> &, const cutlass::Array<cutlass::float_e4m3_t, 16, false> &, const cutlass::Array<cutlass::float_e4m3_t, 8, false> &, const cutlass::Array<float, 4, true> &) const [with Operator_ = cutlass::arch::OpMultiplyAdd] not implemented
(RayWorkerWrapper pid=4329, ip=192.168.200.191) ERROR 08-02 01:57:45 worker_base.py:382] Error executing method determine_num_available_blocks. This might cause deadlock in distributed execution.
(RayWorkerWrapper pid=4329, ip=192.168.200.191) ERROR 08-02 01:57:45 worker_base.py:382] Traceback (most recent call last):
(RayWorkerWrapper pid=4329, ip=192.168.200.191) ERROR 08-02 01:57:45 worker_base.py:382] File "/usr/local/lib/python3.10/dist-packages/vllm/worker/worker_base.py", line 374, in execute_method
(RayWorkerWrapper pid=4329, ip=192.168.200.191) ERROR 08-02 01:57:45 worker_base.py:382] return executor(*args, **kwargs)
(RayWorkerWrapper pid=4329, ip=192.168.200.191) ERROR 08-02 01:57:45 worker_base.py:382] File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
(RayWorkerWrapper pid=4329, ip=192.168.200.191) ERROR 08-02 01:57:45 worker_base.py:382] return func(*args, **kwargs)
(RayWorkerWrapper pid=4329, ip=192.168.200.191) ERROR 08-02 01:57:45 worker_base.py:382] File "/usr/local/lib/python3.10/dist-packages/vllm/worker/worker.py", line 179, in determine_num_available_blocks
(RayWorkerWrapper pid=4329, ip=192.168.200.191) ERROR 08-02 01:57:45 worker_base.py:382] self.model_runner.profile_run()
(RayWorkerWrapper pid=4329, ip=192.168.200.191) ERROR 08-02 01:57:45 worker_base.py:382] File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
(RayWorkerWrapper pid=4329, ip=192.168.200.191) ERROR 08-02 01:57:45 worker_base.py:382] return func(*args, **kwargs)
(RayWorkerWrapper pid=4329, ip=192.168.200.191) ERROR 08-02 01:57:45 worker_base.py:382] File "/usr/local/lib/python3.10/dist-packages/vllm/worker/model_runner.py", line 896, in profile_run
(RayWorkerWrapper pid=4329, ip=192.168.200.191) ERROR 08-02 01:57:45 worker_base.py:382] self.execute_model(model_input, kv_caches, intermediate_tensors)
(RayWorkerWrapper pid=4329, ip=192.168.200.191) ERROR 08-02 01:57:45 worker_base.py:382] File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
(RayWorkerWrapper pid=4329, ip=192.168.200.191) ERROR 08-02 01:57:45 worker_base.py:382] return func(*args, **kwargs)
(RayWorkerWrapper pid=4329, ip=192.168.200.191) ERROR 08-02 01:57:45 worker_base.py:382] File "/usr/local/lib/python3.10/dist-packages/vllm/worker/model_runner.py", line 1314, in execute_model
(RayWorkerWrapper pid=4329, ip=192.168.200.191) ERROR 08-02 01:57:45 worker_base.py:382] hidden_or_intermediate_states = model_executable(
(RayWorkerWrapper pid=4329, ip=192.168.200.191) ERROR 08-02 01:57:45 worker_base.py:382] File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
(RayWorkerWrapper pid=4329, ip=192.168.200.191) ERROR 08-02 01:57:45 worker_base.py:382] return self._call_impl(*args, **kwargs)
(RayWorkerWrapper pid=4329, ip=192.168.200.191) ERROR 08-02 01:57:45 worker_base.py:382] File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1541, in _call_impl
(RayWorkerWrapper pid=4329, ip=192.168.200.191) ERROR 08-02 01:57:45 worker_base.py:382] return forward_call(*args, **kwargs)
(RayWorkerWrapper pid=4329, ip=192.168.200.191) ERROR 08-02 01:57:45 worker_base.py:382] File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/llama.py", line 422, in forward
(RayWorkerWrapper pid=4329, ip=192.168.200.191) ERROR 08-02 01:57:45 worker_base.py:382] model_output = self.model(input_ids, positions, kv_caches,
(RayWorkerWrapper pid=4329, ip=192.168.200.191) ERROR 08-02 01:57:45 worker_base.py:382] File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
(RayWorkerWrapper pid=4329, ip=192.168.200.191) ERROR 08-02 01:57:45 worker_base.py:382] return self._call_impl(*args, **kwargs)
(RayWorkerWrapper pid=4329, ip=192.168.200.191) ERROR 08-02 01:57:45 worker_base.py:382] File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1541, in _call_impl
(RayWorkerWrapper pid=4329, ip=192.168.200.191) ERROR 08-02 01:57:45 worker_base.py:382] return forward_call(*args, **kwargs)
(RayWorkerWrapper pid=4329, ip=192.168.200.191) ERROR 08-02 01:57:45 worker_base.py:382] File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/llama.py", line 322, in forward
(RayWorkerWrapper pid=4329, ip=192.168.200.191) ERROR 08-02 01:57:45 worker_base.py:382] hidden_states, residual = layer(
(RayWorkerWrapper pid=4329, ip=192.168.200.191) ERROR 08-02 01:57:45 worker_base.py:382] File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
(RayWorkerWrapper pid=4329, ip=192.168.200.191) ERROR 08-02 01:57:45 worker_base.py:382] return self._call_impl(*args, **kwargs)
(RayWorkerWrapper pid=4329, ip=192.168.200.191) ERROR 08-02 01:57:45 worker_base.py:382] File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1541, in _call_impl
(RayWorkerWrapper pid=4329, ip=192.168.200.191) ERROR 08-02 01:57:45 worker_base.py:382] return forward_call(*args, **kwargs)
(RayWorkerWrapper pid=4329, ip=192.168.200.191) ERROR 08-02 01:57:45 worker_base.py:382] File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/llama.py", line 255, in forward
(RayWorkerWrapper pid=4329, ip=192.168.200.191) ERROR 08-02 01:57:45 worker_base.py:382] hidden_states = self.mlp(hidden_states)
(RayWorkerWrapper pid=4329, ip=192.168.200.191) ERROR 08-02 01:57:45 worker_base.py:382] File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
(RayWorkerWrapper pid=4329, ip=192.168.200.191) ERROR 08-02 01:57:45 worker_base.py:382] return self._call_impl(*args, **kwargs)
(RayWorkerWrapper pid=4329, ip=192.168.200.191) ERROR 08-02 01:57:45 worker_base.py:382] File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1541, in _call_impl
(RayWorkerWrapper pid=4329, ip=192.168.200.191) ERROR 08-02 01:57:45 worker_base.py:382] return forward_call(*args, **kwargs)
(RayWorkerWrapper pid=4329, ip=192.168.200.191) ERROR 08-02 01:57:45 worker_base.py:382] File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/llama.py", line 89, in forward
(RayWorkerWrapper pid=4329, ip=192.168.200.191) ERROR 08-02 01:57:45 worker_base.py:382] x, _ = self.down_proj(x)
(RayWorkerWrapper pid=4329, ip=192.168.200.191) ERROR 08-02 01:57:45 worker_base.py:382] File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
(RayWorkerWrapper pid=4329, ip=192.168.200.191) ERROR 08-02 01:57:45 worker_base.py:382] return self._call_impl(*args, **kwargs)
(RayWorkerWrapper pid=4329, ip=192.168.200.191) ERROR 08-02 01:57:45 worker_base.py:382] File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1541, in _call_impl
(RayWorkerWrapper pid=4329, ip=192.168.200.191) ERROR 08-02 01:57:45 worker_base.py:382] return forward_call(*args, **kwargs)
(RayWorkerWrapper pid=4329, ip=192.168.200.191) ERROR 08-02 01:57:45 worker_base.py:382] File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/layers/linear.py", line 783, in forward
(RayWorkerWrapper pid=4329, ip=192.168.200.191) ERROR 08-02 01:57:45 worker_base.py:382] output_parallel = self.quant_method.apply(self,
(RayWorkerWrapper pid=4329, ip=192.168.200.191) ERROR 08-02 01:57:45 worker_base.py:382] File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/layers/quantization/fbgemm_fp8.py", line 175, in apply
(RayWorkerWrapper pid=4329, ip=192.168.200.191) ERROR 08-02 01:57:45 worker_base.py:382] return apply_fp8_linear(input=x,
(RayWorkerWrapper pid=4329, ip=192.168.200.191) ERROR 08-02 01:57:45 worker_base.py:382] File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/layers/quantization/utils/w8a8_utils.py", line 126, in apply_fp8_linear
(RayWorkerWrapper pid=4329, ip=192.168.200.191) ERROR 08-02 01:57:45 worker_base.py:382] return ops.cutlass_scaled_mm(qinput,
(RayWorkerWrapper pid=4329, ip=192.168.200.191) ERROR 08-02 01:57:45 worker_base.py:382] File "/usr/local/lib/python3.10/dist-packages/vllm/_custom_ops.py", line 34, in wrapper
(RayWorkerWrapper pid=4329, ip=192.168.200.191) ERROR 08-02 01:57:45 worker_base.py:382] return fn(*args, **kwargs)
(RayWorkerWrapper pid=4329, ip=192.168.200.191) ERROR 08-02 01:57:45 worker_base.py:382] File "/usr/local/lib/python3.10/dist-packages/vllm/_custom_ops.py", line 251, in cutlass_scaled_mm
(RayWorkerWrapper pid=4329, ip=192.168.200.191) ERROR 08-02 01:57:45 worker_base.py:382] torch.ops._C.cutlass_scaled_mm(out, a, b, scale_a, scale_b, bias)
(RayWorkerWrapper pid=4329, ip=192.168.200.191) ERROR 08-02 01:57:45 worker_base.py:382] File "/usr/local/lib/python3.10/dist-packages/torch/_ops.py", line 854, in __call__
(RayWorkerWrapper pid=4329, ip=192.168.200.191) ERROR 08-02 01:57:45 worker_base.py:382] return self_._op(*args, **(kwargs or {}))
(RayWorkerWrapper pid=4329, ip=192.168.200.191) ERROR 08-02 01:57:45 worker_base.py:382] RuntimeError: Error Internal
(RayWorkerWrapper pid=4329, ip=192.168.200.191) [rank13]:[E ProcessGroupNCCL.cpp:1414] [PG 2 Rank 13] Process group watchdog thread terminated with exception: CUDA error: unspecified launch failure
(RayWorkerWrapper pid=4329, ip=192.168.200.191) CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
(RayWorkerWrapper pid=4329, ip=192.168.200.191) For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
(RayWorkerWrapper pid=4329, ip=192.168.200.191) Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
(RayWorkerWrapper pid=4329, ip=192.168.200.191)
(RayWorkerWrapper pid=4329, ip=192.168.200.191) Exception raised from c10_cuda_check_implementation at ../c10/cuda/CUDAException.cpp:43 (most recent call first):
(RayWorkerWrapper pid=4329, ip=192.168.200.191) frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7fca946c8897 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
(RayWorkerWrapper pid=4329, ip=192.168.200.191) frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x64 (0x7fca94678b25 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
(RayWorkerWrapper pid=4329, ip=192.168.200.191) frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x118 (0x7fca947a0718 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10_cuda.so)
(RayWorkerWrapper pid=4329, ip=192.168.200.191) frame #3: c10d::ProcessGroupNCCL::WorkNCCL::finishedGPUExecutionInternal() const + 0x56 (0x7fa3f3db88e6 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
(RayWorkerWrapper pid=4329, ip=192.168.200.191) frame #4: c10d::ProcessGroupNCCL::WorkNCCL::isCompleted() + 0x58 (0x7fa3f3dbc9e8 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
(RayWorkerWrapper pid=4329, ip=192.168.200.191) frame #5: c10d::ProcessGroupNCCL::watchdogHandler() + 0x77c (0x7fa3f3dc205c in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
(RayWorkerWrapper pid=4329, ip=192.168.200.191) frame #6: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7fa3f3dc2dcc in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
(RayWorkerWrapper pid=4329, ip=192.168.200.191) frame #7: <unknown function> + 0xd6df4 (0x7fcada1cfdf4 in /usr/lib/x86_64-linux-gnu/libstdc++.so.6)
(RayWorkerWrapper pid=4329, ip=192.168.200.191) frame #8: <unknown function> + 0x8609 (0x7fcadc135609 in /usr/lib/x86_64-linux-gnu/libpthread.so.0)
(RayWorkerWrapper pid=4329, ip=192.168.200.191) frame #9: clone + 0x43 (0x7fcadc26f353 in /usr/lib/x86_64-linux-gnu/libc.so.6)
(RayWorkerWrapper pid=4329, ip=192.168.200.191)
(RayWorkerWrapper pid=4329, ip=192.168.200.191) [2024-08-02 01:57:45,159 E 4329 4697] logging.cc:108: Unhandled exception: N3c1016DistBackendErrorE. what(): [PG 2 Rank 13] Process group watchdog thread terminated with exception: CUDA error: unspecified launch failure
(RayWorkerWrapper pid=4329, ip=192.168.200.191) CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
(RayWorkerWrapper pid=4329, ip=192.168.200.191) For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
(RayWorkerWrapper pid=4329, ip=192.168.200.191) Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
(RayWorkerWrapper pid=4329, ip=192.168.200.191)
(RayWorkerWrapper pid=4329, ip=192.168.200.191) Exception raised from c10_cuda_check_implementation at ../c10/cuda/CUDAException.cpp:43 (most recent call first):
(RayWorkerWrapper pid=4329, ip=192.168.200.191) frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7fca946c8897 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
(RayWorkerWrapper pid=4329, ip=192.168.200.191) frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x64 (0x7fca94678b25 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
(RayWorkerWrapper pid=4329, ip=192.168.200.191) frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x118 (0x7fca947a0718 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10_cuda.so)
(RayWorkerWrapper pid=4329, ip=192.168.200.191) frame #3: c10d::ProcessGroupNCCL::WorkNCCL::finishedGPUExecutionInternal() const + 0x56 (0x7fa3f3db88e6 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
(RayWorkerWrapper pid=4329, ip=192.168.200.191) frame #4: c10d::ProcessGroupNCCL::WorkNCCL::isCompleted() + 0x58 (0x7fa3f3dbc9e8 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
(RayWorkerWrapper pid=4329, ip=192.168.200.191) frame #5: c10d::ProcessGroupNCCL::watchdogHandler() + 0x77c (0x7fa3f3dc205c in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
(RayWorkerWrapper pid=4329, ip=192.168.200.191) frame #6: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7fa3f3dc2dcc in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
(RayWorkerWrapper pid=4329, ip=192.168.200.191) frame #7: <unknown function> + 0xd6df4 (0x7fcada1cfdf4 in /usr/lib/x86_64-linux-gnu/libstdc++.so.6)
(RayWorkerWrapper pid=4329, ip=192.168.200.191) frame #8: <unknown function> + 0x8609 (0x7fcadc135609 in /usr/lib/x86_64-linux-gnu/libpthread.so.0)
(RayWorkerWrapper pid=4329, ip=192.168.200.191) frame #9: clone + 0x43 (0x7fcadc26f353 in /usr/lib/x86_64-linux-gnu/libc.so.6)
(RayWorkerWrapper pid=4329, ip=192.168.200.191)
(RayWorkerWrapper pid=4329, ip=192.168.200.191) Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1418 (most recent call first):
(RayWorkerWrapper pid=4329, ip=192.168.200.191) frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7fca946c8897 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
(RayWorkerWrapper pid=4329, ip=192.168.200.191) frame #1: <unknown function> + 0xe32119 (0x7fa3f3a46119 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
(RayWorkerWrapper pid=4329, ip=192.168.200.191) frame #2: <unknown function> + 0xd6df4 (0x7fcada1cfdf4 in /usr/lib/x86_64-linux-gnu/libstdc++.so.6)
(RayWorkerWrapper pid=4329, ip=192.168.200.191) frame #3: <unknown function> + 0x8609 (0x7fcadc135609 in /usr/lib/x86_64-linux-gnu/libpthread.so.0)
(RayWorkerWrapper pid=4329, ip=192.168.200.191) frame #4: clone + 0x43 (0x7fcadc26f353 in /usr/lib/x86_64-linux-gnu/libc.so.6)
(RayWorkerWrapper pid=4329, ip=192.168.200.191)
(RayWorkerWrapper pid=4507, ip=192.168.200.191)
(RayWorkerWrapper pid=4507, ip=192.168.200.191)
(RayWorkerWrapper pid=4507, ip=192.168.200.191)
(RayWorkerWrapper pid=4507, ip=192.168.200.191)
(RayWorkerWrapper pid=4507, ip=192.168.200.191)
(RayWorkerWrapper pid=4507, ip=192.168.200.191) [2024-08-02 01:57:45,114 E 4507 4689] logging.cc:115: Stack trace:
(RayWorkerWrapper pid=4507, ip=192.168.200.191) /usr/local/lib/python3.10/dist-packages/ray/_raylet.so(+0x109aaea) [0x7f9f99cedaea] ray::operator<<()
(RayWorkerWrapper pid=4507, ip=192.168.200.191) /usr/local/lib/python3.10/dist-packages/ray/_raylet.so(+0x109dd72) [0x7f9f99cf0d72] ray::TerminateHandler()
(RayWorkerWrapper pid=4507, ip=192.168.200.191) /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xaa37c) [0x7f9f98b1b37c]
(RayWorkerWrapper pid=4507, ip=192.168.200.191) /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xaa3e7) [0x7f9f98b1b3e7]
(RayWorkerWrapper pid=4507, ip=192.168.200.191) /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xaa36f) [0x7f9f98b1b36f]
(RayWorkerWrapper pid=4507, ip=192.168.200.191) /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so(+0xe321ca) [0x7f78b3a461ca] c10d::ProcessGroupNCCL::ncclCommWatchdog()
(RayWorkerWrapper pid=4507, ip=192.168.200.191) /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xd6df4) [0x7f9f98b47df4]
(RayWorkerWrapper pid=4507, ip=192.168.200.191) /usr/lib/x86_64-linux-gnu/libpthread.so.0(+0x8609) [0x7f9f9aaad609] start_thread
(RayWorkerWrapper pid=4507, ip=192.168.200.191) /usr/lib/x86_64-linux-gnu/libc.so.6(clone+0x43) [0x7f9f9abe7353] __clone
(RayWorkerWrapper pid=4507, ip=192.168.200.191)
(RayWorkerWrapper pid=4507, ip=192.168.200.191) *** SIGABRT received at time=1722563865 on cpu 90 ***
(RayWorkerWrapper pid=4507, ip=192.168.200.191) PC: @ 0x7f9f9ab0b00b (unknown) raise
(RayWorkerWrapper pid=4507, ip=192.168.200.191) @ 0x7f9f9ab0b090 3216 (unknown)
(RayWorkerWrapper pid=4507, ip=192.168.200.191) @ 0x7f9f98b1b37c (unknown) (unknown)
(RayWorkerWrapper pid=4507, ip=192.168.200.191) @ 0x7f9f98b1b090 (unknown) (unknown)
(RayWorkerWrapper pid=4507, ip=192.168.200.191) [2024-08-02 01:57:45,115 E 4507 4689] logging.cc:440: *** SIGABRT received at time=1722563865 on cpu 90 ***
(RayWorkerWrapper pid=4507, ip=192.168.200.191) [2024-08-02 01:57:45,115 E 4507 4689] logging.cc:440: PC: @ 0x7f9f9ab0b00b (unknown) raise
(RayWorkerWrapper pid=4507, ip=192.168.200.191) [2024-08-02 01:57:45,115 E 4507 4689] logging.cc:440: @ 0x7f9f9ab0b090 3216 (unknown)
(RayWorkerWrapper pid=4507, ip=192.168.200.191) [2024-08-02 01:57:45,115 E 4507 4689] logging.cc:440: @ 0x7f9f98b1b37c (unknown) (unknown)
(RayWorkerWrapper pid=4507, ip=192.168.200.191) [2024-08-02 01:57:45,115 E 4507 4689] logging.cc:440: @ 0x7f9f98b1b090 (unknown) (unknown)
(RayWorkerWrapper pid=4507, ip=192.168.200.191) Fatal Python error: Aborted
(RayWorkerWrapper pid=4507, ip=192.168.200.191)
(RayWorkerWrapper pid=4507, ip=192.168.200.191)
(RayWorkerWrapper pid=4507, ip=192.168.200.191) Extension modules: msgpack._cmsgpack, google._upb._message, psutil._psutil_linux, psutil._psutil_posix, setproctitle, yaml._yaml, charset_normalizer.md, uvloop.loop, ray._raylet, numpy.core._multiarray_umath, numpy.core._multiarray_tests, numpy.linalg._umath_linalg, numpy.fft._pocketfft_internal, numpy.random._common, numpy.random.bit_generator, numpy.random._bounded_integers, numpy.random._mt19937, numpy.random.mtrand, numpy.random._philox, numpy.random._pcg64, numpy.random._sfc64, numpy.random._generator, torch._C, torch._C._fft, torch._C._linalg, torch._C._nested, torch._C._nn, torch._C._sparse, torch._C._special, sentencepiece._sentencepiece, multidict._multidict, yarl._quoting_c, aiohttp._helpers, aiohttp._http_writer, aiohttp._http_parser, aiohttp._websocket, frozenlist._frozenlist, pyarrow.lib, pyarrow._json, PIL._imaging, zmq.backend.cython._zmq (total: 41)
(RayWorkerWrapper pid=4329, ip=192.168.200.191) /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xaa37c) [0x7fcada1a337c]
(RayWorkerWrapper pid=4329, ip=192.168.200.191) /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xaa3e7) [0x7fcada1a33e7]
(RayWorkerWrapper pid=4329, ip=192.168.200.191) /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xaa36f) [0x7fcada1a336f]
(RayWorkerWrapper pid=4329, ip=192.168.200.191) /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so(+0xe321ca) [0x7fa3f3a461ca] c10d::ProcessGroupNCCL::ncclCommWatchdog()
(RayWorkerWrapper pid=4329, ip=192.168.200.191) /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xd6df4) [0x7fcada1cfdf4]
(RayWorkerWrapper pid=4329, ip=192.168.200.191)
(RayWorkerWrapper pid=4329, ip=192.168.200.191)
(RayWorkerWrapper pid=4329, ip=192.168.200.191)
(RayWorkerWrapper pid=16311)
(RayWorkerWrapper pid=16311)
(RayWorkerWrapper pid=16311)
(RayWorkerWrapper pid=16311)
(RayWorkerWrapper pid=16311)
(RayWorkerWrapper pid=16489)
(RayWorkerWrapper pid=16489)
(RayWorkerWrapper pid=16489)
(RayWorkerWrapper pid=16489)
(RayWorkerWrapper pid=16489)
(RayWorkerWrapper pid=16489) /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xaa37c) [0x7f64cac0237c]
(RayWorkerWrapper pid=16489) /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xaa3e7) [0x7f64cac023e7]
(RayWorkerWrapper pid=16489) /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xaa36f) [0x7f64cac0236f]
(RayWorkerWrapper pid=16489) /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so(+0xe321ca) [0x7f3df3ee61ca] c10d::ProcessGroupNCCL::ncclCommWatchdog()
(RayWorkerWrapper pid=16489) /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xd6df4) [0x7f64cac2edf4]
(RayWorkerWrapper pid=16489)
(RayWorkerWrapper pid=16489)
(RayWorkerWrapper pid=16489)
(RayWorkerWrapper pid=16311) /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xaa37c) [0x7f5ebddff37c]
(RayWorkerWrapper pid=16311) /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xaa3e7) [0x7f5ebddff3e7]
(RayWorkerWrapper pid=16311) /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xaa36f) [0x7f5ebddff36f]
(RayWorkerWrapper pid=16311) /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so(+0xe321ca) [0x7f37e3ee61ca] c10d::ProcessGroupNCCL::ncclCommWatchdog()
(RayWorkerWrapper pid=16311) /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xd6df4) [0x7f5ebde2bdf4]
(RayWorkerWrapper pid=16311)
(RayWorkerWrapper pid=16311)
(RayWorkerWrapper pid=16311)
(RayWorkerWrapper pid=3974, ip=192.168.200.191)
(RayWorkerWrapper pid=3974, ip=192.168.200.191)
(RayWorkerWrapper pid=3974, ip=192.168.200.191)
(RayWorkerWrapper pid=3974, ip=192.168.200.191)
(RayWorkerWrapper pid=3974, ip=192.168.200.191)
void cutlass::arch::Mma<cutlass::gemm::GemmShape<16, 8, 32>, 32, cutlass::float_e4m3_t, cutlass::layout::RowMajor, cutlass::float_e4m3_t, cutlass::layout::ColumnMajor, float, cutlass::layout::RowMajor, Operator_>::operator()(cutlass::Array<float, 4, true> &, const cutlass::Array<cutlass::float_e4m3_t, 16, false> &, const cutlass::Array<cutlass::float_e4m3_t, 8, false> &, const cutlass::Array<float, 4, true> &) const [with Operator_ = cutlass::arch::OpMultiplyAdd] not implemented
[... truncated for brevity ...]
void cutlass::arch::Mma<cutlass::gemm::GemmShape<16, 8, 32>, 32, cutlass::float_e4m3_t, cutlass::layout::RowMajor, cutlass::float_e4m3_t, cutlass::layout::ColumnMajor, float, cutlass::layout::RowMajor, Operator_>::operator()(cutlass::Array<float, 4, true> &, const cutlass::Array<cutlass::float_e4m3_t, 16, false> &, const cutlass::Array<cutlass::float_e4m3_t, 8, false> &, const cutlass::Array<float, 4, true> &) const [with Operator_ = cutlass::arch::OpMultiplyAdd] not implemented
void cutlass::arch::Mma<cutlass::gemm::GemmShape<16, 8, 32>, 32, cutlass::float_e4m3_t, cutlass::layout::RowMajor, cutlass::float_e4m3_t, cutlass::layout::ColumnMajor, float, cutlass::layout::RowMajor, Operator_>::operator()(cutlass::Array<float, 4, true> &, const cutlass::Array<cutlass::float_e4m3_t, 16, false> &, const cutlass::Array<cutlass::float_e4m3_t, 8, false> &, const cutlass::Array<float, 4, true> &) const [with Operator_ = cutlass::arch::OpMultiplyAdd] not implemented
[rank0]:[E ProcessGroupNCCL.cpp:1414] [PG 2 Rank 0] Process group watchdog thread terminated with exception: CUDA error: unspecified launch failure
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
Exception raised from c10_cuda_check_implementation at ../c10/cuda/CUDAException.cpp:43 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f4842abb897 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x64 (0x7f4842a6bb25 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x118 (0x7f4842b93718 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10_cuda.so)
frame #3: c10d::ProcessGroupNCCL::WorkNCCL::finishedGPUExecutionInternal() const + 0x56 (0x7f4843d908e6 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #4: c10d::ProcessGroupNCCL::WorkNCCL::isCompleted() + 0x58 (0x7f4843d949e8 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #5: c10d::ProcessGroupNCCL::watchdogHandler() + 0x77c (0x7f4843d9a05c in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #6: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f4843d9adcc in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #7: <unknown function> + 0xd6df4 (0x7f488f851df4 in /usr/lib/x86_64-linux-gnu/libstdc++.so.6)
frame #8: <unknown function> + 0x8609 (0x7f4890913609 in /usr/lib/x86_64-linux-gnu/libpthread.so.0)
frame #9: clone + 0x43 (0x7f4890a4d353 in /usr/lib/x86_64-linux-gnu/libc.so.6)
[2024-08-02 01:57:45,520 E 16097 17038] logging.cc:108: Unhandled exception: N3c1016DistBackendErrorE. what(): [PG 2 Rank 0] Process group watchdog thread terminated with exception: CUDA error: unspecified launch failure
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
Exception raised from c10_cuda_check_implementation at ../c10/cuda/CUDAException.cpp:43 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f4842abb897 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x64 (0x7f4842a6bb25 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x118 (0x7f4842b93718 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10_cuda.so)
frame #3: c10d::ProcessGroupNCCL::WorkNCCL::finishedGPUExecutionInternal() const + 0x56 (0x7f4843d908e6 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #4: c10d::ProcessGroupNCCL::WorkNCCL::isCompleted() + 0x58 (0x7f4843d949e8 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #5: c10d::ProcessGroupNCCL::watchdogHandler() + 0x77c (0x7f4843d9a05c in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #6: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f4843d9adcc in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #7: <unknown function> + 0xd6df4 (0x7f488f851df4 in /usr/lib/x86_64-linux-gnu/libstdc++.so.6)
frame #8: <unknown function> + 0x8609 (0x7f4890913609 in /usr/lib/x86_64-linux-gnu/libpthread.so.0)
frame #9: clone + 0x43 (0x7f4890a4d353 in /usr/lib/x86_64-linux-gnu/libc.so.6)
Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1418 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f4842abb897 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0xe32119 (0x7f4843a1e119 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #2: <unknown function> + 0xd6df4 (0x7f488f851df4 in /usr/lib/x86_64-linux-gnu/libstdc++.so.6)
frame #3: <unknown function> + 0x8609 (0x7f4890913609 in /usr/lib/x86_64-linux-gnu/libpthread.so.0)
frame #4: clone + 0x43 (0x7f4890a4d353 in /usr/lib/x86_64-linux-gnu/libc.so.6)
(RayWorkerWrapper pid=4062, ip=192.168.200.191)
(RayWorkerWrapper pid=4062, ip=192.168.200.191)
(RayWorkerWrapper pid=4062, ip=192.168.200.191)
(RayWorkerWrapper pid=4062, ip=192.168.200.191)
(RayWorkerWrapper pid=4062, ip=192.168.200.191)
(RayWorkerWrapper pid=4062, ip=192.168.200.191) /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xaa37c) [0x7fefedeb737c]
(RayWorkerWrapper pid=4062, ip=192.168.200.191) /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xaa3e7) [0x7fefedeb73e7]
(RayWorkerWrapper pid=4062, ip=192.168.200.191) /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xaa36f) [0x7fefedeb736f]
(RayWorkerWrapper pid=4062, ip=192.168.200.191) /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so(+0xe321ca) [0x7fc903a461ca] c10d::ProcessGroupNCCL::ncclCommWatchdog()
(RayWorkerWrapper pid=4062, ip=192.168.200.191) /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xd6df4) [0x7fefedee3df4]
(RayWorkerWrapper pid=4062, ip=192.168.200.191)
(RayWorkerWrapper pid=4062, ip=192.168.200.191)
(RayWorkerWrapper pid=4062, ip=192.168.200.191)
(RayWorkerWrapper pid=4151, ip=192.168.200.191)
(RayWorkerWrapper pid=4151, ip=192.168.200.191)
(RayWorkerWrapper pid=4151, ip=192.168.200.191)
(RayWorkerWrapper pid=4151, ip=192.168.200.191)
(RayWorkerWrapper pid=4151, ip=192.168.200.191)
(RayWorkerWrapper pid=4151, ip=192.168.200.191) /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xaa37c) [0x7f9572b4737c]
(RayWorkerWrapper pid=4151, ip=192.168.200.191) /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xaa3e7) [0x7f9572b473e7]
(RayWorkerWrapper pid=4151, ip=192.168.200.191) /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xaa36f) [0x7f9572b4736f]
(RayWorkerWrapper pid=4151, ip=192.168.200.191) /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so(+0xe321ca) [0x7f6e8fa461ca] c10d::ProcessGroupNCCL::ncclCommWatchdog()
(RayWorkerWrapper pid=4151, ip=192.168.200.191) /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xd6df4) [0x7f9572b73df4]
(RayWorkerWrapper pid=4151, ip=192.168.200.191)
(RayWorkerWrapper pid=4151, ip=192.168.200.191)
(RayWorkerWrapper pid=4151, ip=192.168.200.191)
(RayWorkerWrapper pid=4240, ip=192.168.200.191)
(RayWorkerWrapper pid=4240, ip=192.168.200.191)
(RayWorkerWrapper pid=4240, ip=192.168.200.191)
(RayWorkerWrapper pid=4240, ip=192.168.200.191)
(RayWorkerWrapper pid=4240, ip=192.168.200.191)
(RayWorkerWrapper pid=4240, ip=192.168.200.191) /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xaa37c) [0x7f177f31437c]
(RayWorkerWrapper pid=4240, ip=192.168.200.191) /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xaa3e7) [0x7f177f3143e7]
(RayWorkerWrapper pid=4240, ip=192.168.200.191) /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xaa36f) [0x7f177f31436f]
(RayWorkerWrapper pid=4240, ip=192.168.200.191) /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so(+0xe321ca) [0x7ef097a461ca] c10d::ProcessGroupNCCL::ncclCommWatchdog()
(RayWorkerWrapper pid=4240, ip=192.168.200.191) /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xd6df4) [0x7f177f340df4]
(RayWorkerWrapper pid=4240, ip=192.168.200.191)
(RayWorkerWrapper pid=4240, ip=192.168.200.191)
(RayWorkerWrapper pid=4240, ip=192.168.200.191)
(RayWorkerWrapper pid=16756)
(RayWorkerWrapper pid=16756)
(RayWorkerWrapper pid=16756)
(RayWorkerWrapper pid=16756)
(RayWorkerWrapper pid=16756)
[2024-08-02 01:57:45,563 E 16097 17038] logging.cc:115: Stack trace:
/usr/local/lib/python3.10/dist-packages/ray/_raylet.so(+0x109aaea) [0x7f4726eb3aea] ray::operator<<()
/usr/local/lib/python3.10/dist-packages/ray/_raylet.so(+0x109dd72) [0x7f4726eb6d72] ray::TerminateHandler()
/usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xaa37c) [0x7f488f82537c]
/usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xaa3e7) [0x7f488f8253e7]
/usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xaa36f) [0x7f488f82536f]
/usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so(+0xe321ca) [0x7f4843a1e1ca] c10d::ProcessGroupNCCL::ncclCommWatchdog()
/usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xd6df4) [0x7f488f851df4]
/usr/lib/x86_64-linux-gnu/libpthread.so.0(+0x8609) [0x7f4890913609] start_thread
/usr/lib/x86_64-linux-gnu/libc.so.6(clone+0x43) [0x7f4890a4d353] __clone
*** SIGABRT received at time=1722563865 on cpu 100 ***
PC: @ 0x7f489097100b (unknown) raise
@ 0x7f4890971090 3216 (unknown)
@ 0x7f488f82537c (unknown) (unknown)
@ 0x7f488f825090 (unknown) (unknown)
[2024-08-02 01:57:45,566 E 16097 17038] logging.cc:440: *** SIGABRT received at time=1722563865 on cpu 100 ***
[2024-08-02 01:57:45,566 E 16097 17038] logging.cc:440: PC: @ 0x7f489097100b (unknown) raise
[2024-08-02 01:57:45,566 E 16097 17038] logging.cc:440: @ 0x7f4890971090 3216 (unknown)
[2024-08-02 01:57:45,567 E 16097 17038] logging.cc:440: @ 0x7f488f82537c (unknown) (unknown)
[2024-08-02 01:57:45,568 E 16097 17038] logging.cc:440: @ 0x7f488f825090 (unknown) (unknown)
Fatal Python error: Aborted
Extension modules: numpy.core._multiarray_umath, numpy.core._multiarray_tests, numpy.linalg._umath_linalg, numpy.fft._pocketfft_internal, numpy.random._common, numpy.random.bit_generator, numpy.random._bounded_integers, numpy.random._mt19937, numpy.random.mtrand, numpy.random._philox, numpy.random._pcg64, numpy.random._sfc64, numpy.random._generator, torch._C, torch._C._fft, torch._C._linalg, torch._C._nested, torch._C._nn, torch._C._sparse, torch._C._special, charset_normalizer.md, yaml._yaml, psutil._psutil_linux, psutil._psutil_posix, sentencepiece._sentencepiece, msgpack._cmsgpack, google._upb._message, setproctitle, uvloop.loop, ray._raylet, multidict._multidict, yarl._quoting_c, aiohttp._helpers, aiohttp._http_writer, aiohttp._http_parser, aiohttp._websocket, frozenlist._frozenlist, PIL._imaging, regex._regex, numba.core.typeconv._typeconv, numba._helperlib, numba._dynfunc, numba._dispatcher, numba.core.runtime._nrt_python, numba.np.ufunc._internal, numba.experimental.jitclass._box, pyarrow.lib, pandas._libs.tslibs.ccalendar, pandas._libs.tslibs.np_datetime, pandas._libs.tslibs.dtypes, pandas._libs.tslibs.base, pandas._libs.tslibs.nattype, pandas._libs.tslibs.timezones, pandas._libs.tslibs.fields, pandas._libs.tslibs.timedeltas, pandas._libs.tslibs.tzconversion, pandas._libs.tslibs.timestamps, pandas._libs.properties, pandas._libs.tslibs.offsets, pandas._libs.tslibs.strptime, pandas._libs.tslibs.parsing, pandas._libs.tslibs.conversion, pandas._libs.tslibs.period, pandas._libs.tslibs.vectorized, pandas._libs.ops_dispatch, pandas._libs.missing, pandas._libs.hashtable, pandas._libs.algos, pandas._libs.interval, pandas._libs.lib, pyarrow._compute, pandas._libs.ops, pandas._libs.hashing, pandas._libs.arrays, pandas._libs.tslib, pandas._libs.sparse, pandas._libs.internals, pandas._libs.indexing, pandas._libs.index, pandas._libs.writers, pandas._libs.join, pandas._libs.window.aggregations, pandas._libs.window.indexers, pandas._libs.reshape, pandas._libs.groupby, pandas._libs.json, pandas._libs.parsers, pandas._libs.testing, pyarrow._parquet, pyarrow._fs, pyarrow._azurefs, pyarrow._hdfs, pyarrow._gcsfs, pyarrow._s3fs, xxhash._xxhash, pyarrow._json, markupsafe._speedups(RayWorkerWrapper pid=4418, ip=192.168.200.191)
, zmq.backend.cython._zmq (total: 98)
/usr/lib/python3.10/multiprocessing/resource_tracker.py:224: UserWarning: resource_tracker: There appear to be 1 leaked shared_memory objects to clean up at shutdown
warnings.warn('resource_tracker: There appear to be %d '
Aborted (core dumped)
4 answers
Answer 1:
Running with -tp 8 -pp 2 gives a similar result.
Answer 2:
void cutlass::arch::Mma<cutlass::gemm::GemmShape<16, 8, 32>, 32, cutlass::float_e4m3_t, cutlass::layout::RowMajor, cutlass::float_e4m3_t, cutlass::layout::ColumnMajor, float, cutlass::layout::RowMajor, Operator_>::operator()(cutlass::Array<float, 4, true> &, const cutlass::Array<cutlass::float_e4m3_t, 16, false> &, const cutlass::Array<cutlass::float_e4m3_t, 8, false> &, const cutlass::Array<float, 4, true> &) const [with Operator_ = cutlass::arch::OpMultiplyAdd] not implemented
This is probably the root cause. cc @robertgshaw2-neuralmagic @mgoin
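One way to confirm this on your cluster: the FP8 (float_e4m3) CUTLASS kernels behind ops.cutlass_scaled_mm only exist for GPUs with native FP8 tensor cores (roughly compute capability 8.9 for Ada, 9.0 for Hopper), so on older hardware the Mma specialization in the message above is simply not compiled. A minimal check, assuming PyTorch is importable on each node; the sm_89 threshold is a heuristic of mine, not a vLLM API:

import torch

# Print model and compute capability for every visible GPU on this node.
for i in range(torch.cuda.device_count()):
    name = torch.cuda.get_device_name(i)
    major, minor = torch.cuda.get_device_capability(i)
    fp8_native = (major, minor) >= (8, 9)  # assumption: Ada (sm_89) / Hopper (sm_90) have FP8 MMA
    print(f"GPU {i}: {name}, sm_{major}{minor}, native FP8 MMA: {fp8_native}")

If this reports sm_80 (A100) or older on either node, that would be consistent with the "not implemented" message above.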
Answer 3:
You should share the host environment, for example which GPUs you are using.
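If it helps, here is a small, hypothetical probe (not part of vLLM) that uses the already-running Ray cluster to report hostname, GPU model, and compute capability for each of the 16 GPUs, assuming ray and torch are importable inside the same container on both nodes:

import socket
import ray
import torch

ray.init(address="auto")  # attach to the existing Ray cluster

@ray.remote(num_gpus=1)
def gpu_info():
    # Each task reserves one GPU; Ray sets CUDA_VISIBLE_DEVICES, so index 0 is that GPU.
    major, minor = torch.cuda.get_device_capability(0)
    return socket.gethostname(), torch.cuda.get_device_name(0), f"sm_{major}{minor}"

# One probe per GPU in the cluster (16 per the ray status output above).
results = ray.get([gpu_info.remote() for _ in range(16)])
for host, name, cap in sorted(set(results)):
    print(host, name, cap)

The collect_env.py script from the vLLM repository also gives a fuller report (driver, NCCL, and PyTorch versions) for the node it runs on.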
Answer 4: