**Overview:** I am trying to install Slurm on a single-node compute workstation to allow organized job submission among its users. The main use case is running self-written parallel programs (here: `binary`), which run successfully when executed with `mpirun -np n binary`.
Versions:
- OS: Ubuntu 20.04.3 (up to date)
- Slurm: 19.05.5-1 (via apt)
- OpenMPI: 4.0.3 (via apt)
**Slurm installation/configuration:** I installed and configured Slurm following the instructions here, and my `slurm.conf` file is attached below. Afterwards Slurm appears to work; for example, `sinfo` shows:
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
debug* up infinite 1 idle nodename
and `srun -n 2 echo "Look at me"` produces:
Look at me
Look at me
and, while `srun -n8 sleep 10` is running, `squeue` shows:
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
44 debug sleep username R 0:04 1 nodename
**Error:** I try to run the program `binary` with a shell script `runslurm.sh` via `sbatch runslurm.sh`. Here is the shell script:
#!/bin/bash
#SBATCH -J testjob
#SBATCH -e error.%A
#SBATCH -o output.%A
#SBATCH -N 1
#SBATCH -n 2
srun binary
The job finishes immediately and only writes an error file containing:
*** An error occurred in MPI_Init
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
*** and potentially your MPI job)
[nodename:470173] Local abort before MPI_INIT completed completed successfully, but am not able to aggregate error messages, and not able to guarantee that all other processes were killed!
*** An error occurred in MPI_Init
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
*** and potentially your MPI job)
[nodename:470174] Local abort before MPI_INIT completed completed successfully, but am not able to aggregate error messages, and not able to guarantee that all other processes were killed!
srun: error: nodename: tasks 0-1: Exited with exit code 1
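A side note on this failure pattern: it typically means the tasks were launched without any PMI wire-up, so every rank aborts in `MPI_Init` as if it were a lone process. One possible workaround (a sketch only, not something tried in this post) is to let `mpirun` launch the ranks inside the allocation instead of `srun`, since the `ompi_info` output further below shows OpenMPI was built with Slurm support:

#!/bin/bash
#SBATCH -J testjob
#SBATCH -e error.%A
#SBATCH -o output.%A
#SBATCH -N 1
#SBATCH -n 2
# With OpenMPI's Slurm integration, mpirun reads the allocation size itself
# and starts one rank per allocated task (here: 2), no -np needed.
mpirun binary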
**Other attempts:** I have already tried all available MPI types in the `srun` command of the shell script (i.e. `srun --mpi=XXX binary`, with `XXX` taken from `srun --mpi=list`):
srun: MPI types are...
srun: pmix_v3
srun: pmix
srun: openmpi
srun: none
srun: pmi2
With `--mpi=openmpi`, `--mpi=none`, or `--mpi=pmi2`, the error is the same as above.
With `--mpi=pmix_v3` or `--mpi=pmix`, the error is similar in both cases:
srun: error: (null) [0] /mpi_pmix.c:133 [init] mpi/pmix: ERROR: pmi/pmix: can not load PMIx library
srun: error: Couldn't load specified plugin name for mpi/pmix_v3: Plugin init() callback failed
srun: error: cannot create mpi context for mpi/pmix_v3
srun: error: invalid MPI type 'pmix_v3', --mpi=list for acceptable types
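The "can not load PMIx library" message points at the PMIx runtime itself rather than at Slurm's configuration. A few checks that could narrow this down (an addition for illustration, not output from the original post; the plugin location is a guess based on the stock Ubuntu packages):

# Which PMIx/OpenMPI packages are installed at all?
dpkg -l | grep -E 'pmix|openmpi'
# Which libpmix shared libraries can the dynamic linker find?
ldconfig -p | grep pmix
# Where is Slurm's pmix plugin, and which libpmix does it link against?
find /usr/lib -name 'mpi_pmix*.so' -exec ldd {} \;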
**Other information:** Note that OpenMPI appears to have been built against Slurm (I can paste the full `ompi_info` output if needed):
ompi_info | grep slurm
Configure command line: ...'--with-slurm'...
MCA ess: slurm (MCA v2.1.0, API v3.0.0, Component v4.0.3)
MCA plm: slurm (MCA v2.1.0, API v2.0.0, Component v4.0.3)
MCA ras: slurm (MCA v2.1.0, API v2.0.0, Component v4.0.3)
MCA schizo: slurm (MCA v2.1.0, API v1.0.0, Component v4.0.3)
and against pmix:
ompi_info | grep pmix
Configure command line: ...'--with-pmix=/usr/lib/x86_64-linux-gnu/pmix'...
MCA pmix: ext3x (MCA v2.1.0, API v2.0.0, Component v4.0.3)
MCA pmix: flux (MCA v2.1.0, API v2.0.0, Component v4.0.3)
MCA pmix: isolated (MCA v2.1.0, API v2.0.0, Component v4.0.3)
Here is the `slurm.conf`:
# slurm.conf file generated by configurator.html.
# Put this file on all nodes of your cluster.
# See the slurm.conf man page for more information.
#
SlurmctldHost=nodename
#SlurmctldHost=
#
#DisableRootJobs=NO
#EnforcePartLimits=NO
#Epilog=
#EpilogSlurmctld=
#FirstJobId=1
#MaxJobId=999999
#GresTypes=
#GroupUpdateForce=0
#GroupUpdateTime=600
#JobFileAppend=0
#JobRequeue=1
#JobSubmitPlugins=1
#KillOnBadExit=0
#LaunchType=launch/slurm
#Licenses=foo*4,bar
#MailProg=/bin/mail
#MaxJobCount=5000
#MaxStepCount=40000
#MaxTasksPerNode=128
MpiDefault=none
#MpiParams=ports=#-#
#PluginDir=
#PlugStackConfig=
#PrivateData=jobs
ProctrackType=proctrack/cgroup
#Prolog=
#PrologFlags=
#PrologSlurmctld=
#PropagatePrioProcess=0
#PropagateResourceLimits=
#PropagateResourceLimitsExcept=
#RebootProgram=
ReturnToService=1
#SallocDefaultCommand=
SlurmctldPidFile=/run/slurmctld.pid
SlurmctldPort=6817
SlurmdPidFile=/run/slurmd.pid
SlurmdPort=6818
SlurmdSpoolDir=/var/spool/slurmd
SlurmUser=root
#SlurmdUser=root
#SrunEpilog=
#SrunProlog=
StateSaveLocation=/var/spool
SwitchType=switch/none
#TaskEpilog=
TaskPlugin=task/affinity
#TaskProlog=
#TopologyPlugin=topology/tree
#TmpFS=/tmp
#TrackWCKey=no
#TreeWidth=
#UnkillableStepProgram=
#UsePAM=0
#
#
# TIMERS
#BatchStartTimeout=10
#CompleteWait=0
#EpilogMsgTime=2000
#GetEnvTimeout=2
#HealthCheckInterval=0
#HealthCheckProgram=
InactiveLimit=0
KillWait=30
#MessageTimeout=10
#ResvOverRun=0
MinJobAge=300
#OverTimeLimit=0
SlurmctldTimeout=120
SlurmdTimeout=300
#UnkillableStepTimeout=60
#VSizeFactor=0
Waittime=0
#
#
# SCHEDULING
#DefMemPerCPU=0
#MaxMemPerCPU=0
#SchedulerTimeSlice=30
SchedulerType=sched/backfill
SelectType=select/cons_tres
SelectTypeParameters=CR_Core
#
#
# JOB PRIORITY
#PriorityFlags=
#PriorityType=priority/basic
#PriorityDecayHalfLife=
#PriorityCalcPeriod=
#PriorityFavorSmall=
#PriorityMaxAge=
#PriorityUsageResetPeriod=
#PriorityWeightAge=
#PriorityWeightFairshare=
#PriorityWeightJobSize=
#PriorityWeightPartition=
#PriorityWeightQOS=
#
#
# LOGGING AND ACCOUNTING
#AccountingStorageEnforce=0
#AccountingStorageHost=
#AccountingStorageLoc=
#AccountingStoragePass=
#AccountingStoragePort=
AccountingStorageType=accounting_storage/none
#AccountingStorageUser=
AccountingStoreJobComment=YES
ClusterName=cluster
#DebugFlags=
#JobCompHost=
#JobCompLoc=
#JobCompPass=
#JobCompPort=
JobCompType=jobcomp/none
#JobCompUser=
#JobContainerType=job_container/none
JobAcctGatherFrequency=30
JobAcctGatherType=jobacct_gather/none
SlurmctldDebug=info
#SlurmctldLogFile=
SlurmdDebug=info
#SlurmdLogFile=
#SlurmSchedLogFile=
#SlurmSchedLogLevel=
#
#
# POWER SAVE SUPPORT FOR IDLE NODES (optional)
#SuspendProgram=
#ResumeProgram=
#SuspendTimeout=
#ResumeTimeout=
#ResumeRate=
#SuspendExcNodes=
#SuspendExcParts=
#SuspendRate=
#SuspendTime=
#
#
# COMPUTE NODES
NodeName=nodename Sockets=1 CoresPerSocket=8 ThreadsPerCore=2 State=UNKNOWN
PartitionName=debug Nodes=nodename Default=YES MaxTime=INFINITE State=UP
**1 Answer**
I ran into this problem on a set of Docker containers running Ubuntu 20.04 that I was setting up as a mock cluster. I fixed it by installing the packages `libpmi1-pmix libpmi2-pmix libpmix-dev libpmix2 libopenmpi-dev libopenmpi3 libpmi-pmix-dev` and setting `MpiDefault=pmix` in slurm.conf.
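For reference, a minimal sketch of that fix on Ubuntu 20.04 (the config path /etc/slurm-llnl/slurm.conf and the systemd unit names are assumptions based on the stock slurm-wlm packages; adjust them to your installation):

# Install the PMIx libraries plus the PMI shims Slurm's plugins look for
sudo apt install libpmi1-pmix libpmi2-pmix libpmix-dev libpmix2 \
                 libopenmpi-dev libopenmpi3 libpmi-pmix-dev

# In /etc/slurm-llnl/slurm.conf, change the MPI default:
#   MpiDefault=pmix

# Restart the daemons so slurmctld and slurmd pick up the change
sudo systemctl restart slurmctld slurmd

After that, `srun binary` (or explicitly `srun --mpi=pmix binary`) in the batch script should go through the PMIx wire-up instead of aborting in MPI_Init.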