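Getting Slurm's built-in environment variables and what they do
I. Introduction to Slurm variables
Common environment variables
The table below lists Slurm environment variables (long variables):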
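Variable | Description |
---|---|
SLURM_NPROCS | Number of processes to launch |
SLURM_TASKS_PER_NODE | Number of tasks to launch on each node |
SLURM_JOB_ID | Job ID of the job |
SLURM_SUBMIT_DIR | Working directory from which the job was submitted |
SLURM_JOB_NODELIST | List of nodes allocated to the job |
SLURM_JOB_CPUS_PER_NODE | Number of CPUs allocated to the job on each node |
SLURM_JOB_NUM_NODES | Number of nodes allocated to the job |
HOSTNAME | For batch jobs, set to the name of the node on which the batch script runs |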
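To see what these variables hold in practice, a job script could print them one by one. The sketch below is only an illustration and is not part of the original example: `show_var` is a hypothetical helper name, and the `<unset>` placeholder exists so the script also runs outside a Slurm job.

```shell
#!/bin/bash
# show_var is a hypothetical helper: print "NAME = value" for one environment
# variable, or "<unset>" when it is not in the environment (e.g. outside a job).
# printenv only sees exported variables, which is how Slurm passes these to a job.
show_var() {
    name=$1
    printf '%s = %s\n' "$name" "$(printenv "$name" || echo '<unset>')"
}

# The long variables from the table above.
for v in SLURM_NPROCS SLURM_TASKS_PER_NODE SLURM_JOB_ID SLURM_SUBMIT_DIR \
         SLURM_JOB_NODELIST SLURM_JOB_CPUS_PER_NODE SLURM_JOB_NUM_NODES HOSTNAME
do
    show_var "$v"
done
```

Placed in the body of a batch script, this writes one line per variable to the job's output file.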
Short variables (filename pattern symbols, usable in options such as -o and -e)
%%
The character "%".
%A
Job array's master job allocation number.
%a
Job array ID (index) number.
%J
jobid.stepid of the running job. (e.g. "128.0")
%j
jobid of the running job.
%s
stepid of the running job.
%N
short hostname. This will create a separate IO file per node.
%n
Node identifier relative to current job (e.g. "0" is the first node of the running job)
This will create a separate IO file per node.
%t
task identifier (rank) relative to current job. This will create a separate IO file per task.
%u
User name.
%x
Job name.
%o
Used in an srun --multi-prog configuration file rather than in a filename pattern: replaced with the task's offset within the configured rank range (e.g. a configured task rank value of "1-5" would have offset values of "0-4").
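As a small illustration of how the short variables relate to the long ones, the sketch below uses %x and %j in the output file options and then rebuilds the same file name from SLURM_JOB_NAME and SLURM_JOB_ID inside the script. The job name `demo`, the helper `log_name`, and the fallback values are assumptions made for this example only.

```shell
#!/bin/bash
#SBATCH -J demo
#SBATCH -o %x_%j.out
#SBATCH -e %x_%j.err

# log_name is a hypothetical helper: rebuild the stdout file name from the
# environment variables that correspond to %x (job name) and %j (job ID).
# The fallbacks "demo" and "0" are placeholders used only outside a Slurm job.
log_name() {
    printf '%s_%s.out\n' "${SLURM_JOB_NAME:-demo}" "${SLURM_JOB_ID:-0}"
}

echo "stdout of this job is written to: $(log_name)"
```

The short variables are expanded by Slurm when it opens the files, so they only work in options such as -o and -e; inside the script body the long environment variables have to be used instead.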
II. Retrieving Slurm variables:
Directory: /lustre2/teach_pkuhpc/example/s01_getjobid
1. Use pkurun-cnlong 2 1 sleep 1 to generate a script. The script is named job.srp followed by a string of digits; here it is job.srp230917.
[test_pkuhpc@login05 s01_getjobid]$ pkurun-cnlong 1 1 sleep 1
Submitted batch job 1115431
2. Rename it:
[test_pkuhpc@login05 s01_getjobid]$ mv job.srp230917 job-env.srp
3. View its contents:
[test_pkuhpc@login05 s01_getjobid]$ cat job-env.srp
#!/bin/bash
#SBATCH -J sle230917
#SBATCH -p cn-long
#SBATCH -N 2
#SBATCH -o sle230917_%j.out
#SBATCH -e sle230917_%j.err
#SBATCH --no-requeue
#SBATCH -A test_g1
#SBATCH --qos=testcnl
#SBATCH -c 1
pkurun sleep 1
4. Edit it with vi:
[test_pkuhpc@login05 s01_getjobid]$ vi job-env.srp
#!/bin/bash
#SBATCH -J sle230917
#SBATCH -p cn-long
#SBATCH -N 2
#SBATCH -o sle230917_%j.out
#SBATCH -e sle230917_%j.err
#SBATCH --no-requeue
#SBATCH -A test_g1
#SBATCH --qos=testcnl
#SBATCH -c 1
pkurun sleep 1
# Note: press i to enter insert mode and Esc to leave it; in command mode, type :wq to save and quit
——————————————————
The modified contents are as follows:
#!/bin/bash
#SBATCH -J sle230917
#SBATCH -p cn-long
#SBATCH -N 1
#SBATCH -o sle230917_%j.out
#SBATCH -e sle230917_%j.err
#SBATCH --no-requeue
#SBATCH -A test_g1
#SBATCH --qos=testcnl
#SBATCH -c 1
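# Run only in the batch script itself (no job step) and only for task 0.
if [ -z "$SLURM_STEP_ID" ] && [ "$SLURM_PROCID" = "0" ]
then
    echo "print =========================================="
    echo "print SLURM_JOB_ID = $SLURM_JOB_ID"
    echo "print SLURM_NODELIST = $SLURM_NODELIST"
    echo "print =========================================="
    # Expand the compact node list into individual hostnames.
    hosts=$(scontrol show hostname "$SLURM_JOB_NODELIST")
    # Left unquoted on purpose so the hostnames print on one line.
    echo $hosts
fi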
# Note: SLURM_JOB_ID is the job ID; SLURM_NODELIST is the node list, e.g. c01b05n[05-07],c02b05n[04-05,11-12]
# scontrol show hostname $SLURM_JOB_NODELIST expands it into individual node names, which echo prints separated by spaces: c01b05n05 c01b05n06 c01b05n07 ...
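Once the node list has been expanded, it is often handy to walk over the hostnames one at a time. The following is a minimal sketch under stated assumptions: `number_hosts` is a hypothetical helper, and outside a Slurm job the script falls back to the two hostnames from the sample output of this tutorial so that it still runs.

```shell
#!/bin/bash
# number_hosts is a hypothetical helper: read hostnames from stdin, one per
# line, and prefix each with a running index starting at 0.
number_hosts() {
    i=0
    while IFS= read -r host; do
        printf '%d %s\n' "$i" "$host"
        i=$((i + 1))
    done
}

if [ -n "${SLURM_JOB_NODELIST:-}" ] && command -v scontrol >/dev/null 2>&1; then
    # Inside a job: let scontrol expand the compact node list.
    scontrol show hostnames "$SLURM_JOB_NODELIST" | number_hosts
else
    # Outside Slurm: use the hostnames from the sample output as stand-ins.
    printf '%s\n' c05b03n03 c05b03n04 | number_hosts
fi
```

Reading from stdin keeps the loop independent of scontrol, so the same helper works on any list of hostnames.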
5. Submit the job
[test_pkuhpc@login05 s01_getjobid]$ sbatch job-env.srp
Submitted batch job 1115432
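The job ID is also available on the login node at submission time, from the confirmation line that sbatch prints. A minimal sketch of extracting it follows; `parse_job_id` is a hypothetical helper, and the input is the confirmation line shown above rather than a live sbatch call.

```shell
#!/bin/bash
# parse_job_id is a hypothetical helper: read sbatch's confirmation line from
# stdin and print only the job ID (the fourth whitespace-separated field).
parse_job_id() {
    awk '/^Submitted batch job/ { print $4 }'
}

# Demonstrated on the confirmation line from this tutorial. In real use one
# would pipe sbatch into the helper instead:
#   jobid=$(sbatch job-env.srp | parse_job_id)
jobid=$(echo "Submitted batch job 1115432" | parse_job_id)
echo "job ID: $jobid"
```

sbatch also has a --parsable option that prints the job ID without the surrounding text, which may be simpler where it is available.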
6. Check the output
[test_pkuhpc@login05 s01_getjobid]$ ll
total 9
-rwx------ 1 test_pkuhpc test_pkuhpc 536 Nov 13 23:21 job-env.srp
-rwx------ 1 test_pkuhpc test_pkuhpc 469 Nov 9 23:37 job.srp
-rw-rw-r-- 1 test_pkuhpc test_pkuhpc 0 Nov 13 23:22 sle230917_1115432.err
-rw-rw-r-- 1 test_pkuhpc test_pkuhpc 185 Nov 13 23:22 sle230917_1115432.out
[test_pkuhpc@login05 s01_getjobid]$ cat sle230917_1115432.out
print ==========================================
print SLURM_JOB_ID = 1115432
print SLURM_NODELIST = c05b03n[03-04]
print ==========================================
c05b03n03 c05b03n04
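7. In the script job-env.srp, the %j in sle230917_%j.out is a short variable; Slurm replaces it with the job ID, which is why the output file here is named sle230917_1115432.out.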
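III. Where the environment variables can be used:
For example, SLURM_JOB_ID can serve as a unique random seed in MATLAB, because every job ID is unique.
SLURM_NODELIST can be used to get the node names, to trace a job failure back to the nodes that may have caused it, and to build the hostfile passed to a program when the job runs.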
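A minimal sketch of the seeding idea is shown here. The helper name `job_seed` and the fallback seed of 1 are assumptions for illustration; the MATLAB call is left as a comment because it assumes a matlab executable on the PATH that supports the -batch option.

```shell
#!/bin/bash
# job_seed is a hypothetical helper: use the job ID as the seed inside a
# Slurm job, and a fixed placeholder seed of 1 outside one.
job_seed() {
    printf '%s\n' "${SLURM_JOB_ID:-1}"
}

seed=$(job_seed)
echo "using random seed $seed"

# A MATLAB run could then consume the seed, for example (assumption: matlab
# is on PATH and supports -batch):
#   matlab -batch "rng($seed); disp(rand)"
```

Because the seed is the job ID, a run can be reproduced later by passing the same number by hand.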
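The hostfile use case can be sketched as follows. This is an illustration under assumptions, not the site's official recipe: `job_hosts` is a hypothetical helper, the temporary file name comes from mktemp, and outside a Slurm job the script falls back to the local machine name so that it still runs.

```shell
#!/bin/bash
# job_hosts is a hypothetical helper: print the job's hostnames one per line.
# Inside a job it asks scontrol to expand the node list; outside Slurm it
# falls back to this machine's name.
job_hosts() {
    if [ -n "${SLURM_JOB_NODELIST:-}" ] && command -v scontrol >/dev/null 2>&1; then
        scontrol show hostnames "$SLURM_JOB_NODELIST"
    else
        uname -n
    fi
}

# Write the hostfile. mktemp is used here only to avoid clobbering files;
# a real job script would more likely pick a fixed, descriptive name.
hostfile=$(mktemp)
job_hosts > "$hostfile"
echo "wrote hostfile $hostfile:"
cat "$hostfile"

# An MPI launcher could then read it, for example with Open MPI (assumption;
# ./my_program is a placeholder):
#   mpirun --hostfile "$hostfile" ./my_program
```

The exact hostfile syntax differs between MPI implementations, so the one-hostname-per-line layout here may need adjusting.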