Slurm and job scheduling

Getting Slurm's built-in environment variables and what they are for

2019-11-13 23:27:25 admin 3382

I. Overview of Slurm variables

Commonly used environment variables

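The table below lists the Slurm environment variables (long variables):

Variable                   Description
SLURM_NPROCS               Number of processes to launch
SLURM_TASKS_PER_NODE       Number of tasks to launch on each node
SLURM_JOB_ID               Job ID of the job
SLURM_SUBMIT_DIR           Working directory from which the job was submitted
SLURM_JOB_NODELIST         List of nodes allocated to the job
SLURM_JOB_CPUS_PER_NODE    Number of CPUs allocated to the job on each node
SLURM_JOB_NUM_NODES        Number of nodes allocated to the job
HOSTNAME                   For batch jobs, the name of the node on which the batch script runs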
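
As a minimal sketch of how a batch script might read these variables, the snippet below prints each one and falls back to the placeholder "(unset)" when a variable is missing, for example when the script is run outside a Slurm job. The helper name show_var and the placeholder text are illustrative choices, not part of Slurm.

```shell
#!/bin/bash
# Print one environment variable by name, or "(unset)" if it is empty or undefined.
# show_var is a hypothetical helper for this sketch, not a Slurm command.
show_var() {
  local name="$1"
  # ${!name} is bash indirect expansion: the value of the variable whose name is in $name.
  printf '%s = %s\n' "$name" "${!name:-(unset)}"
}

# The long variables from the table above.
for v in SLURM_JOB_ID SLURM_SUBMIT_DIR SLURM_JOB_NODELIST \
         SLURM_JOB_NUM_NODES SLURM_JOB_CPUS_PER_NODE \
         SLURM_TASKS_PER_NODE SLURM_NPROCS HOSTNAME; do
  show_var "$v"
done
```

Placed near the top of a job script, this leaves a record in the job's output file of what was actually allocated.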

Short variables (filename-pattern symbols, usable in options such as -o and -e)


%%    The character "%".
%A    Job array's master job allocation number.
%a    Job array ID (index) number.
%J    jobid.stepid of the running job (e.g. "128.0").
%j    jobid of the running job.
%s    stepid of the running job.
%N    Short hostname. This will create a separate IO file per node.
%n    Node identifier relative to the current job (e.g. "0" is the first node of the running job). This will create a separate IO file per node.
%t    Task identifier (rank) relative to the current job. This will create a separate IO file per task.
%u    User name.
%x    Job name.

In an srun --multi-prog configuration file, the expression "%o" is replaced with the task's offset within the configured task rank range (e.g. a configured task rank value of "1-5" would have offset values of "0-4").
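
To make the filename patterns listed above more concrete, here is a small sketch that imitates the substitution for four of them (%j, %x, %u and %%). It is only an illustration of the idea, not Slurm's actual implementation; the function name expand_pattern and its argument order are assumptions made for this example.

```shell
#!/bin/bash
# Imitate the filename-pattern substitution for %j, %x, %u and %%.
# expand_pattern is a hypothetical helper; Slurm does this internally.
# Usage: expand_pattern PATTERN JOBID JOBNAME USER
expand_pattern() {
  local pattern="$1" jobid="$2" jobname="$3" user="$4"
  local out="" i ch
  for ((i = 0; i < ${#pattern}; i++)); do
    ch="${pattern:i:1}"
    if [ "$ch" = "%" ] && (( i + 1 < ${#pattern} )); then
      i=$((i + 1))                        # consume the symbol after "%"
      case "${pattern:i:1}" in
        j) out+="$jobid" ;;               # %j -> job ID
        x) out+="$jobname" ;;             # %x -> job name
        u) out+="$user" ;;                # %u -> user name
        %) out+="%" ;;                    # %% -> literal "%"
        *) out+="%${pattern:i:1}" ;;      # anything else is left unchanged here
      esac
    else
      out+="$ch"
    fi
  done
  printf '%s\n' "$out"
}

expand_pattern 'sle230917_%j.out' 1115432 sle230917 test_pkuhpc
# prints: sle230917_1115432.out
```

With a pattern such as %x_%j.out, the same job would get an output file named after both the job name and the job ID, which avoids hard-coding the name in two places.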

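II. Getting the Slurm variables

The example directory is: /lustre2/teach_pkuhpc/example/s01_getjobid

1. Run pkurun-cnlong 2 1 sleep 1 to generate a script. The script is named job.srp followed by a string of digits; here it is job.srp230917.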

[test_pkuhpc@login05 s01_getjobid]$ pkurun-cnlong 2 1 sleep 1

Submitted batch job 1115431

2. Rename it:

[test_pkuhpc@login05 s01_getjobid]$ mv job.srp230917  job-env.srp

3. View its contents:

[test_pkuhpc@login05 s01_getjobid]$ cat job-env.srp 

#!/bin/bash
#SBATCH -J sle230917
#SBATCH -p cn-long
#SBATCH -N 2
#SBATCH -o sle230917_%j.out
#SBATCH -e sle230917_%j.err
#SBATCH --no-requeue
#SBATCH -A test_g1
#SBATCH --qos=testcnl
#SBATCH -c 1
pkurun  sleep 1

4. Edit it with vi:

[test_pkuhpc@login05 s01_getjobid]$ vi job-env.srp 

#!/bin/bash
#SBATCH -J sle230917
#SBATCH -p cn-long
#SBATCH -N 2
#SBATCH -o sle230917_%j.out
#SBATCH -e sle230917_%j.err
#SBATCH --no-requeue
#SBATCH -A test_g1
#SBATCH --qos=testcnl
#SBATCH -c 1
pkurun  sleep 1

# Note: press i to enter insert mode and Esc to leave it; in command mode, type :wq to save and quit.

——————————————————

The modified script is as follows:

#!/bin/bash
#SBATCH -J sle230917
#SBATCH -p cn-long
#SBATCH -N 2
#SBATCH -o sle230917_%j.out
#SBATCH -e sle230917_%j.err
#SBATCH --no-requeue
#SBATCH -A test_g1
#SBATCH --qos=testcnl
#SBATCH -c 1

# Run the block only when SLURM_STEP_ID is empty and SLURM_PROCID is 0.
if [ -z "$SLURM_STEP_ID" ] && [ "$SLURM_PROCID" = "0" ]
then
  echo "print =========================================="
  echo "print SLURM_JOB_ID = $SLURM_JOB_ID"
  echo "print SLURM_NODELIST = $SLURM_NODELIST"
  echo "print =========================================="
  # Expand the compact node list into individual node names.
  hosts=$(scontrol show hostname "$SLURM_JOB_NODELIST")
  # $hosts is left unquoted on purpose so the names print on one line, separated by spaces.
  echo $hosts
fi

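# Note: SLURM_JOB_ID is the job ID; SLURM_NODELIST is the node list in compact form, e.g. c01b05n[05-07],c02b05n[04-05,11-12]

# scontrol show hostname $SLURM_JOB_NODELIST expands that list into individual node names (c01b05n05 c01b05n06 c01b05n07 ...); echo $hosts then prints them separated by spaces.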
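
Building on the note above, the following sketch reads the expanded node names into a bash array so that a script can count them or loop over them. It assumes the scontrol command is available (that is, it runs on a Slurm cluster); the function name expand_nodes is made up for this example. The sketch uses the spelling "hostnames"; the shorter "hostname" used in the script above is accepted too, as its output shows.

```shell
#!/bin/bash
# Expand a compact node list into one node name per line.
# expand_nodes is a hypothetical wrapper; it needs scontrol, so it only works where Slurm is installed.
expand_nodes() {
  scontrol show hostnames "$1"
}

# Only do the work inside a job, where SLURM_JOB_NODELIST is set.
if [ -n "${SLURM_JOB_NODELIST:-}" ]; then
  # mapfile -t stores each output line as one array element.
  mapfile -t nodes < <(expand_nodes "$SLURM_JOB_NODELIST")
  echo "job has ${#nodes[@]} node(s)"
  for n in "${nodes[@]}"; do
    echo "node: $n"
  done
fi
```

Keeping the names in an array, rather than in one space-separated string, makes it straightforward to address a single node, for instance "${nodes[0]}" for the first one.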

5. Submit the job

[test_pkuhpc@login05 s01_getjobid]$ sbatch  job-env.srp 

Submitted batch job 1115432

6. Check the output

[test_pkuhpc@login05 s01_getjobid]$ ll

total 9
-rwx------ 1 test_pkuhpc test_pkuhpc 536 Nov 13 23:21 job-env.srp
-rwx------ 1 test_pkuhpc test_pkuhpc 469 Nov  9 23:37 job.srp
-rw-rw-r-- 1 test_pkuhpc test_pkuhpc   0 Nov 13 23:22 sle230917_1115432.err
-rw-rw-r-- 1 test_pkuhpc test_pkuhpc 185 Nov 13 23:22 sle230917_1115432.out

[test_pkuhpc@login05 s01_getjobid]$ cat sle230917_1115432.out

print ==========================================
print SLURM_JOB_ID = 1115432
print SLURM_NODELIST = c05b03n[03-04]
print ==========================================
c05b03n03 c05b03n04

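7. In the script job-env.srp, the output file is named sle230917_%j.out, where %j is a short variable that is replaced with the job ID, giving sle230917_1115432.out here.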

III. What the environment variables can be used for

For example, SLURM_JOB_ID can serve as a unique random seed in MATLAB, because every job ID is unique.
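
One way this could look in a job script is sketched below: the job ID is turned into a MATLAB rng(...) statement that is passed on the command line. The sketch only prints the command instead of running it. The name my_simulation stands for a hypothetical user script (my_simulation.m), and the fallback seed 0 outside a job is an arbitrary choice for this example.

```shell
#!/bin/bash
# Build the MATLAB statement that seeds the random number generator from the job ID.
# matlab_seed_stmt is a hypothetical helper; seed 0 is used when SLURM_JOB_ID is not set.
matlab_seed_stmt() {
  local seed="${SLURM_JOB_ID:-0}"
  printf 'rng(%s);' "$seed"
}

# Print the command that would be run; my_simulation is a placeholder script name.
echo "matlab -batch \"$(matlab_seed_stmt) my_simulation\""
```

Because the seed is the job ID, a run can be reproduced later by calling rng with the job ID recorded in the output file name.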

SLURM_NODELIST can be used to get the node names, to narrow down which nodes may be behind a job failure, and to build a hostfile when launching a job.
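
A hostfile of the kind mentioned above could be produced roughly as follows. This is a sketch under assumptions: the function name write_hostfile and the file name hostfile.<jobid> are choices made for this example, and scontrol must be available. How the file is then handed to a launcher depends on the launcher and is not shown.

```shell
#!/bin/bash
# Write one node name per line to a hostfile.
# write_hostfile is a hypothetical helper.
# $1: compact node list (e.g. "$SLURM_JOB_NODELIST"); $2: path of the file to write.
write_hostfile() {
  scontrol show hostnames "$1" > "$2"
}

# Only do the work inside a job, where SLURM_JOB_NODELIST is set.
if [ -n "${SLURM_JOB_NODELIST:-}" ]; then
  hostfile="hostfile.${SLURM_JOB_ID}"   # naming scheme chosen for this sketch
  write_hostfile "$SLURM_JOB_NODELIST" "$hostfile"
  echo "wrote $(wc -l < "$hostfile") node(s) to $hostfile"
fi
```

Including the job ID in the file name keeps hostfiles from concurrent jobs in the same directory from overwriting each other.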
