Deploying SLURM Cluster
Deploying the SLURM database isn't stable yet. Please feel free to create an issue so we can improve its stability.
Helm and Docker resources
The Helm resources are stored on the ClusterFactory Git Repository.
The Dockerfile is described in the git repository SquareFactory/slurm-docker.
The Docker images can be pulled with:
docker pull ghcr.io/squarefactory/slurm:latest-controller
docker pull ghcr.io/squarefactory/slurm:latest-login
docker pull ghcr.io/squarefactory/slurm:latest-db
docker pull ghcr.io/squarefactory/slurm:latest-rest
You should always verify the default Helm values before editing the values field of an Argo CD Application.
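For example, you can inspect the chart's default values locally before overriding them (a sketch, assuming you run it from the root of your ClusterFactory checkout):
# Print the default values of the slurm-cluster chart
helm show values helm/slurm-cluster | less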
1. Preparation
Compared to the other guides, we will start from scratch.
Delete the argo/slurm-cluster directory (or rename it).
Deploying a SLURM cluster isn't easy and you MUST have these components ready:
- An LDAP server and an SSSD configuration, to synchronize user IDs across the cluster
- A MySQL server for the SLURM DB
- A JWT private key, for the authentication via REST API
- A MUNGE key, for the authentication of SLURM daemons
Namespace and AppProject
Create and apply the Namespace and AppProject:
apiVersion: v1
kind: Namespace
metadata:
name: slurm-cluster
labels:
app.kubernetes.io/name: slurm-cluster
apiVersion: argoproj.io/v1alpha1
kind: AppProject
metadata:
name: slurm-cluster
namespace: argocd
# Finalizer that ensures that project is not deleted until it is not referenced by any application
finalizers:
- resources-finalizer.argocd.argoproj.io
spec:
description: Slurm cluster
# Allow manifests to deploy from any Git repos
sourceRepos:
- '*'
# Only permit applications to deploy to the namespace in the same cluster
destinations:
- namespace: slurm-cluster
server: https://kubernetes.default.svc
namespaceResourceWhitelist:
- kind: '*'
group: '*'
clusterResourceWhitelist:
- kind: '*'
group: '*'
kubectl apply -f argo/slurm-cluster/
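To confirm that both resources were created, you can check them (a sketch):
# The Namespace should be Active and the AppProject should exist in the argocd namespace
kubectl get namespace slurm-cluster
kubectl get appproject -n argocd slurm-cluster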
LDAP deployment
Follow the guide.
Open a shell on the LDAP server container and create a slurm user and group:
# Create user
dsidm -b "dc=example,dc=com" localhost user create \
--uid slurm \
--cn slurm \
--displayName slurm \
--homeDirectory "/dev/shm" \
--uidNumber 1501 \
--gidNumber 1501
# Create group
dsidm -b "dc=example,dc=com" localhost group create \
--cn slurm
# Add posixGroup property and gidNumber
dsidm -b "dc=example,dc=com" localhost group modify slurm \
"add:objectClass:posixGroup" \
"add:gidNumber:1501"
SSSD configuration
Let's store it in a Secret:
- Create the argo/slurm-cluster/secrets/ directory and create a -secret.yaml.local file:
apiVersion: v1
kind: Secret
metadata:
name: sssd-secret
namespace: slurm-cluster
type: Opaque
stringData:
sssd.conf: |
# https://sssd.io/docs/users/troubleshooting/how-to-troubleshoot-backend.html
[sssd]
services = nss,pam,sudo,ssh
config_file_version = 2
domains = example-ldap
[sudo]
[nss]
[pam]
offline_credentials_expiration = 60
[domain/example-ldap]
debug_level=3
cache_credentials = True
dns_resolver_timeout = 15
override_homedir = /home/ldap-users/%u
id_provider = ldap
auth_provider = ldap
access_provider = ldap
chpass_provider = ldap
ldap_schema = rfc2307bis
ldap_uri = ldaps://dirsrv-389ds.ldap.svc.cluster.local:3636
ldap_default_bind_dn = cn=Directory Manager
ldap_default_authtok = <password>
ldap_search_timeout = 50
ldap_network_timeout = 60
ldap_user_member_of = memberof
ldap_user_gecos = cn
ldap_user_uuid = nsUniqueId
ldap_group_uuid = nsUniqueId
ldap_search_base = ou=people,dc=example,dc=com
ldap_group_search_base = ou=groups,dc=example,dc=com
ldap_sudo_search_base = ou=sudoers,dc=example,dc=com
ldap_user_ssh_public_key = nsSshPublicKey
ldap_account_expire_policy = rhds
ldap_access_order = filter, expire
ldap_access_filter = (objectClass=posixAccount)
ldap_tls_cipher_suite = HIGH
# On Ubuntu, the LDAP client is linked to GnuTLS instead of OpenSSL => cipher suite names are different
# In fact, it's not even a cipher suite name that goes here, but a so called "priority list" => see $> gnutls-cli --priority-list
# See https://backreference.org/2009/11/18/openssl-vs-gnutls-cipher-names/ , gnutls-cli is part of package gnutls-bin
Adapt this secret based on your LDAP configuration.
- Seal the secret:
cfctl kubeseal
- Apply the SealedSecret:
kubectl apply -f argo/slurm-cluster/secrets/sssd-sealed-secret.yaml
MySQL deployment
You can deploy MySQL using Bitnami's Helm chart and write an Argo CD application for it.
After deploying the MySQL/MariaDB server, you must create a slurm database. Open a shell on the MySQL container and run:
mysql -u root -p -h localhost
# Enter your root password
create database slurm_acct_db;
create user 'slurm'@'%' identified by '<your password>';
grant all on slurm_acct_db.* to 'slurm'@'%';
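You can verify the grants by logging in as the slurm user (a sketch; replace <mysql host> with your MySQL service address):
# List the privileges of the slurm user on slurm_acct_db
mysql -u slurm -p -h <mysql host> -e 'SHOW GRANTS FOR CURRENT_USER();'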
JWT Key generation
ssh-keygen -t rsa -b 4096 -m PEM -f jwtRS256.key
Let's store it in a Secret:
- Create a -secret.yaml.local file:
apiVersion: v1
kind: Secret
metadata:
name: slurm-secret
namespace: slurm-cluster
type: Opaque
stringData:
jwt_hs256.key: |
-----BEGIN RSA PRIVATE KEY-----
...
-----END RSA PRIVATE KEY-----
- Seal the secret:
cfctl kubeseal
- Apply the SealedSecret:
kubectl apply -f argo/slurm-cluster/secrets/slurm-sealed-secret.yaml
MUNGE Key generation
# As root
dnf install -y munge
/usr/sbin/create-munge-key
cat /etc/munge/munge.key | base64 -w 0
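If you prefer, you can let kubectl generate the Secret manifest from the key file instead of writing it by hand as shown below (a sketch; the output path follows the layout used in this guide):
# Generate the Secret manifest containing the MUNGE key without applying it
kubectl create secret generic munge-secret \
  --namespace slurm-cluster \
  --from-file=munge.key=/etc/munge/munge.key \
  --dry-run=client -o yaml > argo/slurm-cluster/secrets/munge-secret.yaml.local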
Let's store it in a Secret:
- Create a -secret.yaml.local file:
apiVersion: v1
kind: Secret
metadata:
name: munge-secret
namespace: slurm-cluster
type: Opaque
data:
  munge.key: <base 64 encoded key, single line>
- Seal the secret:
cfctl kubeseal
- Apply the SealedSecret:
kubectl apply -f argo/slurm-cluster/secrets/munge-sealed-secret.yaml
2. Begin writing the slurm-cluster-<cluster name>-app.yaml
2.a. Argo CD Application configuration
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
name: slurm-cluster-<FILL ME: cluster name>-app
namespace: argocd
finalizers:
- resources-finalizer.argocd.argoproj.io
spec:
project: slurm-cluster
source:
# You should have forked this repo. Change the URL to your fork.
repoURL: git@github.com:<FILL ME: your account>/ClusterFactory.git
# You should use your branch too.
targetRevision: HEAD
path: helm/slurm-cluster
helm:
releaseName: slurm-cluster-<FILL ME: cluster name>
valueFiles:
- values-<FILL ME: cluster name>.yaml
destination:
server: 'https://kubernetes.default.svc'
namespace: slurm-cluster
syncPolicy:
automated:
prune: true # Specifies if resources should be pruned during auto-syncing ( false by default ).
selfHeal: true # Specifies if partial app sync should be executed when resources are changed only in target Kubernetes cluster and no git change detected ( false by default ).
allowEmpty: false # Allows deleting all application resources during automatic syncing ( false by default ).
syncOptions: []
retry:
limit: 5 # number of failed sync attempt retries; unlimited number of attempts if less than 0
backoff:
duration: 5s # the amount to back off. Default unit is seconds, but could also be a duration (e.g. "2m", "1h")
factor: 2 # a factor to multiply the base duration after each failed retry
maxDuration: 3m # the maximum amount of time allowed for the backoff strategy
2.b. Values: Configuring the SLURM cluster
Add the following to the helm/slurm-cluster/values-<cluster name>.yaml file:
sssd:
secretName: sssd-secret
munge:
secretName: munge-secret
jwt:
secretName: slurm-secret
slurmConfig:
clusterName: <FILL ME: cluster-name>
compute:
srunPortRangeStart: 60001
srunPortRangeEnd: 63000
debug: debug5
accounting: |
AccountingStorageType=accounting_storage/slurmdbd
AccountingStorageHost=slurm-cluster-<FILL ME: cluster name>.slurm-cluster.svc.cluster.local
AccountingStoragePort=6819
AccountingStorageTRES=gres/gpu
controller:
parameters: enable_configless
debug: debug5
defaultResourcesAllocation: |
# Change accordingly
DefCpuPerGPU=4
DefMemPerCpu=7000
nodes: |
# Change accordingly
NodeName=cn[1-12] CPUs=32 Boards=1 SocketsPerBoard=1 CoresPerSocket=16 ThreadsPerCore=2 RealMemory=128473 Gres=gpu:4
partitions: |
# Change accordingly
PartitionName=main Nodes=cn[1-12] Default=YES MaxTime=INFINITE State=UP OverSubscribe=NO TRESBillingWeights="CPU=2.6,Mem=0.25G,GRES/gpu=24.0"
gres: |
# Change accordingly
NodeName=cn[1-12] File=/dev/nvidia[0-3] AutoDetect=nvml
# Extra slurm.conf configuration
extra: |
LaunchParameters=enable_nss_slurm
DebugFlags=Script,Gang,SelectType
TCPTimeout=5
# MPI stacks running over Infiniband or OmniPath require the ability to allocate more
# locked memory than the default limit. Unfortunately, user processes on login nodes
# may have a small memory limit (check it by ulimit -a) which by default are propagated
# into Slurm jobs and hence cause fabric errors for MPI.
PropagateResourceLimitsExcept=MEMLOCK
ProctrackType=proctrack/cgroup
TaskPlugin=task/cgroup
SwitchType=switch/none
MpiDefault=pmix_v2
ReturnToService=2
GresTypes=gpu
PreemptType=preempt/qos
PreemptMode=REQUEUE
PreemptExemptTime=-1
Prolog=/etc/slurm/prolog.d/*
Epilog=/etc/slurm/epilog.d/*
# Federation
FederationParameters=fed_display
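Before committing, you can render the chart locally to check that the generated manifests and slurm.conf look right (a sketch, run from the repository root; the chart may require other values to be set):
# Render the templates with your values file and inspect the output
helm template helm/slurm-cluster \
  -f helm/slurm-cluster/values-<FILL ME: cluster name>.yaml | less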
3. Slurm DB Deployment
3.a. Secrets
Assuming you have deployed LDAP and MySQL, we will store the slurmdbd.conf inside a Secret:
- Create a -secret.yaml.local file:
apiVersion: v1
kind: Secret
metadata:
name: slurmdbd-conf-secret
namespace: slurm-cluster
type: Opaque
stringData:
slurmdbd.conf: |
# See https://slurm.schedmd.com/slurmdbd.conf.html
### Main
DbdHost=slurm-cluster-<FILL ME: cluster name>-db-0
SlurmUser=slurm
### Logging
DebugLevel=debug5 # optional, defaults to 'info'. Possible values: fatal, error, info, verbose, debug, debug[2-5]
LogFile=/var/log/slurm/slurmdbd.log
PidFile=/var/run/slurmdbd.pid
LogTimeFormat=thread_id
AuthAltTypes=auth/jwt
AuthAltParameters=jwt_key=/var/spool/slurm/jwt_hs256.key
### Database server configuration
StorageType=accounting_storage/mysql
StorageHost=<FILL ME>
StorageUser=<FILL ME>
StoragePass=<FILL ME>
Replace the <FILL ME> placeholders according to your existing configuration.
- Seal the secret:
cfctl kubeseal
- Apply the SealedSecret:
kubectl apply -f argo/slurm-cluster/secrets/slurmdbd-conf-sealed-secret.yaml
3.b. Values: Enable SLURM DB
Edit the helm/slurm-cluster/values-<cluster name>.yaml values file.
Let's add the values to deploy a SLURM DB.
db:
enabled: true
config:
secretName: slurmdbd-conf-secret
If you are using LDAPS and the CA is private:
db:
enabled: true
config:
secretName: slurmdbd-conf-secret
command: ['sh', '-c', 'update-ca-trust && /init']
volumeMounts:
- name: ca-cert
mountPath: /etc/pki/ca-trust/source/anchors/example.com.ca.pem
subPath: example.com.ca.pem
volumes:
- name: ca-cert
secret:
secretName: local-ca-secret
local-ca-secret is a Secret containing example.com.ca.pem.
You can already deploy it:
git add .
git commit -m "Added SLURM DB values"
git push
# This is optional if the application is already deployed.
kubectl apply -f argo/slurm-cluster/apps/slurm-cluster-<cluster name>-app.yaml
The service should be accessible at the address slurm-cluster-<cluster name>-db-0.slurm-cluster.svc.cluster.local. Use that address in the SLURM configuration.
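You can check that the SLURM DB pod is running and inspect its logs (a sketch):
# The db StatefulSet pod should be Running
kubectl get statefulsets,pods,svc -n slurm-cluster
# slurmdbd logs
kubectl logs -n slurm-cluster slurm-cluster-<cluster name>-db-0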
4. Slurm Controller Deployment
4.a. Volumes
We will use NFS. Feel free to use another type of storage.
- StorageClass (dynamic)
- PersistentVolume (static)
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
name: controller-state-<cluster name>-nfs
namespace: slurm-cluster
labels:
app: slurm-controller
topology.kubernetes.io/region: <FILL ME> # <country code>-<city>
topology.kubernetes.io/zone: <FILL ME> # <country code>-<city>-<index>
provisioner: nfs.csi.k8s.io
parameters:
server: <FILL ME> # IP or host
share: <FILL ME> # /srv/nfs/k8s/slurmctl
mountPermissions: '0775'
mountOptions:
- hard
- nfsvers=4.1
- noatime
- nodiratime
volumeBindingMode: Immediate
reclaimPolicy: Retain
allowedTopologies:
- matchLabelExpressions:
- key: topology.kubernetes.io/region
values:
- <FILL ME> # <country code>-<city>
kubectl apply -f argo/slurm-cluster/volumes/controller-state-<cluster name>-nfs.yaml
apiVersion: v1
kind: PersistentVolume
metadata:
name: controller-state-<cluster name>-pv
namespace: slurm-cluster
labels:
app: slurm-controller
topology.kubernetes.io/region: <FILL ME> # <country code>-<city>
topology.kubernetes.io/zone: <FILL ME> # <country code>-<city>-<index>
spec:
capacity:
storage: 10Gi
mountOptions:
- hard
- nfsvers=4.1
- noatime
- nodiratime
csi:
driver: nfs.csi.k8s.io
readOnly: false
volumeHandle: <unique id> # uuidgen
volumeAttributes:
server: <FILL ME> # IP or host
share: <FILL ME> # /srv/nfs/k8s/slurmctl
mountPermissions: '0775'
accessModes:
- ReadWriteOnce
persistentVolumeReclaimPolicy: Retain
kubectl apply -f argo/slurm-cluster/volumes/controller-state-<cluster name>-pv.yaml
The label app=slurm-controller will be used by the PersistentVolumeClaim.
4.b. Values: Enable SLURM Controller
Let's add the values to deploy a SLURM Controller.
- StorageClass (dynamic)
- PersistentVolume (static)
controller:
enabled: true
persistence:
storageClassName: 'controller-state-<cluster name>-nfs'
accessModes: ['ReadWriteOnce']
size: 10Gi
nodeSelector:
topology.kubernetes.io/region: <FILL ME> # <country code>-<city>
topology.kubernetes.io/zone: <FILL ME> # <country code>-<city>-<index>
resources:
requests:
cpu: '250m'
memory: '1Gi'
limits:
cpu:
memory: '1Gi'
controller:
enabled: true
persistence:
storageClassName: ''
accessModes: ['ReadWriteOnce']
size: 10Gi
selectorLabels:
app: slurm-controller
topology.kubernetes.io/region: <FILL ME> # <country code>-<city>
topology.kubernetes.io/zone: <FILL ME> # <country code>-<city>-<index>
nodeSelector:
kubernetes.io/hostname: <FILL ME>
topology.kubernetes.io/region: <FILL ME> # <country code>-<city>
topology.kubernetes.io/zone: <FILL ME> # <country code>-<city>-<index>
resources:
requests:
cpu: '250m'
memory: '1Gi'
limits:
cpu:
memory: '1Gi'
Notice that kubernetes.io/hostname is used: the SLURM controller uses the host network, and we don't want it to move between nodes.
We might develop an HA setup in a future version of ClusterFactory.
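To find the right values for kubernetes.io/hostname and the topology labels, list the labels of your nodes (a sketch):
# Show the relevant labels for each node
kubectl get nodes -L kubernetes.io/hostname,topology.kubernetes.io/region,topology.kubernetes.io/zone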
If you are using LDAPS and the CA is private, append these values:
controller:
# ...
command: ['sh', '-c', 'update-ca-trust && /init']
volumeMounts:
- name: ca-cert
mountPath: /etc/pki/ca-trust/source/anchors/example.com.ca.pem
subPath: example.com.ca.pem
volumes:
- name: ca-cert
secret:
secretName: local-ca-secret
local-ca-secret is a Secret containing example.com.ca.pem.
You can already deploy it:
git add .
git commit -m "Added SLURM Controller values"
git push
# This is optional if the application is already deployed.
kubectl apply -f argo/slurm-cluster/apps/slurm-cluster-<cluster name>-app.yaml
The SLURM controller runs in host mode using hostPort so it can communicate with the bare-metal hosts. There is also a SLURM controller Service for the internal communication with the SLURM DB and SLURM Login nodes.
4.c Testing: sinfo from the controller node
You should be able to open a kubectl exec session in the controller pod and execute sinfo:
[user@local /]> kubectl exec -it -n slurm-cluster slurm-cluster-<cluster-name>-controller-0 -c slurm-cluster-<cluster-name>-controller -- bash
[root@slurm-cluster-reindeer-controller-0 /]> sinfo
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
main* up infinite 12 down* cn[1-12]
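The nodes are reported down because the compute nodes are not provisioned yet (see section 5). Once slurmd is running on them, you can return them to service from the same shell if they stay down (a sketch):
# Clear the down state of the compute nodes
scontrol update NodeName=cn[1-12] State=RESUME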
5. Slurm Compute Bare-Metal Deployment
5.a. Build an OS Image with Slurm
We have enabled configless mode in the slurm.conf.
We need to build an OS image with the SLURM daemon (slurmd) installed.
Using the packer-recipes directory, we can create a recipe called compute.my-cluster.json:
{
"variables": {
"boot_wait": "3s",
"disk_size": "50G",
"iso_checksum": "53a62a72881b931bdad6b13bcece7c3a2d4ca9c4a2f1e1a8029d081dd25ea61f",
"iso_url": "https://download.rockylinux.org/vault/rocky/8.4/isos/x86_64/Rocky-8.4-x86_64-boot.iso",
"memsize": "8192",
"numvcpus": "8"
},
"builders": [
{
"type": "qemu",
"accelerator": "kvm",
"communicator": "none",
"boot_command": [
"<up><tab><bs><bs><bs><bs><bs> ",
"inst.ks=http://{{ .HTTPIP }}:{{ .HTTPPort }}/ks.my-cluster.cfg ",
"inst.cmdline",
"<enter><wait>"
],
"boot_wait": "{{ user `boot_wait` }}",
"disk_size": "{{ user `disk_size` }}",
"iso_url": "{{ user `iso_url` }}",
"iso_checksum": "{{ user `iso_checksum` }}",
"headless": true,
"cpus": "{{ user `numvcpus` }}",
"memory": "{{ user `memsize` }}",
"vnc_bind_address": "0.0.0.0",
"shutdown_timeout": "3h",
"shutdown_timeout": "1h",
"qemuargs": [["-serial", "stdio"]]
}
]
}
Also create the ks.my-cluster.cfg file in the http directory:
url --url="https://dl.rockylinux.org/pub/rocky/9.0/BaseOS/x86_64/os/"
# License agreement
eula --agreed
# Disable Initial Setup on first boot
firstboot --disable
# Poweroff after the install is finished
poweroff
# Firewall
firewall --disable
ignoredisk --only-use=vda
# Use SSSD
authselect select sssd with-mkhomedir with-sudo
# System language
lang en_US.UTF-8
# Keyboard layout
keyboard us
# Network information
network --bootproto=dhcp --device=eth0
# SELinux configuration
selinux --disabled
# System timezone
timezone UTC --utc
# System bootloader configuration
bootloader --location=mbr --driveorder="vda" --timeout=1
# Root password
rootpw --plaintext an_example_of_default_password
# System services
services --enabled="chronyd"
repo --name="AppStream" --baseurl=https://dl.rockylinux.org/pub/rocky/9.0/AppStream/x86_64/os/
repo --name="Extras" --baseurl=https://dl.rockylinux.org/pub/rocky/9.0/extras/x86_64/os/
repo --name="CRB" --baseurl=https://dl.rockylinux.org/pub/rocky/9.0/CRB/x86_64/os/
repo --name="epel" --baseurl=https://mirror.init7.net/fedora/epel/9/Everything/x86_64/
repo --name="deepsquare" --baseurl=https://yum.deepsquare.run/9/
# Clear the Master Boot Record
zerombr
# Remove partitions
clearpart --all --initlabel
# Automatically create partition
part / --size=1 --grow --asprimary --fstype=xfs
# Postinstall
%post --erroronfail
set -ex
mkdir /opt/xcat
# Install xCat provisioning service
curl -fsSL "https://raw.githubusercontent.com/xcat2/xcat-core/master/xCAT/postscripts/xcatpostinit1.netboot" -o /opt/xcat/xcatpostinit1
chmod 755 /opt/xcat/xcatpostinit1
curl -fsSL "https://raw.githubusercontent.com/xcat2/xcat-core/master/xCAT/postscripts/xcatpostinit1.service" -o /etc/systemd/system/xcatpostinit1.service
ln -s "../xcatpostinit1.service" "/etc/systemd/system/multi-user.target.wants/xcatpostinit1.service"
# Postinstall
#-- Pam mkhomedir: auto create home folder for ldap users
sed -Ei 's|UMASK\t+[0-9]+|UMASK\t\t027|g' /etc/login.defs
#-- Secure umask for new users
echo 'umask 0027' >> /etc/profile
# Kickstart copies install boot options. Serial is turned on for logging with
# Packer which disables console output. Disable it so console output is shown
# during deployments
sed -i 's/^GRUB_TERMINAL=.*/GRUB_TERMINAL_OUTPUT="console"/g' /etc/default/grub
sed -i '/GRUB_SERIAL_COMMAND="serial"/d' /etc/default/grub
sed -ri 's/(GRUB_CMDLINE_LINUX=".*)\s+console=ttyS0(.*")/\1\2/' /etc/default/grub
# Clean up install config not applicable to deployed environments.
for f in resolv.conf fstab; do
rm -f /etc/$f
touch /etc/$f
chown root:root /etc/$f
chmod 644 /etc/$f
done
cat << EOF >>/etc/fstab
devpts /dev/pts devpts gid=5,mode=620 0 0
tmpfs /dev/shm tmpfs defaults 0 0
proc /proc proc defaults 0 0
sysfs /sys sysfs defaults 0 0
EOF
rm -f /etc/sysconfig/network-scripts/ifcfg-[^lo]*
dnf clean all
%end
%packages
@minimal-environment
chrony
# kernel
kernel-5.14.0-70.22.1.el9_0.x86_64
kernel-devel-5.14.0-70.22.1.el9_0.x86_64
kernel-headers-5.14.0-70.22.1.el9_0.x86_64
kernel-tools-5.14.0-70.22.1.el9_0.x86_64
kernel-modules-5.14.0-70.22.1.el9_0.x86_64
kernel-core-5.14.0-70.22.1.el9_0.x86_64
kernel-modules-extra-5.14.0-70.22.1.el9_0.x86_64
bash-completion
cloud-init
# cloud-init only requires python3-oauthlib with MAAS. As such upstream
# removed this dependency.
python3-oauthlib
rsync
tar
# disk growing
cloud-utils-growpart
# grub2-efi-x64 ships grub signed for UEFI secure boot. If grub2-efi-x64-modules
# is installed grub will be generated on deployment and unsigned which breaks
# UEFI secure boot.
grub2-efi-x64
efibootmgr
shim-x64
dosfstools
lvm2
mdadm
device-mapper-multipath
iscsi-initiator-utils
dnf-plugins-core
# other packages
net-tools
nfs-utils
openssh-server
rsync
tar
util-linux
wget
python3
tar
bzip2
bc
dracut
dracut-network
rsyslog
hostname
e2fsprogs
ethtool
parted
openssl
dhclient
openssh-clients
bash
vim-minimal
rpm
iputils
perl-interpreter
gawk
xz
squashfs-tools
cpio
sudo
make
bash-completion
nano
pciutils
git
mlocate
sssd
vim-enhanced
systemd-udev
numactl
munge
libevent-devel
tmux
oddjob
oddjob-mkhomedir
redis
unzip
nmap
flex
tk
bison
libgfortran
tcl
gcc-gfortran
libcurl
libnl3-devel
python39
numactl-libs
xfsprogs
zsh
#pkgconf-pkg-config
rpm-build
hwloc
hwloc-libs
hwloc-devel
tcsh
ksh
xorg-x11-fonts-ISO8859-1-75dpi.noarch
xorg-x11-fonts-cyrillic.noarch
# otherpkgs
htop
pmix4
slurm
slurm-contribs
slurm-libpmi
slurm-pam_slurm
slurm-slurmd
# beeond build dependency
elfutils-libelf-devel
-plymouth
# Remove Intel wireless firmware
-i*-firmware
%end
Build the image with:
packer build compute.my-cluster.json
Then import the OS image into xCAT. Follow the guide "Build an OS Image with Packer" for more details.
5.b. xCAT Postbootscripts
Next, you have to configure the slurmd service using an xCAT postscript. We recommend using an xCAT postscript that pulls a Git repository and, based on the content of that repository, copies files and executes the postscripts it contains.
This way, the GitOps practice is still followed, and the setup can adapt to future versions of ClusterFactory.
The slurmd service unit:
[Unit]
Description=Slurm node daemon
After=network.target munge.service
[Service]
Type=forking
ExecStartPre=/usr/bin/id slurm
Restart=always
RestartSec=3
ExecStart=/usr/sbin/slurmd -d /usr/sbin/slurmstepd --conf-server <controller host IP>
ExecReload=/bin/kill -HUP $MAINPID
PIDFile=/var/run/slurmd.pid
KillMode=process
LimitNOFILE=51200
LimitMEMLOCK=infinity
LimitSTACK=infinity
[Install]
WantedBy=multi-user.target
A simple postbootscript:
#!/bin/sh -ex
mkdir -p /var/log/slurm/
cat <<\END | base64 -d >/etc/munge/munge.key
...
END
chmod 600 /etc/munge/munge.key
cat <<\END >/etc/sssd/sssd.conf
...
END
chmod 600 /etc/sssd/sssd.conf
#-- Add enroot extra hooks for PMIx and PyTorch multi-node support
cp /usr/share/enroot/hooks.d/50-slurm-pmi.sh /usr/share/enroot/hooks.d/50-slurm-pytorch.sh /etc/enroot/hooks.d
# Enable Pyxis (container jobs)
cat <<\END >/etc/slurm/plugstack.conf.d/pyxis.conf
optional /usr/lib64/slurm/spank_pyxis.so runtime_path=/run/pyxis container_scope=job
END
cat <<\END >/etc/systemd/system/slurmd.service
[Unit]
Description=Slurm node daemon
After=network.target munge.service remote-fs.target
Wants=network-online.target
[Service]
Type=simple
Restart=always
RestartSec=3
ExecStart=/usr/sbin/slurmd -d /usr/sbin/slurmstepd --conf-server <controller host IP>
ExecReload=/bin/kill -HUP $MAINPID
PIDFile=/var/run/slurmd.pid
KillMode=process
LimitNOFILE=131072
LimitMEMLOCK=infinity
LimitSTACK=infinity
Delegate=yes
StandardOutput=null
StandardError=null
[Install]
WantedBy=multi-user.target
END
#-- Wait for LDAP
update-ca-trust
systemctl restart sssd
while ! id slurm
do
sleep 1
done
systemctl daemon-reload
systemctl restart munge
systemctl enable slurmd
systemctl start --no-block slurmd
After setting up SLURM, you should also:
- Mount the home directory of the LDAP users (probably something like /home/ldap-users)
- Use the postscript to configure SSSD
- Use the postscript to import the munge.key
5.c. Reboot the nodes
If the controller is running, the nodes should automatically receive the slurm.conf inside /run/slurm/conf.
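On a compute node, you can verify that slurmd started and fetched the configuration from the controller (a sketch):
# Run on the compute node after the reboot
systemctl status slurmd
ls /run/slurm/conf
# Client commands can use the fetched configuration explicitly
SLURM_CONF=/run/slurm/conf/slurm.conf sinfo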
6. Slurm Login Deployment
6.a. Secrets and Volumes
SSH Server configuration
The login nodes can be exposed to the external network using Multus CNI and the IPVLAN plugin. This exposes the srunPortRange and the SSH port.
If you don't plan to use srun and prefer sbatch, we recommend using a simple Kubernetes Service to expose the login nodes.
Thanks to SSSD, users can log in to the nodes over SSH with the passwords stored in LDAP.
We have to generate the SSH host keys:
yes 'y' | ssh-keygen -N '' -f ./ssh_host_rsa_key -t rsa -C login-node
yes 'y' | ssh-keygen -N '' -f ./ssh_host_ecdsa_key -t ecdsa -C login-node
yes 'y' | ssh-keygen -N '' -f ./ssh_host_ed25519_key -t ed25519 -C login-node
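You can also record the fingerprints of the generated host keys, so users can verify them on their first connection (a sketch):
# Print the fingerprint of each public host key
ssh-keygen -lf ./ssh_host_rsa_key.pub
ssh-keygen -lf ./ssh_host_ecdsa_key.pub
ssh-keygen -lf ./ssh_host_ed25519_key.pub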
Six files will be generated. We will also add our sshd_config.
- Create a -secret.yaml.local file:
apiVersion: v1
kind: Secret
metadata:
name: login-sshd-secret
namespace: slurm-cluster
type: Opaque
stringData:
ssh_host_ecdsa_key: |
-----BEGIN OPENSSH PRIVATE KEY-----
<FILL ME>
-----END OPENSSH PRIVATE KEY-----
ssh_host_ecdsa_key.pub: |
ecdsa-sha2-nistp256 <FILL ME>
ssh_host_ed25519_key: |
-----BEGIN OPENSSH PRIVATE KEY-----
<FILL ME>
-----END OPENSSH PRIVATE KEY-----
ssh_host_ed25519_key.pub: |
ssh-ed25519 <FILL ME>
ssh_host_rsa_key: |
-----BEGIN OPENSSH PRIVATE KEY-----
<FILL ME>
-----END OPENSSH PRIVATE KEY-----
ssh_host_rsa_key.pub: |
ssh-rsa <FILL ME>
sshd_config: |
Port 22
AddressFamily any
ListenAddress 0.0.0.0
ListenAddress ::
HostKey /etc/ssh/ssh_host_rsa_key
HostKey /etc/ssh/ssh_host_ecdsa_key
HostKey /etc/ssh/ssh_host_ed25519_key
PermitRootLogin prohibit-password
PasswordAuthentication yes
# Change to yes to enable challenge-response passwords (beware issues with
# some PAM modules and threads)
ChallengeResponseAuthentication no
UsePAM yes
X11Forwarding yes
PrintMotd no
AcceptEnv LANG LC_*
# override default of no subsystems
Subsystem sftp /usr/lib/openssh/sftp-server
AuthorizedKeysCommand /usr/bin/sss_ssh_authorizedkeys
AuthorizedKeysCommandUser root
Replace the <FILL ME> placeholders with the values based on the generated files.
- Seal the secret:
cfctl kubeseal
- Apply the SealedSecret:
kubectl apply -f argo/slurm-cluster/secrets/login-sshd-sealed-secret.yaml
Home directory for the LDAP users
If you have configured your LDAP server, you might have to change the homeDirectory to something like /home/ldap-users.
You must mount the home directory of the LDAP users using NFS.
DO NOT use a StorageClass, since the provisioning is static: we don't want to create a volume per replica, there is only one common volume.
apiVersion: v1
kind: PersistentVolume
metadata:
name: ldap-users-<cluster name>-pv
namespace: slurm-cluster
labels:
app: slurm-login
topology.kubernetes.io/region: <FILL ME> # <country code>-<city>
topology.kubernetes.io/zone: <FILL ME> # <country code>-<city>-<index>
spec:
capacity:
storage: 1000Gi
mountOptions:
- hard
- nfsvers=4.1
- noatime
- nodiratime
csi:
driver: nfs.csi.k8s.io
readOnly: false
volumeHandle: <unique id> # uuidgen
volumeAttributes:
server: <FILL ME> # IP or host
share: <FILL ME> # /srv/nfs/k8s/ldap-users
accessModes:
- ReadWriteMany
persistentVolumeReclaimPolicy: Retain
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
name: ldap-users-<cluster name>-pvc
namespace: slurm-cluster
labels:
app: slurm-login
topology.kubernetes.io/region: <FILL ME> # <country code>-<city>
topology.kubernetes.io/zone: <FILL ME> # <country code>-<city>-<index>
spec:
volumeName: ldap-users-<cluster name>-pv
accessModes: [ReadWriteMany]
storageClassName: ''
resources:
requests:
storage: 1000Gi
Apply the PV and PVC:
kubectl apply -f argo/slurm-cluster/volumes/ldap-users-<cluster name>-pv.yaml
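Check that the claim is bound to the volume (a sketch):
# Both STATUS columns should report "Bound"
kubectl get pv ldap-users-<cluster name>-pv
kubectl get pvc -n slurm-cluster ldap-users-<cluster name>-pvc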
6.b. Values: Enable SLURM Login
login:
enabled: true
replicas: 1
sshd:
secretName: login-sshd-secret
nodeSelector:
topology.kubernetes.io/region: <FILL ME> # <country code>-<city>
topology.kubernetes.io/zone: <FILL ME> # <country code>-<city>-<index>
# Extra volume mounts
volumeMounts:
- name: ldap-users-pvc
mountPath: /home/ldap-users
# Extra volumes
volumes:
- name: ldap-users-pvc
persistentVolumeClaim:
claimName: ldap-users-<cluster name>-pvc
service:
enabled: true
type: ClusterIP
# Use LoadBalancer to expose via MetalLB
# type: LoadBalancer
# annotations:
# metallb.universe.tf/address-pool: slurm-ch-basel-1-pool
# Expose via IPVLAN, can be unstable.
# Using IPVLAN permits srun commands.
net:
enabled: false
# Kubernetes host interface
type: ipvlan
masterInterface: eth0
mode: l2
# https://www.cni.dev/plugins/current/ipam/static/
ipam:
type: static
addresses:
- address: 192.168.0.20/24
gateway: 192.168.0.1
Edit the values accordingly.
Service or IPVLAN?
A Kubernetes Service offers a lot of advantages, while IPVLAN solves a specific problem.
We strongly recommend using a Kubernetes Service to expose your login node, as it provides load balancing and is easy to configure.
| Kubernetes LoadBalancer Service | Multus CNI |
| --- | --- |
| A LoadBalancer service provides limited control over networking, as it only provides a single IP address for a Kubernetes service. | IPVLAN with Multus allows you to have more fine-grained control over networking by enabling you to use multiple network interfaces in a pod, each with its own IP address and route table. |
| A LoadBalancer service is a simple and straightforward way to expose a Kubernetes service to the internet. | Setting up IPVLAN with Multus can be more complex than using a simple LoadBalancer service, as it requires more configuration and setup time. |
| A LoadBalancer service can only expose a set of ports. | Using IPVLAN with Multus allows a pod to connect directly to the host network. |
As a result, using a Kubernetes LoadBalancer service will render the SLURM srun commands inoperable (although sbatch will work and is the preferred method for job submission). On the other hand, adopting Multus CNI eliminates the load balancing feature, but could lead to instability.
Because k8s-pod-network is the default network, you must write routes to your networks.
For example, if you have two sites 10.10.0.0/24 and 10.10.1.0/24, you would write:
ipam:
type: static
addresses:
- address: 192.168.0.20/24
gateway: 192.168.0.1
routes:
- dst: 10.10.1.0/24
If you kubectl exec into the container and run ip route, you should see:
# ip route
default via 169.254.1.1 dev eth0
10.10.0.0/24 via 10.10.0.1 dev net1
10.10.1.0/24 via 10.10.0.1 dev net1
169.254.1.1 dev eth0 scope link
10.10.0.0/20 via 10.10.0.1 dev net1
10.10.0.0/20 dev net1 proto kernel scope link src 10.10.0.21
The issue is tracked at SquareFactory/ClusterFactory#29 and projectcalico/calico#5199.
If you are using LDAPS and the CA is private, add these values:
login:
# ...
command: ['sh', '-c', 'update-ca-trust && /init']
volumeMounts:
- name: ldap-users-pvc
mountPath: /home/ldap-users
- name: ca-cert
mountPath: /etc/pki/ca-trust/source/anchors/example.com.ca.pem
subPath: example.com.ca.pem
volumes:
- name: ldap-users-pvc
persistentVolumeClaim:
claimName: ldap-users-<cluster name>-pvc
- name: ca-cert
secret:
secretName: local-ca-secret
local-ca-secret is a Secret containing example.com.ca.pem.
You can deploy it:
git add .
git commit -m "Added SLURM Login values"
git push
# This is optional if the application is already deployed.
kubectl apply -f argo/slurm-cluster/apps/slurm-cluster-<cluster name>-app.yaml
6.c Testing: Access to a SLURM Login node
Because the container is exposed to the external network, you should be able to ssh directly to the login node.
ssh user@login-node
If the user user exists in LDAP, the login node should ask for a password.
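Once logged in, you can check that the login node reaches the controller and can submit a job (a sketch):
# From the login node, as an LDAP user
sinfo
sbatch --wrap "hostname"
squeue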