Session 2: BHT Cluster and Experiments Logging

Learning goals

  • Understand cluster concepts for Data Science with Kubernetes

    • Pods, Deployments, Services, Jobs, ...

  • Run interactive and batch workloads on our BHT cluster

  • Work resource-aware (CPU, RAM, GPU, storage)

  • Log experiments with Weights & Biases (WandB) to track while running on the cluster

BHT cluster

Before we start with the tutorial, let’s take a look at the GPUs:

GPU Monitoring in Grafana

⚠️ Please be mindful of the resources you use on the cluster. PhDs and students are sharing the same cluster, so we need to be considerate of each other.

Kubernetes

  • Open-source system for managing containerized apps

  • Deploys your apps and keeps them alive

  • Second-largest open-source project after Linux

Kubernetes core concepts

  • Pod: Smallest deployable unit, runs one or more (Docker) containers

  • Job: One-off task that runs to completion and then stops

  • Deployment: Keeps desired number of pods alive, restarts if they fail

  • PVC: Persistent Volume Claim, requests persistent storage

  • Secret: Stores sensitive info (like SSH keys)

  • Others: Namespaces, CronJobs, Services ...

Docker concepts

Docker
  • Image: Blueprint for a container, includes code and dependencies

  • Container: Running instance of an image, isolated environment

  • Dockerfile: Instructions to build an image

  • Docker Hub: Public registry for sharing images

Example Dockerfile:

# Use a parent image
FROM python:3.10-slim
# Set the working directory in the container
WORKDIR /app
# Copy the current directory contents into the container at /app
COPY . /app
# Install packages
RUN pip install --no-cache-dir -r requirements.txt
# Run main.py when the container launches
CMD ["python", "main.py"]

Prerequisites for working on our cluster

You should have done the following already (as described at the end of Session 1):

Recommended: VSCode (Tutorial will show an example for VSCode)

Setting up your namespace

For the following tutorial (and in general), it is easier to set your namespace as the default context in kubectl. This way, you don’t have to specify the namespace for every command.

To set your namespace as the default context, run the following command:

kubectl config set-context oidc_ds_cluster --namespace=<campus_account>
  • Replace <campus_account> with your actual campus account name.

First Python code on the cluster

Example: Printing “Hello from the cluster!” using a Job
Step 1: Download the Cluster Files (.zip) from the downloads section and unzip them.
Step 2: Store it somewhere on your computer and open hello-cluster-job.yml with an editor (e.g. VSCode).
Step 3: Send it to the cluster:

kubectl apply -f hello-cluster-job.yml

Step 4: Check the status of your job:

kubectl get jobs

Step 5: Check the logs of your job:

kubectl logs job/hello-cluster-job

You should see the following output: Hello from the cluster!

Step 6: Clean up the job from the cluster:

kubectl delete job hello-cluster-job
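
The contents of hello-cluster-job.yml are not reproduced in this section. As a rough orientation, a Job of this kind might look like the following sketch. It is written as a small Python script that prints an equivalent JSON manifest (kubectl accepts JSON as well as YAML). The image python:3.10-slim, the container name hello, and the backoffLimit value are assumptions for illustration, not values taken from the course file.

```python
import json


def hello_job_manifest(name="hello-cluster-job", image="python:3.10-slim"):
    """Build a minimal Kubernetes Job manifest as a plain dict.

    The image and the container name are illustrative assumptions.
    """
    return {
        "apiVersion": "batch/v1",
        "kind": "Job",
        "metadata": {"name": name},
        "spec": {
            # Do not retry on failure, so a failed run stays visible.
            "backoffLimit": 0,
            "template": {
                "spec": {
                    # Jobs only allow restartPolicy "Never" or "OnFailure".
                    "restartPolicy": "Never",
                    "containers": [
                        {
                            "name": "hello",
                            "image": image,
                            "command": [
                                "python",
                                "-c",
                                "print('Hello from the cluster!')",
                            ],
                        }
                    ],
                }
            },
        },
    }


if __name__ == "__main__":
    # Print the manifest so it can be redirected into a file.
    print(json.dumps(hello_job_manifest(), indent=2))
```

Saved under a hypothetical name such as make_job.py, the script could be used with `python make_job.py > hello-job.json` followed by `kubectl apply -f hello-job.json`.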

Prototyping Workflow: SSH into your pod with VSCode

  • You can use VSCode’s Remote Development Extension to SSH into your pod and work on the cluster directly from your editor.

  • This allows you to run code, edit files, and monitor resources without leaving VSCode.

Prerequisites

  • A password-protected SSH key pair for authentication

  • A Kubernetes secret that injects the public SSH key into the pod

Step 1: Generate a new SSH key pair

ssh-keygen
  • Follow the prompts to create a new key pair (e.g., id_rsa_cluster and id_rsa_cluster.pub)

  • Make sure to set a password for the private key

  • Save the keys in a secure location on your computer (usually in ~/.ssh/)

Step 2: Create a Kubernetes secret with your public SSH key (ends with .pub)

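kubectl create secret generic ssh-key-secret --from-file=authorized_keys=$HOME/.ssh/id_rsa_cluster.pub
  • Replace $HOME/.ssh/id_rsa_cluster.pub with the actual path to your public SSH key if it’s different (use $HOME or the full path here, since most shells do not expand ~ in the middle of an argument)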

  • This command creates a secret on the cluster named ssh-key-secret that contains your public SSH key

  • The secret can be mounted into your pod to allow SSH access. To verify that it exists, run:

kubectl get secrets

Step 3: Create a local SSH config file (if you don’t have one already)
For Mac/Linux:

touch ~/.ssh/config

For Windows (PowerShell):

New-Item -Path $env:USERPROFILE\.ssh\config -ItemType File

Open the config file in an editor (tip for VSCode: code ~/.ssh/config) and add the following configuration:

Host bht-cluster
    # We connect through port forwarding
    HostName localhost
    # This is the port you will forward to your pod
    Port 2222
    User root
    # Path to your private SSH key
    IdentityFile ~/.ssh/id_rsa_cluster

The prerequisites are now in place. Next, we can set up a pod.

Setting up Your Own Pod (Docker + Kubernetes)

Step 1: Claim a persistent volume with a PVC, for storing code and data:

  • Open the file storage-pvc.yml from the downloaded cluster files:

    apiVersion: v1
    kind: PersistentVolumeClaim
    metadata:
      name: dsw-pvc
    spec:
      accessModes:
        - ReadWriteOnce # Use ReadWriteMany if you have multiple Pods needing to write to the volume
      resources:
        requests:
          storage: 5Gi # Start low, you can always increase but NOT decrease the storage size later
  • Apply it to the cluster:

    kubectl apply -f storage-pvc.yml
  • This creates a persistent volume claim named dsw-pvc that requests 5 GiB of storage. You can check the status of your PVC with:

    kubectl get pvc

Step 2: Create a docker image with an SSH server

  • Open the file Dockerfile from the downloaded cluster files:

  • Navigate to the directory where you downloaded the Dockerfile and build the image:

    docker build -t ssh-server-image --platform linux/amd64 .
  • Start the container locally to test it:

    docker run -p 2222:22 ssh-server-image
  • This command runs the container and forwards port 22 inside the container to port 2222 on your local machine, allowing you to SSH into it using the configuration we set up earlier.

Step 3: Push the image to a container registry
You need to push your image to a container registry that our cluster can access.

  1. Example: Docker Hub (PUBLIC)

    • Name your image with your Docker Hub username:

      docker tag ssh-server-image <your-dockerhub-username>/ssh-server-image:latest
    • Optional: If not already done via e.g. Docker Desktop, login:

      docker login
    • Push to Docker Hub:

      docker push <your-dockerhub-username>/ssh-server-image:latest
  2. For private registries, have a look at the cluster documentation: https://docs.cluster.ris.bht-berlin.de/user/images/

Step 4: Create a Kubernetes deployment that uses your image and mounts the PVC

  • Open the file remote-deployment.yml from the downloaded cluster files:

  • Modify the following parts:

    • Under containers.image, replace your-dockerhub-username/ssh-server-image:latest with the name of your image in the container registry.

    • Under volumes, make sure the claimName matches the name of your PVC (e.g., dsw-pvc).

    • Under volumes, make sure the secretName matches the name of your SSH key secret (e.g., ssh-key-secret).

  • Apply the deployment to the cluster:

    kubectl apply -f remote-deployment.yml
  • Check the status of your deployment:

    kubectl get deploy
  • Check the status of your pod:

    kubectl get pods
  • Check whether your pod is using the volume:

    kubectl describe pod <pod-name>
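
The file remote-deployment.yml is not reproduced here. To illustrate how the image, the PVC, and the SSH key secret fit together, the following sketch builds a comparable Deployment manifest as a Python dict that can be dumped to JSON for kubectl. The names dsw-deployment, dsw-pvc, ssh-key-secret, and the /storage mount come from this tutorial. The container name, the label, the volume names, and the /root/.ssh mount path for the key are assumptions and may differ from the course file.

```python
import json


def remote_deployment_manifest(
    image,
    name="dsw-deployment",
    claim_name="dsw-pvc",
    secret_name="ssh-key-secret",
):
    """Build a Deployment that runs one SSH-server pod with a PVC and a key secret."""
    labels = {"app": name}  # assumed label; it links the selector to the pod template
    container = {
        "name": "ssh-server",  # assumed container name
        "image": image,
        "ports": [{"containerPort": 22}],
        "volumeMounts": [
            # Persistent storage, mounted at /storage as in this tutorial.
            {"name": "storage", "mountPath": "/storage"},
            # Assumed location for authorized_keys of the root user.
            {"name": "ssh-key", "mountPath": "/root/.ssh", "readOnly": True},
        ],
    }
    volumes = [
        {"name": "storage", "persistentVolumeClaim": {"claimName": claim_name}},
        # JSON has no octal literals: 0o600 is serialized as 384.
        {"name": "ssh-key", "secret": {"secretName": secret_name, "defaultMode": 0o600}},
    ]
    return {
        "apiVersion": "apps/v1",
        "kind": "Deployment",
        "metadata": {"name": name},
        "spec": {
            "replicas": 1,
            "selector": {"matchLabels": labels},
            "template": {
                "metadata": {"labels": labels},
                "spec": {"containers": [container], "volumes": volumes},
            },
        },
    }


if __name__ == "__main__":
    manifest = remote_deployment_manifest("<your-dockerhub-username>/ssh-server-image:latest")
    print(json.dumps(manifest, indent=2))
```

The point to notice is that each entry under volumes is referenced by name from volumeMounts, which is why claimName and secretName have to match the objects created in the earlier steps.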

Once your pod is running you can proceed and connect to it via SSH.

Step 5: Port forward to your pod to enable SSH access
Kubernetes pods are not directly accessible from outside the cluster. To SSH into your pod, you need to set up port forwarding from your local machine to the pod.

  1. Start port-forwarding in a terminal:

    kubectl port-forward <pod-name> 2222:22
  • Replace <pod-name> with the actual name of your pod (you can get it from kubectl get pods)

  • This command forwards port 22 in the pod to port 2222 on your local machine.

  • Keep this terminal running as long as you want to have SSH access. It’s your live bridge to the pod.

  2. Connect to the pod using SSH:

    ssh bht-cluster
  • This uses the SSH configuration we set up earlier to connect to the pod through the forwarded port.

  • You should be prompted for the password of your SSH key.

Congratulations! You are now connected to your pod on the cluster via SSH!
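
If the SSH connection fails, it helps to check whether an SSH server is actually answering on the forwarded port. The following is a minimal sketch using only the Python standard library. It assumes the port-forward from Step 5 is running on localhost:2222 and relies on the fact that an SSH server introduces itself with a line starting with "SSH-". The function name ssh_banner is made up for this example.

```python
import socket
from typing import Optional


def ssh_banner(host: str = "localhost", port: int = 2222, timeout: float = 3.0) -> Optional[str]:
    """Return the SSH identification line sent by host:port, or None if there is none."""
    try:
        with socket.create_connection((host, port), timeout=timeout) as conn:
            data = conn.recv(256)
    except OSError:
        # Covers refused connections and timeouts.
        return None
    text = data.decode("ascii", errors="replace")
    for line in text.splitlines():
        if line.startswith("SSH-"):
            return line.strip()
    return None


if __name__ == "__main__":
    banner = ssh_banner()
    print(banner if banner else "No SSH server reachable on localhost:2222")
```

A result of None usually means that the port-forward terminal has been closed or that the pod is not running yet. It is worth checking kubectl get pods first in that case.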

Step 6: Connect your VSCode to the pod

  • In VSCode, install the “Remote Development” extension pack if you haven’t already.

  • Click on the “Remote Window Icon” (“><”) in the bottom left corner or press ‘Ctrl+Shift+P’ and select “Remote-SSH: Connect to Host...”.

    Connect to Host
  • On the dropdown, select “bht-cluster” (or whatever you named your host in the SSH config) and enter the password for your SSH key when prompted.

  • Once connected, you might want to install some VSCode extensions in the remote environment (e.g. Python extension) to make your life easier when working on the cluster.

    • Tip: You can install all extensions from your local VSCode in the remote environment. Open the command palette (Ctrl+Shift+P) and select “Remote-SSH: Install Local Extensions in Remote...”.

      Install Extensions
  • You often need to reload the window afterwards. To do so, press ‘Ctrl+Shift+P’ and select “Reload Window”.

You can now open terminals, edit files, and run code on the cluster directly from VSCode! This allows you to work on the cluster as if it were your local machine.

Don’t forget

  • We have mounted the PVC to /storage in the container, so save everything you want to keep on the cluster to that directory.

  • EVERYTHING outside will be lost once the pod is restarted.

Step 7: IMPORTANT! Always shut down your deployment once you’re done!

  • Unlike jobs, deployments will keep running and don’t shut down automatically.

  • They keep blocking resources! (People will hate you for that, especially if you are using GPUs)

    You can scale down your deployment to zero replicas to stop it:

    kubectl scale deployment dsw-deployment --replicas=0

    or you can delete the deployment entirely:

    kubectl delete deploy dsw-deployment
  • Next time you want to use it again, you can scale it back up:

    kubectl scale deployment dsw-deployment --replicas=1

    or re-apply the deployment file:

    kubectl apply -f remote-deployment.yml

Connect a terminal to your pod

  • You can also connect a terminal to your pod without SSH, using kubectl exec:

    kubectl exec -it <pod-name> -- /bin/bash

How to get code and data onto the cluster?

There are many ways, here are a few common ones:

  1. Docker Image: You can build a Docker image that contains your code and dependencies by copying them into the image in your Dockerfile:

    COPY <src-path> <destination-path>
    • Data can also be copied into the image, but this is not recommended for large datasets. Better to use a PVC for that.

  2. Git: You can clone a Git repository directly inside your pod. This also allows for version control.

    • However, you need to set up Git credentials and authenticate it every time your pod restarts.

  3. Kubectl cp: You can copy files from your local machine to the pod using kubectl cp:

    kubectl cp <local-file-path> <pod-name>:<destination-path>
    • This is very useful, you can also copy files directly into the PVC if it is mounted in the pod.

    • However, depending on your upload speed and the size of the files, this can be slow.

  4. Curl: You can use curl to download files directly into the pod:

    curl -o <destination-file-path> <file-url>
    • Download speed on the cluster is really fast, so this is a good option for large files.

    • You can also download files from the BHT cloud (Nextcloud) like that, if you have a public share link for the file or folder.

    • If you download a folder structure as a zip file, you need to unzip it on the cluster. Install the unzip package if it’s not already available.
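
If neither curl nor unzip is installed in your image but Python is, the download-and-extract step can also be done with the standard library. The following is a minimal sketch under that assumption. The function name download_and_unzip and the temporary archive name download.zip are made up for this example, and <file-url> stays a placeholder for your own link.

```python
import shutil
import urllib.request
import zipfile
from pathlib import Path


def download_and_unzip(url: str, dest_dir: str, archive_name: str = "download.zip") -> list:
    """Download a zip archive from url, extract it into dest_dir, return the member names."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    archive = dest / archive_name

    # Stream the response to disk so large files do not have to fit in memory.
    with urllib.request.urlopen(url) as response, open(archive, "wb") as out:
        shutil.copyfileobj(response, out)

    with zipfile.ZipFile(archive) as zf:
        zf.extractall(dest)
        names = zf.namelist()

    archive.unlink()  # remove the archive once its contents are extracted
    return names
```

A call such as download_and_unzip("<file-url>", "/storage/data") would place the extracted files on the PVC, so they survive a pod restart.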

How to keep processes running after disconnecting or closing the terminal?

Use screen for creating detachable terminal sessions on the cluster. Your processes will keep running even if you disconnect or close the terminal.

  • Start a session: screen -S dsw

  • Reconnect later: screen -r dsw

  • List active sessions: screen -ls

For Jupyter Notebook fans

You can also work with Jupyter notebooks in VSCode on the cluster. Your environment just needs to include the Jupyter dependencies and the “Jupyter” extension.

Alternatively, you can use the managed JupyterHub on the cluster: https://jupyter.cluster.ris.bht-berlin.de

Final Advice: Cluster Documentation is your friend

If you encounter any issues, check the cluster documentation:

Cluster Prototyping vs Cluster Jobs

Prototyping

  • Fast for exploration and debugging

  • Best when requirements are still changing

  • Risk: manual steps are harder to repeat

Jobs

  • Best for repeatable training or evaluation runs

  • Runs unattended and is easier to reproduce

  • Risk: slower to debug and needs more setup

Rule of thumb: prototype first, then move stable workflows into jobs.

Resource awareness

How to request resources for your workloads on the cluster?
Important:

  • Always request only what you need, especially for GPUs!

  • When using GPUs, your job / deployment will be killed automatically if it idles for more than 4 hours.

GPU request example

    ...
      resources:
        requests:
          nvidia.com/gpu: 1 # Request 1 GPU
        limits:
          nvidia.com/gpu: 1
    ...
    nodeSelector:
        gpu: k80
    ...

PVC example

  resources:
    requests:
      storage: 10Gi
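
The examples above cover GPUs and storage. CPU and RAM are requested in the same resources section of a container. The following sketch builds such a section as a Python dict, for instance to plug into a manifest that is generated from a script. The default values "2" and "8Gi" are placeholders, not recommendations for this cluster, and setting limits equal to requests is just one conservative choice.

```python
def container_resources(cpu: str = "2", memory: str = "8Gi", gpus: int = 0) -> dict:
    """Build the resources section of a container spec.

    cpu and memory use Kubernetes quantity strings, e.g. "500m" or "8Gi".
    """
    resources = {
        "requests": {"cpu": cpu, "memory": memory},
        "limits": {"cpu": cpu, "memory": memory},
    }
    if gpus > 0:
        # If a GPU appears in both requests and limits, the two values must be equal.
        resources["requests"]["nvidia.com/gpu"] = gpus
        resources["limits"]["nvidia.com/gpu"] = gpus
    return resources
```

For example, container_resources(cpu="4", memory="16Gi", gpus=1) describes a container with four CPU cores, 16 GiB of RAM, and one GPU. Calling it without gpus leaves the GPU keys out entirely, so no GPU is blocked by accident.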

Experiment logging with WandB

WandB
  • Compare experiments across parameter choices

  • Track artifacts, metrics, and training curves online in a web interface

  • Enable reproducibility and team collaboration

  • Hyperparameter sweeps for automatic tuning

Create an account here: https://wandb.ai

Minimal setup

Install the dependency and log in to your WandB account:

pip install wandb
wandb login

Short example:

import wandb

wandb.init(project="dsw-2026", config={"lr": 1e-3, "batch_size": 32})
wandb.log({"train_loss": 0.42, "val_iou": 0.71})
wandb.finish()

WandB example on the cluster

Download the wandb_example.py file from the downloads section and run it on the cluster.

Bonus: Use time.sleep(2) to simulate a longer training time and run the script in a screen session on the cluster. This lets you watch the training progress live in the WandB dashboard, even after you disconnect.
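
The bonus idea can be sketched as follows. This is not the content of wandb_example.py. It is a made-up loop that logs an artificially decaying loss once per epoch. The helper simulate_training, the metric values, and the config keys are assumptions. Only wandb.init, wandb.log, and wandb.finish are taken from the short example above.

```python
import math
import time


def simulate_training(log, epochs: int = 10, sleep_s: float = 2.0) -> None:
    """Fake training loop: sleeps, then logs a decaying loss once per epoch."""
    for epoch in range(epochs):
        time.sleep(sleep_s)  # stand-in for the real training work
        log({"epoch": epoch, "train_loss": math.exp(-0.3 * epoch)})


def main() -> None:
    # Imported here so that simulate_training can be tried out without WandB installed.
    import wandb

    config = {"epochs": 10, "sleep_s": 2.0}
    wandb.init(project="dsw-2026", config=config)
    simulate_training(wandb.log, **config)
    wandb.finish()


# To run with WandB logging, call main() at the end of the script.
```

Started inside screen -S dsw, the script keeps logging after you detach with Ctrl+A followed by D, and the loss curve fills in live in the dashboard.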

WandB Dashboard

Summary Session 2

  • BHT cluster provides powerful resources for Data Science workloads

  • Kubernetes and Docker concepts

  • Prototyping workflow with SSH and VSCode Remote Development

  • Resource-aware workloads

  • Experiment logging with Weights & Biases (WandB) to track while running on the cluster