Operations Runbook


How to restart safely

Postgres data is stored on a Docker volume (postgres-db-volume) and survives container restarts, so DAG run history, Variables, Connections, and XComs are not lost.

cd /home/airflow
docker compose down
docker compose up -d

After bringing the stack up, check that all services are healthy:

docker compose ps
# All services should show "healthy" or "running"
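
For a quick pass/fail view of the same check, the sketch below filters container status lines and prints only the ones that look wrong. It is a minimal sketch, not part of the stack: the function name flag_unhealthy is hypothetical, and it assumes the stack's container names contain "airflow" (as airflow-airflow-worker-1 and airflow-postgres-1 do) and that the status text matches what docker ps usually prints.

```shell
# Hypothetical helper: reads "name|status" lines on stdin and prints the
# containers that are unhealthy, restarting, or exited with a non-zero code.
# Returns 1 if anything was flagged, 0 otherwise.
flag_unhealthy() {
    awk -F'[|]' '
        BEGIN { bad = 0 }
        $2 ~ /\(unhealthy\)|^Restarting|^Exited \([1-9]/ {
            print $1 ": " $2
            bad = 1
        }
        END { exit bad }
    '
}

# Usage (assumes container names contain "airflow"):
#   docker ps -a --filter name=airflow --format '{{.Names}}|{{.Status}}' | flag_unhealthy
```

A container that exited with code 0 (such as airflow-init after a successful run) is not flagged, and neither is one whose health check is still starting.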

The airflow-init service will run DB migrations again (harmless if there are no new migrations) and will exit with code 0 before the other services become fully available.

Wait for airflow-init to complete

The webserver and scheduler depend on airflow-init completing successfully (condition: service_completed_successfully). If airflow-init fails, the other services will not start. Check its logs:

docker compose logs airflow-init
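
If you would rather block until airflow-init has finished than re-check by hand, a polling loop along these lines may help. It is a sketch under assumptions: the container name airflow-airflow-init-1 is inferred from the naming pattern of the other containers, and the function name wait_for_init and the POLL_SECONDS variable are hypothetical.

```shell
# Hypothetical helper: poll the init container until it exits.
# Returns 0 on exit code 0, 1 on any other exit code, 2 on timeout.
wait_for_init() {
    container="${1:-airflow-airflow-init-1}"   # assumed container name
    tries="${2:-60}"                           # number of polls before giving up
    while [ "$tries" -gt 0 ]; do
        state=$(docker inspect --format '{{.State.Status}} {{.State.ExitCode}}' "$container" 2>/dev/null)
        case "$state" in
            "exited 0") return 0 ;;
            exited\ *)  echo "airflow-init failed: $state" >&2; return 1 ;;
        esac
        tries=$((tries - 1))
        sleep "${POLL_SECONDS:-5}"
    done
    echo "timed out waiting for $container" >&2
    return 2
}

# Usage:
#   wait_for_init || docker compose logs airflow-init
```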

Rebuilding the custom image

When the Dockerfile is modified (e.g., new Python package, system dependency):

cd /home/airflow
docker compose build --no-cache
docker compose up -d

Don't forget --no-cache. Without it, Docker may reuse stale cached layers and the rebuilt image may not pick up the change.

After the rebuild, verify that the new package is available:

docker exec airflow-airflow-worker-1 python -c "import <new_package>"
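
When several packages were added at once, the same import check can be looped. The sketch below is a hypothetical convenience wrapper (the name verify_imports is not part of the stack); it runs the command shown above once per module name and reports which imports fail.

```shell
# Hypothetical helper: try to import each named module inside the worker
# container. Returns 1 if any import fails.
verify_imports() {
    failed=0
    for module in "$@"; do
        if docker exec airflow-airflow-worker-1 python -c "import $module"; then
            echo "ok      $module"
        else
            echo "MISSING $module"
            failed=1
        fi
    done
    return "$failed"
}

# Usage (module names are placeholders):
#   verify_imports <new_package_1> <new_package_2>
```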

Checking the daily schedule window

To confirm that today's pipeline is on track, check these in order in the Airflow UI:

  1. By 08:00 UTC: download_ifs_an, download_ifs_an00, download_ifs_fc should be green or yellow (retrying). If all three are red before 10:00 UTC, the IFS source may be unavailable.
  2. By 10:30 UTC: ifs_process (both Lucia and local) should be green.
  3. By 14:00 UTC: model_lucia_run should be green (model submitted to Slurm).
  4. By 18:00 UTC: model_lucia_postprocess should be green (model finished, files on marines server).
  5. By 16:00 UTC: count_files_MDS result tells you whether today's CMEMS products are published.

If any DAG is stuck in the retrying state at an unexpected hour, check the task logs in the UI to see what specific error is repeating.
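
As a memory aid, the deadlines in the list above can be turned into a small lookup that prints which checkpoints should already be complete at a given UTC time. This is only a sketch: the function name due_checks is hypothetical, the labels are copied from the list and are not necessarily exact DAG IDs, and it does not query Airflow at all.

```shell
# Hypothetical helper: print the checkpoints whose deadline (UTC, HHMM)
# has already passed. Defaults to the current UTC time.
due_checks() {
    now="${1:-$(date -u +%H%M)}"
    while read -r deadline label; do
        if [ "$now" -ge "$deadline" ]; then
            echo "$label"
        fi
    done <<'EOF'
0800 download_ifs_an
0800 download_ifs_an00
0800 download_ifs_fc
1030 ifs_process
1400 model_lucia_run
1800 model_lucia_postprocess
1600 count_files_MDS
EOF
}

# Usage:
#   due_checks         # checkpoints due as of now
#   due_checks 1100    # checkpoints due as of 11:00 UTC
```

Anything the function prints should already be green in the UI; anything it does not print is not yet due.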


Manually clearing and re-running a failed task

When a task fails and you want to re-run it after fixing the underlying problem:

  1. Open the Airflow UI.
  2. Go to the DAG and click the failed DAG run.
  3. Click the failed task (red box).
  4. Click Clear and confirm. Airflow will re-queue the task.

To clear and re-run all tasks from a specific point forward:

  1. Click the task you want to restart from.
  2. Click Clear and enable the "Downstream" checkbox.
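
The same clearing can usually be done from the command line with airflow tasks clear, which may be handy when the UI is unavailable. The wrapper below is a sketch under assumptions: it assumes the Airflow CLI is usable inside the airflow-airflow-worker-1 container and that the installed Airflow version accepts these options; the function name clear_task is hypothetical.

```shell
# Hypothetical helper: clear one task (optionally with its downstream tasks)
# for DAG runs between two dates, using the Airflow CLI in the worker container.
#   clear_task <dag_id> <task_id> <start_date> <end_date> [downstream]
clear_task() {
    dag_id="$1"; task_id="$2"; start="$3"; end="$4"; downstream="$5"
    # Anchor the regex so only the exact task ID matches.
    set -- airflow tasks clear --yes --task-regex "^${task_id}\$" \
        --start-date "$start" --end-date "$end"
    if [ "$downstream" = "downstream" ]; then
        set -- "$@" --downstream
    fi
    docker exec airflow-airflow-worker-1 "$@" "$dag_id"
}

# Usage (dates are placeholders):
#   clear_task <dag_id> <task_id> <start_date> <end_date> downstream
```

The --yes flag skips the confirmation prompt, so double-check the DAG ID and date range before running it.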

Clearing compress_results

If compress_results was interrupted mid-compression, check for .tmp orphan files before re-running:

find /mnt/md0/NRT_V2025/out/ -name '*.tmp' -type f

Delete any orphans, then re-run by clearing the compress_files task.


Recovering from a lost submitted_job Variable

If model_lucia_postprocess cannot find the submitted_job Variable (e.g., because the Variable was accidentally deleted, or model_lucia_run did not complete successfully):

  1. Log into Lucia via the gateway:

    ssh <lucia-gateway>
    ssh frontal "squeue -u lvdbk"
    
  2. Note the Slurm job ID from the queue listing.

  3. In the Airflow UI, set the Variable manually:

    Go to Admin > Variables and add a new Variable with key submitted_job and the Slurm job ID (an integer) as its value.

  4. Clear the get_job_number task in model_lucia_postprocess to re-run from that point.
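
Step 3 can also be done from the command line with airflow variables set. The sketch below adds a guard so that only a purely numeric job ID is written. It assumes the Airflow CLI is usable inside the airflow-airflow-worker-1 container; the function name set_submitted_job is hypothetical.

```shell
# Hypothetical helper: validate the Slurm job ID, then store it in the
# submitted_job Variable through the Airflow CLI.
set_submitted_job() {
    job_id="$1"
    case "$job_id" in
        ''|*[!0-9]*)
            echo "not a numeric Slurm job ID: '$job_id'" >&2
            return 1
            ;;
    esac
    docker exec airflow-airflow-worker-1 airflow variables set submitted_job "$job_id"
}

# Usage:
#   set_submitted_job <slurm_job_id>
```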


Scheduled restart

No automatic scheduled restart is currently configured. If the stack is restarted manually, check the Airflow UI for any DAG runs that were in flight at the time. Tasks that were running when the stack went down will be marked failed by Airflow after it recovers, and you may need to clear and re-run them.


Backups

Airflow metadata (Postgres)

The metadata database is in the postgres-db-volume Docker volume. To back it up:

docker exec airflow-postgres-1 pg_dump -U airflow airflow > airflow_backup_$(date +%Y%m%d).sql

This includes all DAG run history, Variables, Connections, and XComs.
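
A plain shell redirect leaves an empty or partial .sql file behind if pg_dump fails. The sketch below wraps the same command so that the backup file only appears under its final name when the dump succeeded and is non-empty. The function name backup_metadata and the .partial suffix are hypothetical.

```shell
# Hypothetical helper: dump the metadata database into <dest_dir> and only
# keep the file if pg_dump succeeded and produced output.
backup_metadata() {
    dest_dir="${1:-.}"
    out="$dest_dir/airflow_backup_$(date +%Y%m%d).sql"
    if docker exec airflow-postgres-1 pg_dump -U airflow airflow > "$out.partial" \
        && [ -s "$out.partial" ]; then
        mv "$out.partial" "$out"
        echo "$out"
    else
        rm -f "$out.partial"
        echo "pg_dump failed; no backup written" >&2
        return 1
    fi
}

# Usage:
#   backup_metadata <backup_directory>
```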

Fernet key

The Fernet key is in /home/airflow/.env. Without it, encrypted Connection passwords stored in Postgres are unreadable. Back up this file securely and separately from the database.

Model data

Model output on /mnt/md0/NRT_V2025/out/ is not automatically backed up by Airflow. The GoogleCloud_backup_run DAG handles long-term model output archiving to GCS.


Known issues

Root containers

All service containers run as user: "0:0" (root). This is a temporary workaround for a file-permission issue on the /mnt/md0 mount. The TODO comment in docker-compose.yml tracks this. The fix requires setting the correct ownership on /mnt/md0 so that the standard AIRFLOW_UID=50000 user can write to it, then removing the user: "0:0" override.
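
Before changing ownership, it may be useful to see how much of the mount is not yet owned by the target user. The sketch below lists entries under a directory that are not owned by a given UID. It is a rough pre-check only: the function name list_not_owned_by is hypothetical, and ownership alone does not prove writability, since group permissions may also apply.

```shell
# Hypothetical helper: list files and directories under <root> that are not
# owned by <uid> (default 50000, the standard AIRFLOW_UID).
list_not_owned_by() {
    root="$1"
    uid="${2:-50000}"
    find "$root" ! -user "$uid" -print
}

# Usage (the mount may be large, so limit the output):
#   list_not_owned_by /mnt/md0 50000 | head
```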

numpy.ndarray size changed RuntimeWarning

Any "RuntimeWarning: numpy.ndarray size changed, may indicate binary incompatibility" message in task logs or CLI output is cosmetic and does not affect computation results. It is caused by an ABI mismatch between the installed numpy and a compiled extension in the same environment that was built against a different numpy version. It can be silenced by rebuilding the image with consistent numpy and extension versions.

Orphan .tmp files from compress_results

An interrupted compress_results run leaves .tmp partial files in the NRT_V2025 output directory. The original NetCDF file may be gone if shutil.move already ran. If the original is missing, the only recovery path is re-transferring it from Lucia's GPFS.

Cleanup command:

find /mnt/md0/NRT_V2025/out/ -name '*.tmp' -type f -delete
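
Before deleting, it can help to know which orphans still have their original file next to them. The sketch below reports this for each .tmp file. It rests on an assumption that is not confirmed here: that the partial file is named after the original with .tmp appended (for example file.nc.tmp for file.nc). The function name report_orphans is hypothetical.

```shell
# Hypothetical helper: for every .tmp file under <root>, report whether a
# sibling file with the ".tmp" suffix stripped still exists.
# Assumes the naming scheme <original>.tmp, which may not match the DAG's code.
report_orphans() {
    root="${1:-/mnt/md0/NRT_V2025/out/}"
    find "$root" -name '*.tmp' -type f | while IFS= read -r tmp; do
        original="${tmp%.tmp}"
        if [ -e "$original" ]; then
            echo "PRESENT $tmp"
        else
            echo "MISSING $tmp"
        fi
    done
}

# Usage:
#   report_orphans /mnt/md0/NRT_V2025/out/
```

Lines marked MISSING are the candidates for re-transfer from Lucia's GPFS once the orphan has been removed.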